I have the following (real world!) problem which is most easily described using an example.
I ask my six students when the best time to hold office hours would be. I give them four options, and say that I’ll hold two hours in total. The poll’s results are as follows (1 means yes, 0 means no):
$$\begin{array}{l|c|c|c|c}
\text{Name} & 9 \text{ am} & 10 \text{ am} & 11 \text{ am} & 12 \text{ pm} \\ \hline
\text{Alice} & 1 & 1 & 0 & 0 \\
\text{Bob} & 1 & 1 & 0 & 1 \\
\text{Charlotte} & 1 & 1 & 0 & 1 \\
\text{Daniel} & 0 & 1 & 1 & 1 \\
\text{Eve} & 0 & 0 & 1 & 1 \\
\text{Frank} & 0 & 0 & 1 & 0 \\ \hline
\text{Total} & 3 & 4 & 3 & 4
\end{array}$$Naively, one might pick columns 2 and 4, which have the greatest totals. However, the solution which allows the most distinct people to attend is to pick columns 2 and 3.
In practice, however, I have ~30 possible timeslots and over 100 students, and want to pick, say, five different timeslots for office hours. How do I pick the timeslots which maximise the number of distinct students who can attend?
Answer
In general, this problem is the maximum coverage problem, which is NP-hard, so you’re not going to be able to find the optimal solution by any method substantially faster than brute force. (As I mentioned, for a problem of your size, brute force is still feasible.)
However, a modification of your strategy (to choose the timeslots with the highest totals) performs reasonably well.
It doesn’t actually make sense to choose all the timeslots with the highest totals immediately: if they all have the same students in them, then this isn’t any better than just choosing one of the timeslots. Instead, we can:
- Choose one timeslot with the highest total number of students.
- Remove all students that can make it to that timeslot from consideration. Recalculate the totals for the remaining students.
- Repeat steps 1-2 until you have chosen $k=5$ timeslots.
According to the wikipedia article, this algorithm achieves an approximation ratio of $1 – \frac1e$; that is, if you can reach $N$ students with the optimal choice of timeslots, this algorithm will reach at least $(1 – \frac1e)N$ students.
There are other possible approximate solutions out there; for example, there is the big step heuristic, which considers all choices of $p$ timeslots in each step, as opposed to all individual timeslots. (When $p=1$, it is the algorithm above; when $p=k$, it is just brute force.) No fast algorithm has guaranteed performance better than $(1 – \frac1e)N$, but they may perform better in some cases.
Attribution
Source : Link , Question Author : lokodiz , Answer Author : Misha Lavrov