Sunday, April 8, 2012

Wine Tasting and the Probability of Murphy's Law

When I was still in secondary school, I encountered a situation in which a person was at a wine competition - the kind where you are supposed to guess which wine is which, as opposed to ranking them based on taste - and failed to recognize all of the four wines.

It works like this: in front of you, you see four bottles and four glasses of wine. You don't know which glass of wine belongs to which bottle, only that they are poured from all four bottles. Your job is to drink the wine glasses and guess which wine is which, by assigning each wine to a unique bottle.

Assuming you don't know anything about wines, what is the probability that you will get all guesses wrong?
In other words: what's the worst case probability of complete failure?

I solved the problem with a friend by drawing a probability tree for conditional probabilities, and came to the conclusion that the probability is surprisingly high: 37.5% to be exact. One way to see this is as follow:
Assume the bottles have labels 1, 2, 3, 4. You now need to put them in the right order, which we may simply assume to be 1234 (of course, the wine taster doesn't know that, and in practice labels are names, not numbers).

To make things a bit simpler, consider the case where we have 3 bottles of wine:
Here's all the possible 3!=6 permutations:
123, 132, 213, 231, 312, 321.
Of these, here's the ones that are completely deranged, that is, every label has been put at the wrong place:
231, 312.
So we have 2 derangements out of 6 possible permutations, and the probability of getting them all wrong is 2/6 = 33%.

This approach works for solving our case with 4 bottles, but it does not generalize to more bottles: for every situation, we need to recompute all permutations, and then find all derangements by trying them all. Except for being extremely tedious (computationally expensive), it does not give much clarity into the problem.

For example, if our wine competition has 10 bottles, what's the probability of guessing them all wrong?
There are 10!, which is more than 3.6 million permutations, of which we need to manually check whether it's deranged. That's gonna take a while, even with a computer program.

Besides from being inefficient and uninsightful, there's a third problem with this approach: it does not tell us what happens when the number of bottles increase to large numbers. Is it almost surely the case that we are doomed to fail? Or succeed? With 3 bottles, our probability of complete failure is 33%. With 4 bottles, it is 37.5%. So you might think that it keeps increasing up to 100%, that is, almost surely complete failure.

Here's a solution to the general problem which is pretty straight forward. It does not assume you know about the inclusion/exclusion principle (which will otherwise give a short and elegant solution too); instead it produces a recursive formula which is then simplified.

Let's call the number of derangements, given k bottles, d(k). If we have only one bottle, we can't guess it wrong, so d(1) = 0. If we have two bottles, there's one way to guess it right (12) and one to guess it wrong (21), so d(2) = 1.

Now, consider the following case: for our first choice, we drink wine 1 and guess it belongs to the bottle numbered c. There are k bottles, but we must get it all wrong, so we can't assign c = 1. Hence we have k-1 choices. Now, it doesn't matter in which order we drink and guess the wines (we just need to get them all wrong), so let's say we drink glass c next.

We will now break it down into two cases, so we have the following formula so far:

$d(k) = (k-1)(simpleCase + otherCases)$

The simplest case is if we guess that glass c belongs to bottle 1, so that the two errors cancel each other out. This observation is illuminating, because it shows what derangements are all about: once we get one bottle wrong, we necessarily get another one wrong too, since we are mixing up at least two bottles. For example, if you guess that wine 1 belongs to bottle 2, then not only is that a mistake, but you have no chance of guessing that wine 2 belongs to bottle 2, since you've already picked bottle 2.
In the simplest case, you are guessing that wine 2 belongs to bottle 1, so that you swapped (mixed up) bottle 1 and 2. Now, none of the other bottles are affected by this mistake, so after having mixed up two bottles, you end up in a situation identical to what you had in the beginning, except there's now k-2 bottles left:

$d(k) = (k-1)(d(k-2) + otherCases)$

In the other cases, we did not swap two bottles. Instead, what should have been the first bottle became bottle c, and wine c is paired with any other bottle but the first one. We are thus propagating our error forward. What's interesting here is the following: we have already chosen the wrong bottle for the first wine (bottle c instead of 1). If wine c is not going to be paired with the first bottle, there's precisely one option which c cannot have: c itself (since we would not have a derangement then). c can go with any other bottle, as long as it is not c (and we already know it is not 1, since that belongs to the simple case).
But this is precisely true of every other wine: we can select any bottle for them, except the correct one. So we have the same situation as in the beginning, except with k-1 bottles left:

$d(k) = (k-1)(d(k-2) + d(k-1))$

This is a recursive formula that allows us to compute the number of derangements for any number of bottles in a much faster way that enumeration. Since d(1)=0 and d(2)=1, we can retrieve our previous result for 3 bottles in the following way:

$d(3) = 2(d(1) + d(2)) = 2$

And for my high school problem:

$d(4) = 3(d(2)+d(3)) = 9$

The probability of getting them all wrong is  $d(4)/4! = 0.375$.

This formula is nice, but still quite slow: to get the number of derangements for a certain number of bottles k, we need to compute all previous number of derangements. A better way would be to get rid of the recursion entirely, and we can do this. Rewrite

$d(k) = (k-1)(d(k-2) + d(k-1))$

into the equivalent

$d(k)-kd(k-1) = -\big ( d(k-1)-(k-1)d(k-2) \big )$

Now, if we set the left hand side to a function f(k), we see that:

$f(k) = -f(k-1)$

Since  $f(2) = d(2)-2d(1) = 1$,  we get an alternation between 1 and -1. This can be written as

$f(k) = (-1)^k$

So we now have the simplified formula:

$d(k)-kd(k-1) = (-1)^k$

This relation is a lot nicer than the previous one, although it is still recursive. To solve it, we note that the left hand side would be a telescope sum, if it wasn't for the factor k. This annoyance is easily remedied by dividing everything by k!:

$\frac{d(k)}{k!}-\frac{d(k-1)}{(k-1)!} = \frac{(-1)^k}{k!}$

We can now sum both sides from k = 2 to n (or equivalently, use a characteristic function):

$\sum_2^n \left( \frac{d(k)}{k!}-\frac{d(k-1)}{(k-1)!}\right) = \frac{d(n)}{n!}-\frac{d(1)}{1!} = \sum_2^n \frac{(-1)^k}{k!}$

Since d(1)=0 and the terms for k=0 and k=1 cancel out, we arrive at an elegant solution:
$\frac{d(n)}{n!} = \sum_0^n \frac{(-1)^k}{k!}$

This is a non-recursive formula that computes the derangements (or probability of getting them, depending on which side we put n!) efficiently. Moreover, we note that the right hand side is the partial sums of the maclaurin expansion of 1/e. So as n tends to infinity (that is, as the number of wine bottles grows to large numbers), we see that

$\frac{d(n)}{n!} \rightarrow \frac{1}{e} \approx 0.37$

In words: as the number of bottles grows, our risk of complete failure approaches 37%.

This is a good solution to the problem, but we can in fact do even better than this. The formula for d(n) is non-recursive but it still needs to sum across all k=0,...,n and compute factorials. Going back to the maclaurin expansion, we can estimate the error bound using the Lagrange remainder:

$n!\left|\frac{1}{e}-\sum_0^n \frac{(-1)^k}{k!}\right| \leq n! \frac{e^\epsilon}{(n+1)!} \leq \frac{1}{n+1} \leq 0.5$

The error between using the sum and simply taking n!/e is less than one half, so we can compute the derangements by simply rounding to the nearest integer:
$d(n) = round\left(\frac{n!}{e}\right)$
This shows that the number of derangements is always close to a third (= 1/e) of the number of permutations.

For example:

$d(4) = round(8.8\ldots) = 9$, and

$\frac{d(4)}{4!} = 9/24 = 0.375$

This is the original problem with 4 bottles: the risk of being really embarrassed is 37.5%.

If we have 10 bottles, the number of ways to get everything wrong is:

$d(10) = round(10!/e) = round(1334960.92) = 1334961$

And the probability of total embarrassment is:

$\frac{d(10)}{10!} = 1334961/10! \approx 0.368$

As expected, the probabilities are all quite close to the limit of 37%.

It would seem then, the probability of everything going completely wrong is an almost constant 37%. No wonder it has a name: Murphy's Law.