Sunday, February 17, 2013

Misconceptions you may have

If the Universe is infinite, there must be a copy of me out there.

False. An infinite universe doesn't mean that everything necessarily repeats.

The incorrect assumption here is that all repetitions must occur. If the universe is infinite but this is the only populated planet (all others contain no life, or at least no human life), then clearly there is no copy of you.

What is true is that if there are only a limited number of configurations per unit of space (say, all possible quantum states), then at least one such configuration must repeat infinitely often in an infinite universe. However, it is possible that the state of total vacuum is the one that repeats infinitely often, so that outside a finite region of the universe (our observable one), there is only empty space. The universe is infinite, but most of it is empty, and there is only one copy of you.

Whether these scenarios are plausible or not is another question, but logic and the laws of physics (as we understand them today) don't force multiple copies of you just because the universe is infinite.

If something has non-zero probability of occurring, it will eventually happen for sure.


Well, if the event has a fixed non-zero probability per trial (it doesn't change with time) and the trials are independent, then the event is indeed bound to happen eventually, with probability 1.

If, however, the event has a time-dependent probability (it changes with time, as is usually the case in real life), this is false, even if the probabilities are always non-zero. The trick is that if the time-dependent probability shrinks fast enough, the event need not ever happen. Consider the event "World War breaks out", with probability 1/2^k of breaking out in year k (it may happen more than once, but the events across years do not affect each other, for the sake of simplicity). For the first year, the probability is 1/2. In the second year, the probability drops to 1/4, then to 1/8 in the third year, and so on. The probability that World War never breaks out is then (1 - 1/2)(1 - 1/4)(1 - 1/8)··· ≈ 0.29, so the probability that World War breaks out (at least once) is only about 71%. Not for sure at all.
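Assuming independent yearly events with probability 1/2^k in year k (an illustrative choice, not a model of anything real), the infinite product of the "no event this year" factors can be checked numerically:

```python
import math

# Illustrative assumption: independent yearly events, probability 1/2**k in year k.
# The product of (1 - 1/2**k) converges to a positive number, so "never happens"
# keeps a positive probability even though every yearly probability is non-zero.
p_never = math.prod(1 - 0.5 ** k for k in range(1, 60))
print(p_never)       # ~0.289: the event may well never occur
print(1 - p_never)   # ~0.711: likely, but far from certain
```

The factors beyond k ≈ 60 are numerically indistinguishable from 1, so truncating the product there already gives the limit to machine precision.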

Everything that has a beginning must have an end.

False. The natural numbers have a beginning at 0, but, as there is no largest number, there is no end:
beginning ---> 0,1,2,3,4,5,6,.... ---> just keeps counting, no end.
Time itself may have a beginning in the big bang, but not necessarily an end (even when the universe reaches its maximal entropy state, quantum fluctuations ensure that change still occurs).

There's only one size of infinity.

False. There's an infinity of infinities. You can't count the elements of infinite sets directly, but you can pair one element from each of two sets and see whether anything is left over in either afterwards. This is analogous to determining whether there are more passengers or seats in an airplane by asking everyone to sit down, and then checking whether some people are still standing or some seats are still free. You don't count; you only compare.

In this sense the set of natural numbers is smaller than the set of real numbers. That's two different infinities. To keep going, one shows that the power set is always strictly larger than the original (infinite) set. The power set is the set of all subsets of the original set. For example, the power set of {1,2} is { {}, {1}, {2}, {1,2} }. Hence an infinite hierarchy of ever-larger sets.
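For finite sets the "strictly larger" claim is easy to see by just listing the subsets; a small sketch (function name my own) using the standard itertools recipe:

```python
from itertools import chain, combinations

def power_set(s):
    """All subsets of s: there are 2**len(s) of them, always more than len(s)."""
    items = sorted(s)
    return [set(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

print(power_set({1, 2}))  # [set(), {1}, {2}, {1, 2}]
```

Cantor's theorem is the statement that this strict growth survives even for infinite sets, where listing is no longer possible.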

Another example: when you enumerate numbers as 0, 1, 2, 3, 4, ... there is, logically speaking, nothing preventing you from continuing past all of them with the first infinite number (the ordinal ω): ω+1, ω+2, ... until you reach ω·2. By the same logic, then, you can have ω·ω, ω^ω, and so on.

Do these "higher" infinities exist in the real world? Maybe not. But in that case, the real numbers are not real at all, since they belong to a higher infinity...

See ordinal and cardinal numbers for more information.

Computers only deal with numbers. They can't know that √(π²) = π, as we do.

False. This is based on the assumption that the only way a computer can verify the equality is by first squaring π and then taking the square root of the result.

Now, π has an infinite, non-repeating decimal expansion (it is an irrational number; in fact transcendental), so it is not even possible for a computer to store all the decimals of π. This reasoning is correct; what is incorrect is the assumption that the computer is limited to such numerical calculations when determining whether the equality holds.

Just like humans, the computer can store algebraic rewrite rules such as

√(x²) = |x|,  |x| = x whenever x ≥ 0,  and  π ≥ 0,

from which it can easily deduce the equality: √(π²) = |π| = π.

In other words, the computer can do it in just the same way as humans do it. And just like the computer can't store the infinite decimal sequence of pi and square that, neither can we.
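A toy illustration of the idea (my own sketch, not how any real computer algebra system is implemented): represent expressions symbolically and apply rewrite rules structurally, never touching a single decimal of π.

```python
# Expressions as nested tuples: ("sqrt", x), ("sq", x), ("abs", x), or a symbol.
NONNEGATIVE = {"pi"}  # facts the system "knows", e.g. pi >= 0

def simplify(expr):
    """Apply algebraic rewrite rules bottom-up; purely symbolic, no decimals."""
    if not isinstance(expr, tuple):
        return expr
    op, arg = expr[0], simplify(expr[1])
    if op == "sqrt" and isinstance(arg, tuple) and arg[0] == "sq":
        return simplify(("abs", arg[1]))   # sqrt(x^2) -> |x|
    if op == "abs" and arg in NONNEGATIVE:
        return arg                         # |x| -> x when x >= 0
    return (op, arg)

print(simplify(("sqrt", ("sq", "pi"))))  # -> pi
```

The deduction terminates in two rule applications, regardless of how many decimals π "has".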

All numbers have a unique representation. Thus 0.99999... = 1 is not true.

False. When you write 0.999..., the ellipsis is intended to mean "the 9s continue forever". This is thus a compact way to write 0.9 + 0.09 + 0.009 + ... This infinite series equals 1, so 0.99999... = 1 is in fact true. (Mathematically, the sentence asserts that the partial sums converge to 1, which is true.)
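The partial sums can be checked with exact rational arithmetic (a small sketch using Python's fractions module):

```python
from fractions import Fraction

# Partial sums of 0.9 + 0.09 + 0.009 + ...: the gap to 1 is exactly 10**-n
# after n terms, which shrinks to 0 -- precisely what "0.999... = 1" asserts.
partial = Fraction(0)
for k in range(1, 11):
    partial += Fraction(9, 10 ** k)
print(1 - partial)  # -> 1/10000000000, i.e. exactly 10**-10
```

No rounding is involved: with exact rationals, the gap after n nines is literally 10^-n, never anything else.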

The number 1 can be represented in many other (decimal) ways: 1.0, 1.00, 1.000, and so on. If we only use infinite decimal expansions, every number has at most two representations. For example, the number 12.345 can be represented as 12.3450000... or as 12.3449999...

If we can compute an infinity of things in finite time, we can solve any problem.

False. Assume for instance that we have a hypercomputer blackbox that decides whether any given algorithm (one not using the hypercomputer) terminates on all inputs (it solves the Halting problem). Such a hypercomputer cannot decide whether machines equipped with its own blackbox functionality terminate on all inputs, and thus it cannot decide this particular "hyper" halting problem.
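The diagonal argument behind this, sketched in Python (hypothetical names; `halts` is the oracle assumed, for contradiction, to exist at our own level):

```python
# Suppose an oracle halts(prog, arg) existed at our own level; the function
# below then defeats it, whatever it answers.
def halts(prog, arg):
    # No such oracle can exist at this level; a hypercomputer's oracle only
    # works for machines that cannot themselves call the oracle.
    raise NotImplementedError

def trouble(prog):
    if halts(prog, prog):     # oracle claims prog(prog) halts...
        while True:           # ...so loop forever, contradicting it
            pass
    return "halted"           # oracle claims it loops: halt, contradicting it

# trouble(trouble) contradicts any answer halts(trouble, trouble) could give.
# The same trick relativizes: add the oracle as a primitive, diagonalize
# again, and a new, strictly harder halting problem appears -- forever.
```

This is why "infinite computation" buys one level of the hierarchy, not all of them.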

See arithmetical hierarchy for more information.


  1. > If something has non-zero probability of
    > occurring, it will eventually happen for sure.

    I'm interested in the variation;

    "If something has a low enough probability of occurring, it will never happen for sure."

    For example, the risk of Git's SHA-1 hashes ever colliding. Usually, when confronted with the question of what happens in the case of a collision, people start waving their hands and talking about the number of atoms in the universe.

    I'm no statistician, but I'm curious: even if the risk of collision is so close to zero, what's to say it doesn't happen twenty minutes from now?

    1. SHA-1 uses 160 bits, so the probability of finding a collision by brute force (a birthday attack) after n trials is 1 - e^(-n²/(2D)), where D = 2^160. If you perform a billion trials per second for 20 minutes, then n = 10^9 · 60 · 20 ≈ 1.2 · 10^12. The resulting probability is so close to zero that the bignum libraries I use still just report 0.0.

      From another perspective: "What is the expected number of hashes we need to generate before we get a collision?". The answer, for SHA-1, is about 2^80, i.e. more than 10^24 trials, or a trillion trillion. If you do a billion per second, it will take you tens of millions of years. Don't hold your breath...
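      The arithmetic behind both estimates can be reproduced with exact arithmetic (same assumed rate as above: a billion hashes per second):

```python
from decimal import Decimal, getcontext

getcontext().prec = 50
D = Decimal(2) ** 160                # size of SHA-1's output space
n = Decimal(10 ** 9) * 60 * 20       # a billion trials/s for 20 minutes
# Birthday bound: P(collision) = 1 - exp(-n^2/(2D)) ~= n^2/(2D) for tiny values.
p = n * n / (2 * D)
print(p)                             # ~4.9E-25: indistinguishable from 0

seconds = (Decimal(2) ** 80) / Decimal(10 ** 9)   # ~2^80 draws to expect a collision
print(seconds / (60 * 60 * 24 * 365))             # ~38 million years
```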

    2. Thanks, I appreciate the concrete examples!

      I'm probably (pun intended) suffering from probabilistic illiteracy [1], but I have a hard time seeing how the extremely low probability of collision makes it safe to assume it will never happen (not sure if Git actually makes this assumption.)

      As a layman, I think of probability as a model for how often something happens over an infinite number of experiments. But in this case I don't think I'm interested in how often, but rather in whether it happens at all.

      Assuming we have a space of experiments like this:

      1 [ . . . . . . . . . . . . . . . . . . . .] 10^24

      Is there some aspect of probability I'm missing that forces the collision towards the end instead of the beginning?



    3. For the particular case of collisions, the probability of finding one using a birthday attack actually increases over time, because we keep collecting more and more hash values in the hope of finding a collision between any two of them.

      That's the reason why a collision is far more likely to be found late rather than early.

      Thus, collision finding is NOT like rolling an (n-sided) die until you get a certain number. The former collects outcomes and watches for any two of them to match; the latter just checks whether each outcome matches a fixed number.

      In the case of (n-sided) dice, however, there is also a correct intuition: the bigger n, the less likely the event is to happen in the first 100 trials.

      Consider this: if you roll a 6-sided die and the event we're looking for is an outcome of 1 (a "collision" with probability 1 in 6), then in 1000 experiments you are as likely to find it among the first 10 as among the last 10.

      The same holds true for a 100-sided die, but now the probability of the event is 1 in 100, so in both the first 10 and the last 10 trials the probability of observing a 1 is significantly lower.

      It's as likely in the beginning as in the end, but that's not the issue; the issue is that it's less likely to happen at BOTH the beginning and the end (and indeed, in any time interval of equal size).

      What you seem to be confusing is thus the correct "the bigger n, the less likely it is to happen in the first 100 trials (because it is less likely to happen in any interval!)" with the incorrect "if it's as likely to happen at any point/fixed interval, it is likely to happen".

    4. Thank you for taking the time to answer, I really appreciate it!

      I see your point about collision finding being different from random events. I was thinking of the bastardization "generate two hashes and see if they're the same," but it doesn't really reflect the real-world hash-collision problem.

      I also follow and intuitively understand the dice sides vs. hash size comparison, that makes sense to me.

      I still can't build intuition around your last sentence, though. "It is likely to happen" is not what I'm suggesting -- the risk/chance is infinitesimal -- but it seems just as wrong to claim "it cannot happen".

      It feels like a jump from a statement about the probability distribution to a statement about specific events, and that seems suspect to me.

      Thanks for your eloquent explanations, they gave me some backing knowledge to better understand what I was worried about, even if I still don't see it clearly :-)

    5. You have to put things into perspective too - usually the dangers of a collision are far from catastrophic. For example, in hash tables, hashing is used to reach a certain bucket, but the object to find is checked for equality against all elements in the bucket (even if there's only one, as it might not be the element we're looking for). Thus a collision (= more than one element in a bucket) means we lose performance, not correctness.

      On the other hand, my previous analysis is mostly just helpful to understand collision probabilities in principle, not in practice.

      First, hash functions are never perfect. Collisions are more likely than in the theoretically safest case (a perfectly uniform hash).

      Second, hash functions are susceptible to mathematical attacks (the existence of so-called "one-way functions" hasn't even been proved, and even P ≠ NP would not settle it). So I'd worry a lot more about a "mathematical attack" than about brute force.

      Third, the time estimates I gave are unrealistic, because they ignore Moore's law: computing power doubles roughly every 18 months. They also assume quantum computers won't be around; with enough qubits, those can easily break some schemes (notably RSA public-key encryption, so HTTPS would not be safe).

      But most people don't worry about these things, because data that is protected today will have little value even 5 years from now. So it's not about "being safe forever", rather about "being safe for X days". Usually X is not that big: the secret behind the atom bomb is hardly a secret anymore, schematics for a product are less valuable once the product has been superseded, etc.

      Point being, it's not only about the risk, but about what happens if things go bad. If you have top secret information, you're not even gonna plug that computer into the internet. You might not even want to store the information digitally. For most of us, that's not the case, and traditional methods of encryption work just fine.

  2. Thanks. Yes, the consequences of the inevitable happening are most of the time quite bearable.

    But there's a tendency for engineers to design systems as if the improbable could never happen, and that annoys the pessimist in me.

    Just last week we ran into a critical failure due to a GUID collision under very specific circumstances. I never thought I would live to see that (turned out to be due to a sub-par GUID generation implementation in a third-party library.)

    Thank you for describing the probability side of all this!

    1. If the implementation is sub-par (i.e. the hash function is flawed) and the probability of a collision is high, then the problem is not the approximation "very unlikely to happen => will not happen", but that the collision wasn't very unlikely in the first place.

  3. Great! Both the post itself and the discussion with Gräsman.