Saturday, March 31, 2012

Learning: from Correlation to Causation

A well known fact, hopefully, is that correlation does not imply causation.

You may know that cancer is positively correlated with smoking, but that correlation alone does not imply that having cancer makes you smoke. You will assume the opposite direction of causation, that smoking causes cancer, which is scientifically well established.

Sometimes things are not that simple. You may for example observe that smoking and wealth correlate. It does not seem plausible that smoking causes people to become richer; the opposite seems more likely. On the other hand, it is not necessarily the case that wealth increases the likelihood of people becoming smokers, in light of the fact that smokers typically start at an early age, while people tend to become rich at a later stage in life.

The problem here is that a third variable may affect both. It may be, for example, that people that were born into wealthy families are more likely to smoke (for instance, because they can afford to), and obviously, become richer than average as they inherit wealth.

The relationships can become arbitrarily complex. There need not be a single third "hidden" variable that affects the two observations; there may instead be a messy graph of this-causes-that which just happens to give rise to an observed correlation. Science is indeed difficult.

So far so good. What you may not know is that "correlation does not imply causation" is just half the story.
The other half, simply put, is that correlation is still the only thing you can observe. Nature does not give us any way of observing causality, only events, which we then correlate. For example, here are some observations we might make:

At time 12:30:20, an apple is in my hand.
At time 12:30:25, I let go of the apple.
At time 12:30:26, the apple is on the ground.

Here is what we cannot observe:

Things fall to the ground when not kept suspended.

The point here is that causality is something we postulate based on the observed correlations. The fundamental assumption of science is that causality exists; that is why we write down equations that are supposed to predict what will happen based on past or present physical states.
To help us, we usually make reasonable assumptions, notably that events in the future cannot affect the present or the past.

You may think my conclusion, which attributes causation, is an obvious one to make, so that the difference between correlation and causation is not that important. You might be surprised, then, to find out that the conclusion is in fact incorrect in general: this is easily seen aboard a space station (NASA's ISS, for example), where "things" stay just where they were left when not kept suspended.

So: it may be a logical fallacy to assume causality based on correlation, but there is no other way to do science, or to learn for that matter. We must assume causality based on correlation. Different hypotheses give rise to different predictions, which may then be experimentally tested. If the theoretical prediction does not correspond to the experimental measurements, the hypothesis is clearly inconsistent.

There are always people looking for pathological cases in which a theory does not work. That's great, because we need to eliminate theories that are wrong. But in the end, those theories would not even exist if it weren't for people who dared to postulate a cause-effect relationship based on observations.

Finally, a good thing to know is that correlation, in the statistical sense, is in fact restricted to linear relationships. It does not account for more complicated relations between quantities, and indeed, it is possible for two observables to be strongly related yet have little correlation.

Truth be told, the best way to spot relationships in data may very well be visual examination of computer-generated plots. It may not be a rigorous mathematical analysis, but it will certainly point you in the right direction. If you don't know what you're looking for, you're unlikely to find it.

But what if the data you are looking at is non-numeric, or has more than three dimensions? Visualization becomes harder for hyperspaces or arbitrary graphs. How can we make machines help us learn from such data?

For numeric data, such as the coordinates of a cannon ball travelling through the air at different times, there are many techniques from numerical optimization.

The less mathematical ones, evolutionary algorithms such as genetic algorithms or estimation of distribution algorithms, are common choices. Let's say that, for simplicity, you would like to have the cannon ball's height at different times. Here's what you postulate:

h(t) = h0 + v0*t - (g/2)*t^2

h describes the height as a function of t, the time, where h0 is the initial height, v0 the initial vertical velocity, and g the gravitational constant. But we also need to tune the constants given in the equation, as we have no numeric values at all. So we have 3 constants to tune. Our examples are given in the form:
 h(0) = 0
h(0.1) = 0.5
h(0.2) = 0.8
...
i.e. as observations about what height the ball is at during what times. This is a simple task for a genetic algorithm: given enough examples, it will tune the parameters correctly. Relating this to our discussion of correlation and causation, the examples we are given are "correlations", in the sense that we correlate time (t) with height (h). The "correlation", however, is non-linear, so relation would be a better term to use.
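
For concreteness, here is a minimal sketch of how an evolutionary algorithm could tune the three constants against such examples. It is a bare-bones (1+1) evolution strategy rather than a full genetic algorithm (no population, no crossover), the observations are made up for the illustration (consistent with g = 10), and none of the names come from a real library:

#include <cstdio>
#include <random>
#include <utility>
#include <vector>

struct Params { double h0, v0, g; }; // the three constants to tune

// Sum of squared errors between the postulated h(t) and the observed examples.
double error(const Params& p, const std::vector<std::pair<double,double>>& obs)
{
    double e = 0.0;
    for (const auto& o : obs) {
        const double t = o.first, h = o.second;
        const double model = p.h0 + p.v0*t - 0.5*p.g*t*t;
        e += (model - h) * (model - h);
    }
    return e;
}

int main()
{
    // Made-up observations (t, h), playing the role of the examples above.
    const std::vector<std::pair<double,double>> obs =
        { {0.0, 0.0}, {0.1, 0.45}, {0.2, 0.8}, {0.3, 1.05}, {0.4, 1.2} };

    std::mt19937 rng(42);
    std::normal_distribution<double> mutate(0.0, 0.1);

    Params best = {0.0, 1.0, 1.0};      // arbitrary starting guess
    double bestErr = error(best, obs);

    // (1+1) evolution strategy: mutate the current best, keep the mutant if it fits better.
    for (int gen = 0; gen < 100000; ++gen) {
        const Params cand = { best.h0 + mutate(rng),
                              best.v0 + mutate(rng),
                              best.g  + mutate(rng) };
        const double e = error(cand, obs);
        if (e < bestErr) { best = cand; bestErr = e; }
    }

    std::printf("h0=%.3f v0=%.3f g=%.3f (error %.5f)\n",
                best.h0, best.v0, best.g, bestErr);
}

A real genetic algorithm would keep a population and recombine candidates, but the principle is the same: propose parameters, and keep whatever fits the examples better.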

What is the causality relation that we are postulating here?
It is the equation itself: that the ball will be at height h at time t, as given by the formula.
However, we can simplify our postulate by taking the second derivative with respect to time:

h''(t) = -g

The derivative of height with respect to time is what we call the velocity, and the derivative of the velocity is the acceleration. So we are postulating that the acceleration is constant and given by g.

After finding out that g is about 10, it is not too hard to interpret the other parameters when looking at the zeroth and first derivative.

Our theory was partially given in the form of the equation. What remained to be done was only tuning the parameters. What if we don't even have the formula to begin with?

Genetic algorithms, in their basic form, cannot give us the answer. Linear regression methods assume linearity, so they will fail too. What we need is a machine learning method that can build the equation itself. Such a method exists: genetic programming.

Give it some basic building blocks, such as squaring (or, more generally, exponentiation), multiplication, addition, and division (or multiplicative inverse), and it will try out lots of different combinations, evaluating them for how well they fit the data. Fitting the data is usually described by a fitness function; one obvious choice is the squared error:

fitness(H) = (H(t1) - h(t1))^2 + (H(t2) - h(t2))^2 + ...

which defines the fitness (i.e. solution quality) in terms of minimizing the error between our sought-after function H and the data measurements h(t).
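
To make the fitness function concrete, here is a minimal sketch (my own illustration, not code from any particular GP system) of how a candidate H, however the system represents it internally, could be scored against the measurements:

#include <functional>
#include <utility>
#include <vector>

// Score a candidate function H against the measured (t, h) pairs.
// Lower is better: a perfect candidate has an error of 0.
double fitness(const std::function<double(double)>& H,
               const std::vector<std::pair<double,double>>& measurements)
{
    double err = 0.0;
    for (const auto& m : measurements) {
        const double diff = H(m.first) - m.second;
        err += diff * diff; // squared error at this data point
    }
    return err;
}

A genetic program would evaluate something like this for every candidate expression it builds, and prefer candidates with lower error.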

Genetic programs try to find H by combining the provided building blocks (addition, division, etc.) and, in our case, by trying out different constants (this can be done in many ways, for example by combining genetic programming with a genetic algorithm), until the formula is found. If the measurements we give the system (the examples) are not numerically precise enough, we may of course end up with a somewhat different H, whose parameters are not quite the correct ones either (we may for example get g = 12).

Ok, now that you understand the principles behind genetic algorithms, which are used for numerical optimization (parameter tuning), and genetic programs, which are used for function regression, let's look at more difficult problems which have nothing to do with numbers at all.
Let's say we observe the following:

swan A is white.
swan B is white.
swan C is white.
...

Based on this, we would like to have a theory that says "all swans are white" (this is not generally correct, but given the observations, it is a justified hypothesis). How do we do this? This is a logical problem. It has to do with reasoning about objects, and not about computing numbers. Now, logic is in a sense about boolean values, so we could adapt a genetic program to help us. But this is not what genetic programming was developed for, and the mileage for such adaptations is rather short.

One system that was inherently designed for logical problems is inductive logic programming. The examples would be input as follows:

white(swanA).
white(swanB).
...

Then, the inductive logic program would output the following answer:

white(X).

This is the logical formula "for all X, X is white", which states that all swans are white. In fact, it states that everything is white, but that is only because we didn't tell the system that the objects in question are swans:

white(swanA). % swanA is white
swan(swanA). % swanA is a swan
...

Output:
white(X) :- swan(X).

This reads: X is white if X is a swan. Put differently: if X is a swan, it is white. Even simpler: all swans are white.

Inductive logic programming can be used on problems of arbitrary complexity (at least in theory; in practice, the computational demands may be prohibitive for today's computers):

parent(mary,jack). % mary is a parent of jack
parent(jack,mandy). % jack is a parent of mandy
grandparent(mary,mandy). % mary is a grandparent of mandy

Given this, an inductive logic program is capable of answering:

grandparent(X,Z) :- parent(X,Y), parent(Y,Z).

In words: X is a parent of Z if X is a parent of Y, and Y is a parent of Z.
This is the correct definition of a grandparent.

Here, the correlations we observe are the examples: that mary is a parent of jack, that jack is a parent of mandy, that mary is a grandparent of mandy. Reality does not tell us what causes what.

The ILP system postulates a causal relation that is consistent with the observations:
That mary is a grandparent of mandy because mary has a child, which is the parent of mandy.
This is something that the observations cannot tell us (it may seem a bit silly in this case, since the concepts of parent and grandparent are man-made, but it's not too hard to see that we could have used a physical interpretation of reality, for which the concepts are arguably not man-made).

Again, my point:
We observe correlations (no causality). These are the examples/observations.
We generalize them into a theory. The theory may describe a causal relation.
We test the theory using predictions.

To see the last point, let us say that the system observes two new things:
parent(adam, josh).
parent(josh, simon).

The system now makes a prediction: that adam is the grandparent of simon:

grandparent(adam,simon).

If this turns out to be true, it confirms the theory. If it turns out to be false, the theory will need to be rejected (or revised).

In the language of inductive logic programming:
Real world observations are ground facts.
Induced theories can be any logical formula (ILP induction).
Predictions are made using SLD-inference (deduction).
Tests are made using SLD-refutation.

Friday, March 30, 2012

Small String Optimization and Move Operations

EDIT: A lot of readers seem more interested in my implementation of the String class than the purpose of this post, which is to show a somewhat remarkable fact about how small string optimization interacts with move operations. Please don't post more "This implementation is not the most efficient". You are missing the point. The example implementation of my string class here is neither complete, efficient, nor representative of STL implementations. It is only used to illustrate the idea behind small string optimization. I don't even know if the code actually compiles.

Typically, std::string is implemented using small string optimization, a technique by which small strings are stored directly inside the object, rather than dynamically allocated on the heap.

Before I start discussing the issue with small string optimization, let me give you a quick introduction to how it works. Let's say we want to implement a very simplified string class ourselves. In each String object, we simply keep a pointer to the beginning of the char array together with its size (std::string also needs to keep track of an allocator, as you are allowed to specify your own memory management).


class String {
public:
  String(const char* p);
  String(const String& s);

  String(String&& s);
  ~String();
  unsigned length() const { return _size; }

  char* begin() { return _beg; }
  char* end() { return begin() + _size; }
  // ...
private:
  char* _beg; // first character in array
  unsigned _size; // excluding terminating null char
};


We can implement these constructors in a straightforward manner, using low level programming for efficiency:


inline String::String(const char* p) : _size(strlen(p))
{
    _beg = new char[_size+1];
    memcpy(_beg,p,_size+1);
}

inline String::String(const String& s) : _size(s.length())
{
    _beg = new char[_size+1];
    memcpy(_beg,s._beg,_size+1);
}

inline String::String(String&& s) : _size(s.length())
{
   _beg = s._beg;
   s._beg = nullptr; // zero out
}

inline String::~String() { delete [] _beg; }



(The implementation will be slightly faster if you replace new/delete with malloc/free, as there is no need to call any constructors/destructors for char arrays. In theory, compilers can optimize this away; in practice, they don't.) See this post for a discussion on this issue (unrelated to small string optimization and move operations).

So far so good, but now you realize that you can do something clever: since we need a char* pointer in every string anyway, why not use that space to store small strings of length up to sizeof(char*)? That way, we won't need to call new/delete for small strings. In fact, since strings are quite short in most applications, why not store a little more while we're at it?

This is known as small string optimization. One way we can efficiently and elegantly implement this is by using tagged unions. Let's try it out:


class SString {
public:
  SString(const char* p);
  SString(const SString& s);

  SString(SString&& s);
  ~SString();
  unsigned length() const { return _size; }

  char* begin() { return _size < 16 ? str : _beg; }
  char* end() { return begin() + _size; }
  // ...
private:
  unsigned _size;
  union {
    char* _beg;
    char str[16];
  };
};


Note how SString::begin() was defined to either return _beg or str, depending on the size of the string.


The constructor for const char* is not that hard:


inline SString::SString(const char* p) : _size(strlen(p))
{
    if (_size < 16) {
      // Small string optimization: store in the in-object array str
      memcpy(str,p,_size+1);
    } else {
      // Dynamically allocate string (like before)
      _beg = new char[_size+1];
      memcpy(_beg,p,_size+1);
    }
}


The copy constructor is pretty much the same:


inline SString::SString(const SString& s) : _size(s.length())
{
    if (_size < 16) {
      // Small string optimization: store in the in-object array str
      memcpy(str,s.str,_size+1);
    } else {
      // Dynamically allocate string (like before)
      _beg = new char[_size+1];
      memcpy(_beg,s._beg,_size+1);
    }
}

Now comes the interesting part, where you'll see the trade-off of using small string optimization. Here is the move constructor:


inline SString::SString(SString&& s) : _size(s.length())
{
    if (_size < 16) {
       // ???
    } else {
       _beg = s._beg; // steal resource from s
       s._beg = nullptr; // prevent s from deleting on destruction
    }
}

So when we have a dynamically allocated array, we can steal the resource. When we've used the small string optimization, on the other hand, there is no way to steal the resource, as it is stored inside the object itself. We must hence copy it:


if (_size < 16) {
      memcpy(str,s.str,_size+1); // no pointer to steal: copy the in-object array
} else {
      _beg = s._beg;    // steal resource from s
      s._beg = nullptr; // prevent s from deleting on destruction
}

What may be surprising to a user of the string class is that this performance degradation occurs only for small strings.
The effect can be seen on MSVC 2010, which uses a small string optimization of 16 characters for std::string. We take a command line string and move it between two string objects 10 million times.
Here's the result: with strings of 16 and 17 characters (including the terminating null character), the execution times in seconds are:

Length      Time
16             0.171
17             0.06

In other words, the small string optimization, which is now the small string deterioration, is about 3 times slower than copying pointers. This leads to the surprising conclusion that when moving strings, it is actually faster to move larger strings.
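
The measurement can be reproduced with a loop along the following lines (a C++11 sketch; the timing harness and literals are mine, and the exact numbers will of course vary with machine and standard library):

#include <chrono>
#include <iostream>
#include <string>
#include <utility>

int main(int argc, char* argv[])
{
    // Take the string from the command line so the compiler cannot fold it away.
    std::string a = argc > 1 ? argv[1] : "123456789012345"; // 15 chars + '\0' = 16
    std::string b;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000000; ++i) {
        b = std::move(a);   // move the string back and forth
        a = std::move(b);   // between the two objects
    }
    const auto stop = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration<double>(stop - start).count()
              << " s for length " << a.size() << "\n";
}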

Note that most of the time, you don't just move around strings, but allocate them, deallocate them, copy them, etc. In all those cases, of course, small string optimization is very useful.

Wednesday, March 28, 2012

Know your containers: a danger with std::vector

Believe it or not, the following code contains a bug:

int f(int); // function f defined elsewhere

vector<int> vec(1,1); // start with singleton value 1
// Call f on each element. If result is positive, add it to vector.
for (auto i = vec.begin(); i != vec.end(); ++i) {
   const int res = f(*i);
   if (res > 0) {
      vec.push_back(res);
   }
}

Can you spot the bug? (And no, it's not a trick question. f is defined elsewhere; that's not the problem. Neither is it the fact that the algorithm may not terminate; we may for example assume that f will only return positive values a finite number of times.)

The problem is this: as we are traversing the vector, we are sometimes adding new elements, which we then also want to apply f to when the time comes. However, adding new elements to a vector may invalidate its iterators. This is because when a vector's capacity is reached, it needs to grow, which means reallocating all elements to another place in memory, which necessarily invalidates all references and iterators.

One way to fix the problem is to use indices instead of iterators:

for (decltype(vec.size()) i = 0; i != vec.size(); ++i) { // using indices
   const int res = f( vec[i] ); // ok, dereferencing using index
   if (res > 0) vec.push_back(res); // push_back may reallocate, but no iterator is held
}


Another way to fix the problem, if you know an upper bound on the number of elements that are going to be added, is to call vector::reserve first. As long as no reallocations take place, you are safe in using iterators when inserting.
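
For example, a sketch continuing the loop above, where max_new_elements is a hypothetical upper bound known from elsewhere:

vec.reserve(vec.size() + max_new_elements); // hypothetical known upper bound
// No reallocation can happen below, so iterators remain valid.
for (auto i = vec.begin(); i != vec.end(); ++i) {
   const int res = f(*i);
   if (res > 0) {
      vec.push_back(res); // capacity suffices: i is not invalidated
   }
}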

Or, you could use std::list, for which no insertion invalidates the iterators:
list<int> vec(1,1);
// rest of the code looks the same as above:

for (auto i = vec.begin(); i != vec.end(); ++i) {
   const int res = f(*i);
   // push_back will not invalidate iterator i here, since i iterates a list
   if (res > 0) vec.push_back(res);
}

Bottom line:
Don't add elements while traversing a vector. Usually, this indicates a design flaw (or possibly a bug) in your code to begin with. If you really have to, reserve enough space so that the vector will not be reallocated (this assumes you have a reasonable upper bound to begin with). Indices could work, although they don't work with STL algorithms. Otherwise, just switch to list (or forward_list).

Tuesday, March 27, 2012

C versus C++ performance: printing and string conversions

A widely held belief is that C++ code is inherently slower than C code. Although this is true sometimes, it is usually largely a myth. In fact, there are situations in which the opposite holds true: C++ is faster than C.

Ironically, C++ is faster in one of the instances where people typically expect the opposite: printing with C++'s iostream library versus C's stdio library.

Using Microsoft Visual C++ 2010, I tested printing "Hello World!" a million times to a file using ofstream and fprintf. Here are the results:

C (s)      C++ (s)    Comparison
0.381      0.240      C++ is 60% faster than C
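
Roughly, the test looks like the following (a simplified sketch, not the exact benchmark code; file names and loop structure are illustrative):

#include <chrono>
#include <cstdio>
#include <fstream>

int main()
{
    using clock = std::chrono::steady_clock;
    const int N = 1000000;

    // C-style: fprintf to a file.
    std::FILE* fp = std::fopen("out_c.txt", "w");
    const auto c0 = clock::now();
    for (int i = 0; i < N; ++i)
        std::fprintf(fp, "Hello World!\n");
    const auto c1 = clock::now();
    std::fclose(fp);

    // C++-style: ofstream to a file.
    std::ofstream os("out_cpp.txt");
    const auto x0 = clock::now();
    for (int i = 0; i < N; ++i)
        os << "Hello World!\n";
    const auto x1 = clock::now();

    std::printf("C:   %.3f s\nC++: %.3f s\n",
                std::chrono::duration<double>(c1 - c0).count(),
                std::chrono::duration<double>(x1 - x0).count());
}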

To be fair, this myth probably stems from a very related fact: if we try the same experiment printing random integers, we get the following:
C (s)      C++ (s)    Comparison
0.307      0.929      C is 3 times faster than C++

If you've done conversions between strings and numeric types prior to C++ 11, you've probably suspected that these conversions are not very efficient:

string to_string(int i)
{
  stringstream ss;
  ss << i;
  return ss.str();
}

However, there's no reason why iostream can't make full use of C's conversion routines for better efficiency. To see that conversions are really the problem here, let's redo the experiment by first converting the integers into strings using std::to_string, and then printing out the strings with ofstream:

C (s)      C++ (s)    Comparison
0.306      0.606      C is 2 times faster than C++

Yes, this actually means that:
cout << to_string(42);
is faster than:
cout << 42;

This is obviously a design flaw in Microsoft's C++ implementation (which is otherwise an excellent C++ compiler), as we could get a better implementation by simply changing the library:

ostream& operator<<(ostream& os, int i)
{
  // Replace current code with:
  os << std::to_string(i); // use the std::string overload
  return os;
}

In fact, converting to a std::string and then printing has some overhead of its own (probably very little, thanks to the small string optimization), but we can get closer to C's conversion performance by using sprintf to convert the integer into a C string buffer (it's easy to calculate the maximal length of the array using log, sizeof and adding 2 more chars, one for the potential minus sign and one for the terminating null).
With this, we get down from 0.606 to 0.5 seconds. There's still a gap of about 64% in performance, which remains unexplained. What it does mean, however, is that Microsoft's C++ library did not prioritize efficient conversions from numeric types to strings, something that happens quite a lot in practice.
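
A sketch of that approach (illustrative only, not the actual library code; I use the simpler bound of at most 3 decimal digits per byte rather than the log-based one):

#include <cstdio>
#include <ostream>

// Illustrative helper: convert via a stack buffer, then print as a C string.
// Buffer bound: at most 3 decimal digits per byte of the int, plus one char
// for a possible minus sign and one for the terminating null.
std::ostream& print_int(std::ostream& os, int i)
{
    char buf[3 * sizeof(int) + 2];
    std::sprintf(buf, "%d", i);
    return os << buf;
}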

Conclusion:
C++ prints faster than C.
C converts numeric types to strings faster than C++.
There's no reason why C++ should convert slower than C, as it has full access to C's underlying implementation of the stdio library (it's included in C++ too, after all).

Is reality a computer simulation?

I just started watching the so far excellent documentary series Through the Wormhole.
The first episode talks about whether there's a creator, and what such a creator may be like. One possibility presented is that we live in a computer simulation, that is, our entire universe is a (gigantic) computer simulation, and thus we are digital beings unknowingly "trapped" inside this simulation (think of The Matrix).

This actually presupposes a lot more than you may think. First, it assumes the laws of nature can be simulated on a computer. This is impossible if, for example, there is such a thing as a process that requires infinite computation. Essentially, it constrains our universe to a finite one: particles must be only finitely divisible, time must be only finitely divisible, and the universe must extend only finitely far in all directions.

The first two are not controversial aspects of reality: already the ancients postulated the existence of indivisible matter, and they called it the atom (it turns out that atoms are in fact divisible, but quarks or superstrings may not be; the misnomer is, however, our mistake, not the ancients').

That the universe is finite may sit uncomfortably with an intuitive sense that "if I just keep going in one direction with my space ship, I should not expect to hit a wall somewhere". Indeed, that would be very strange. But it is possible that the universe is closed, in the sense that if you travel in one direction for long enough, you'll end up at the same place you began (and no, this is not inconsistent with the measured flatness of the universe: the flatness is based on the notion of parallel lines; read about it here).

None of these postulates is controversial: it is widely believed that matter is finitely divisible, and, as for the universe, the evidence with respect to closedness is so far inconclusive (when looking out into the void, astronomers search for repeating patterns, as if we had "gone around the universe one full turn", but due to the vast size of the universe, this is easier said than done).

Now that we know that nothing in our current understanding of physics excludes the possibility of a computer simulation, let's look at how likely it is. The claim made by one of the speakers is that quantum physics bears a resemblance to the way a computer simulation works: from far away, it all looks good, but as you zoom in, things start getting blurry. If you've ever tried to zoom in on an image on your computer, you've probably seen it: at some point, the details start breaking down, and you start to see how the pixels are blurred out. Quantum physics surely looks that way: as you start zooming in, particles start looking less like matter and more like waves, i.e. blurry.

I think this may be misrepresenting quantum physics. The double slit experiment does not only imply that "things blur out when nobody is looking"; it also implies the existence of "virtual particles" that, at the other end of the slit, interfere with each other. More precisely, the result of the experiment relies heavily on the fact that, when you are not looking at which slit the particles go through, they seem to go through both, and interfere with each other to create a pattern that would be impossible to get if the particles only went through one slit at a time. So in order to get the final pattern we see on the wall, our laws of nature have to compute not only what happens to the particle, but also what happens to the virtual particle it interferes with. This makes the calculation harder, not easier, which is in sharp contrast to a computer simulation, where the blur is an omission of computation for the sake of efficiency. Now, it's not too hard to change the experiment so that you would have more than one virtual particle, in which case you would get interference between all virtual particles, implying a vastly greater computational effort.

Moreover, there's in fact no reason to believe that quantum superposition "collapses" at some point, only that it decoheres into a macroscopic state where you can't measure it anymore. Dropping the unnecessary assumption of wave collapse, and drawing the logic to its end, you'll arrive at the many-worlds interpretation, which briefly says that there are multiple universes, which can all interfere with ours when their quantum states resemble each other. It doesn't take much to see that having multiple universes makes data storage and computation vastly more demanding for a classical computer. Moreover, as entropy increases over time, the number of possible states increases exponentially, meaning that any computer would run out of memory (rather quickly too). It is definitely not the way you would simulate a universe as cheaply as possible.

To me, it seems like quantum physics requires more, not less, computation. That's also the consensus among quantum information scientists, as it takes exponential time to simulate a quantum computer on a classical computer.

Is it likely that we live in a quantum simulated reality? If the simulation is performed on a classical computer, the probability is, well, probably exponentially decreasing with time. In fact, if there is an infinity of universes in our multiverse, that would both theoretically and practically exclude the possibility of simulation on any digital or quantum computer with a finite number of bits/qubits.

We could of course be simulated by a quantum computer that stores all the quantum states of our universe in memory, but my point is that the majority of arguments in favor of a digitally simulated reality do not hold, because quantum physics makes it less likely, not more. It should also be noted that arguing for a quantum simulated reality is pretty much the same as asking: "what if all this reality is just an illusion, made to look like all this reality?" That is obviously not falsifiable, now or ever, because it presupposes that everything is the way it looks, while at the same time suggesting that it may not be real. As Morpheus would say: what is real?

Sunday, March 25, 2012

Sharing some ideas

While doing a PhD, there are a lot of ideas coming and going (not always related to my research!).
I thought I'd give it a try and share them here; it will mainly be about computer science, logic, and philosophy.