## Saturday, March 31, 2012

### Learning: from Correlation to Causation

A well-known fact, hopefully, is that correlation does not imply causation.

You may know that cancer is positively correlated with smoking, but that does not imply that having cancer makes you smoke. You'll assume the opposite - that smoking causes cancer - which is scientifically established.

Sometimes things are not that simple. You may for example observe that smoking and wealth correlate. It does not seem plausible that smoking causes people to become richer, so you might assume the opposite. On the other hand, it is not necessarily the case that wealth increases the likelihood of people becoming smokers, in light of the fact that smokers typically start at an early age, and people tend to become rich at a later stage in life.

The problem here is that a third variable may affect both. It may be, for example, that people born into wealthy families are more likely to smoke (for instance, because they can afford to) and, obviously, to become richer than average as they inherit wealth.

The relationships can become arbitrarily complex. There need not be a single third "hidden" variable that affects both observations; there may instead be a messy graph of this-causes-that which just happens to give rise to an observed correlation. Science is indeed difficult.

So far so good. What you may not know is that "correlation does not imply causation" is just half the story.
The other half, simply put, is that correlation is still the only thing you can observe. Nature does not give us ways of observing causality, only events, which we then correlate. For example, the following are observations that we may make:

At time 12:30:20, an apple is in my hand.
At time 12:30:25, I let go of the apple.
At time 12:30:26, the apple is on the ground.

Here is what we cannot observe:

Things fall to the ground when not kept suspended.

The point here is that causality is something we postulate based on the observed correlations. The fundamental assumption of science is that causality exists; that is why we write down equations that are supposed to predict what will happen based on past or present physical states.
To help us, we usually make reasonable assumptions, notably that events in the future cannot affect the present or past.

You may think my conclusion, which attributes causation, is an obvious one to make, so that the difference between correlation and causation is not that important. You might be surprised, then, to find out that the conclusion is in fact incorrect in general. This is easily seen on board a space station (the International Space Station, for example), where "things" stay just where they are left when not kept suspended.

So: it may be a logical fallacy to assume causality based on correlation, but there is no other way to do science, or to learn for that matter. We must assume causality based on correlation. Different hypotheses give rise to different predictions, which may then be experimentally tested. If the theoretical prediction does not correspond to the experimental measurements, the hypothesis is inconsistent with the observations and must be revised or rejected.

There are always people who look for pathological cases for which a theory does not work. That's great, because we need to eliminate theories that are wrong. But in the end, those theories would not even exist if it wasn't for people who dared to postulate a cause-effect relationship based on observations.

Finally, a good thing to know is that correlation, in the usual statistical sense (the Pearson correlation coefficient), only measures linear relationships. It does not account for more complicated relations between quantities, and indeed, it is possible that two observables are strongly related but have little correlation.
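A small numerical illustration of this point, as a sketch only: the `pearson` helper and the sample data below are made up for the example. The second data set is completely determined by x (it is simply x squared), yet its correlation with x comes out as essentially zero because the relation is not linear and the x values are symmetric around zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [x / 10 for x in range(-10, 11)]   # made-up sample, symmetric around zero
linear = [2 * x + 1 for x in xs]        # y depends linearly on x
squared = [x * x for x in xs]           # y is fully determined by x, but not linearly

r_linear = pearson(xs, linear)
r_squared = pearson(xs, squared)
print(f"linear:  r = {r_linear:.3f}")
print(f"squared: r = {r_squared:.3f}")
```

The linear case gives a coefficient of 1 (up to rounding), while the squared case gives a value indistinguishable from 0, even though knowing x tells you y exactly.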

Truth be told, the best way to discover relationships in data may very well be visual examination of computer-generated plots. It may not be a rigorous mathematical analysis, but it will certainly point you in the right direction. If you don't know what you're looking for, you're unlikely to find it.

But what if the data you are looking at is non-numeric, or has more than three dimensions? Visualization becomes harder for high-dimensional spaces or arbitrary graphs. How can we make machines help us learn from such data?

For numeric data, such as the coordinates of a cannonball travelling through the air at different times, there are many techniques from numerical optimization.

Among the less mathematical ones, evolutionary algorithms, such as genetic algorithms or estimation of distribution algorithms, are common choices. Let's say that, for simplicity, you would like to know the cannonball's height at different times. Here's what you postulate:
$h(t) = g \frac{t^2}{2} + v_0 t + h_0$
h describes the height as a function of t, the time. But we also need to tune the constants given in the equation, as we have no numeric values at all. So we have 3 constants to tune. Our examples are given in the form:
h(0) = 0
h(0.1) = 0.5
h(0.2) = 0.8
...
i.e. as observations of the ball's height at given times. This is a simple task for a genetic algorithm: given enough examples, it will typically tune the parameters to a good fit. Relating this to our discussion of correlation and causation, the examples we are given are "correlations", in the sense that we correlate time (t) with height (h). The "correlation", however, is non-linear, so relation would be a better term to use.
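Here is a minimal sketch of such a parameter-tuning genetic algorithm, not a definitive implementation. It uses only the three example observations listed above; the population size, mutation scale, search range, selection scheme and all function names are assumptions chosen for illustration.

```python
import random

# Observations (t, h) taken from the three examples in the text.
OBSERVATIONS = [(0.0, 0.0), (0.1, 0.5), (0.2, 0.8)]

def height(params, t):
    """The postulated model h(t) = g*t^2/2 + v0*t + h0."""
    g, v0, h0 = params
    return g * t * t / 2 + v0 * t + h0

def error(params):
    """Sum of absolute deviations between the model and the observations."""
    return sum(abs(height(params, t) - h) for t, h in OBSERVATIONS)

def crossover(a, b, rng):
    """Uniform crossover: each parameter is copied from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def mutate(params, rng, scale=0.5):
    """Gaussian mutation: add small random noise to every parameter."""
    return tuple(p + rng.gauss(0, scale) for p in params)

def evolve(generations=300, pop_size=60, seed=0):
    """Evolve (g, v0, h0) triples; returns the best triple and the error history."""
    rng = random.Random(seed)
    # Initial guesses drawn from an arbitrary range (an assumption).
    population = [tuple(rng.uniform(-30, 30) for _ in range(3))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        parents = population[:pop_size // 4]        # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children             # parents survive unchanged
    return min(population, key=error), history

best, history = evolve()
print("g = %.2f, v0 = %.2f, h0 = %.2f, error = %.4f" % (*best, error(best)))
```

Because the best candidates are carried over unchanged, the best error never gets worse from one generation to the next. For reference, three points determine a quadratic exactly, and for these illustrative values the exact fit is g = -20, v0 = 6, h0 = 0, so you can check how close the evolved parameters get.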

What is the causality relation that we are postulating here?
It is the equation itself: that the ball will be at height h at time t, as given by the formula.
However, we can simplify our postulate by taking the second derivative with respect to time:
$\frac{d^2}{dt^2}h(t) = g$

Now, the derivative of height (with respect to time) is what we call the velocity. The derivative of velocity, i.e. the second derivative of height, is the acceleration. So we are postulating that the acceleration is constant and given by g.

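After finding out that g is about -10 (in metres and seconds; it is negative because height is measured upwards while gravity pulls downwards), it is not too hard to interpret the other parameters by looking at the zeroth and first derivatives.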
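To spell out that last step under the postulated formula: evaluating the function and its first derivative at $t = 0$ isolates the two remaining constants.
$h(0) = h_0$
$\frac{d}{dt}h(t) = g t + v_0 \quad\Rightarrow\quad \frac{d}{dt}h(t)\Big|_{t=0} = v_0$
So $h_0$ is the height at time zero, and $v_0$ is the velocity at time zero.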

Our theory was partially given in the form of the equation. What remained to be done was only tuning the parameters. What if we don't even have the formula to begin with?

Genetic algorithms, in their basic form, cannot give us the answer, since they only tune the parameters of a formula we supply. Linear regression likewise assumes the form of the model in advance, so it cannot discover the equation either. What we need is a machine learning method that can build the equation itself. Such a method exists: genetic programming.

Give it some basic building blocks, such as squaring (or, more generally, exponentiation), multiplication, addition, division (or multiplicative inverse), and it will try out lots of different combinations, evaluating them for how well they fit the data. How well a candidate fits the data is usually described by a fitness function; one obvious choice is:
$-\sum_{t\in E} |H(t)-h(t)|$
which defines the fitness (i.e. solution quality) in terms of minimizing the error between our sought-after function H and the data measurements h(t), taken over the set E of time points at which we have examples.

Genetic programs try to find H by combining the provided building blocks (addition, division, etc.) and, in our case, by trying out different constants (this can be done in many ways, for example by combining genetic programming with a genetic algorithm), until they find our formula. If the measurements we give the system (the examples) are not numerically precise enough, we may of course get a different H, for example:
$H(t) = gt + v_0t + h_0$
where the parameters are not the correct ones either (we may for example have g = 12).
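To make the idea concrete, here is a stripped-down, mutation-only sketch of genetic programming (real systems also use crossover between expressions, and many more refinements). Everything in it is an assumption made for illustration: the tuple-based expression trees, the operator set (division is left out to avoid dividing by zero), the population settings, and the data, which extends the three example observations using the quadratic that passes exactly through them, $h(t) = -10t^2 + 6t$.

```python
import random

# Synthetic observations (t, h) from the assumed law h(t) = -10*t^2 + 6*t,
# which reproduces the three examples h(0)=0, h(0.1)=0.5, h(0.2)=0.8.
DATA = [(k / 10, -10 * (k / 10) ** 2 + 6 * (k / 10)) for k in range(11)]

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def random_tree(rng, depth):
    """Build a random expression over t, constants and the operators in OPS."""
    if depth <= 0 or rng.random() < 0.3:
        return "t" if rng.random() < 0.5 else round(rng.uniform(-10, 10), 1)
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, t):
    """Compute the value of an expression tree at time t."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, t), evaluate(right, t))
    return t if tree == "t" else tree   # the variable t, or a numeric constant

def fitness(tree):
    """Negated sum of absolute errors: higher is better, 0 is a perfect fit."""
    return -sum(abs(evaluate(tree, t) - h) for t, h in DATA)

def mutate(tree, rng, depth=3):
    """Subtree mutation: replace one randomly chosen subtree with a fresh one."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, depth)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng, depth - 1), right)
    return (op, left, mutate(right, rng, depth - 1))

def evolve(generations=100, pop_size=100, seed=1):
    """Evolve expression trees; returns the best tree and the fitness history."""
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[:pop_size // 5]      # truncation selection
        offspring = [mutate(rng.choice(survivors), rng)
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring          # survivors kept unchanged
    return max(population, key=fitness), history

best, history = evolve()
print("best expression:", best)
print("fitness:", fitness(best))
```

The search may well settle on an expression that fits the data reasonably but is not the formula we had in mind, which is exactly the situation described above. Only the best fitness is guaranteed never to decrease between generations, since the survivors are kept unchanged.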

Ok, now that you understand the principles behind genetic algorithms, which are used for numerical optimization (parameter tuning), and genetic programs, which are used for function regression, let's look at more difficult problems which have nothing to do with numbers at all.
Let's say we observe the following:

swan A is white.
swan B is white.
swan C is white.
...

Based on this, we would like to have a theory that says "all swans are white" (this is not generally correct, but given the observations, it is a justified hypothesis). How do we do this? This is a logical problem. It has to do with reasoning about objects, and not about computing numbers. Now, logic is in a sense about boolean values, so we could adapt a genetic program to help us. But this is not what genetic programming was developed for, and the mileage for such adaptations is rather short.

One approach that was inherently designed for logical problems is inductive logic programming. The examples would be input as follows:

white(swanA).
white(swanB).
...

Then, the inductive logic program would output the following answer:

white(X).

This is the logical formula "for all X, X is white", which is meant to state that all swans are white. In fact, it states that everything is white, but that is only because we didn't tell the system which objects are swans:

white(swanA). % swanA is white
swan(swanA). % swanA is a swan
...

Output:
white(X) :- swan(X).

This reads: X is white if X is a swan. Put differently: if X is a swan, it is white. Even simpler: all swans are white.

Inductive logic programming can be used on problems of arbitrary complexity (at least in theory; the computational demands may be prohibitive for today's computers):

parent(mary,jack). % mary is a parent of jack
parent(jack,mandy). % jack is a parent of mandy
grandparent(mary,mandy). % mary is a grandparent of mandy

Given this, an inductive logic program is capable of answering:

grandparent(X,Z) :- parent(X,Y), parent(Y,Z).

In words: X is a grandparent of Z if X is a parent of Y, and Y is a parent of Z.
This is the correct definition of a grandparent.
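How might a system arrive at that rule? The following is a brute-force toy, not how a real ILP system works internally: it enumerates every rule body built from at most two `parent` literals over the variables X, Y and Z, and keeps those that cover the positive example without covering any negative one. The closed-world assumption (every pair not listed as a grandparent example counts as a negative example) and all the names in the code are assumptions made for this sketch.

```python
from itertools import combinations_with_replacement, product

PARENT = {("mary", "jack"), ("jack", "mandy")}   # background facts from the text
POSITIVE = {("mary", "mandy")}                   # grandparent example from the text
PEOPLE = sorted({person for pair in PARENT for person in pair})
# Closed-world assumption (not in the text): all other pairs are negative examples.
NEGATIVE = set(product(PEOPLE, repeat=2)) - POSITIVE

VARIABLES = ("X", "Y", "Z")
LITERALS = list(product(VARIABLES, repeat=2))    # parent(A,B) for every variable pair

def covers(body, x, z):
    """True if grandparent(x, z) follows from the body for some binding of Y."""
    for y in PEOPLE:
        binding = {"X": x, "Y": y, "Z": z}
        if all((binding[a], binding[b]) in PARENT for a, b in body):
            return True
    return False

def consistent(body):
    """A body is accepted if it covers all positives and no negatives."""
    return (all(covers(body, x, z) for x, z in POSITIVE)
            and not any(covers(body, x, z) for x, z in NEGATIVE))

def learn():
    """Return all consistent bodies with at most two literals, shortest first."""
    found = []
    for size in (1, 2):
        for body in combinations_with_replacement(LITERALS, size):
            if consistent(body):
                found.append(body)
    return found

def show(body):
    """Format a rule body as a Prolog-style clause for grandparent(X,Z)."""
    literals = ", ".join(f"parent({a},{b})" for a, b in body)
    return f"grandparent(X,Z) :- {literals}."

for body in learn():
    print(show(body))
```

With these three people and the closed-world assumption, exactly one candidate survives, and it is printed as `grandparent(X,Z) :- parent(X,Y), parent(Y,Z).` Without negative examples, overly general bodies such as a single `parent(X,Y)` would be accepted as well, which is why the assumption matters here.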

Here, the correlations we observe are the examples: that mary is a parent of jack, that jack is a parent of mandy, that mary is a grandparent of mandy. Reality does not tell us what causes what.

The ILP system postulates a causal relation that is consistent with the observations:
That mary is a grandparent of mandy because mary has a child, which is the parent of mandy.
This is something that observations cannot tell us (it may seem a bit silly in this case, since the concepts of parent and grandparent are man-made, but it's not too hard to see that we could have used a physical interpretation of reality, for which the concepts are arguably not man-made).

Again, my point:
We observe correlations (no causality). These are the examples/observations.
We generalize them into a theory. The theory may describe a causal relation.
Then we test the theory using predictions.

To see the last point, let us say that the system observes two new things:
parent(adam, josh).
parent(josh, simon).

The system now makes a prediction: that adam is the grandparent of simon: