Wednesday, March 26, 2014

std::vector with constructors that throw exceptions

When using std::vector, reallocating may happen for several reasons:

class A { ... };
std::vector<A> v;
v.push_back(A()); // may reallocate
v.resize(100); // may reallocate

When reallocation happens, all elements of the vector need to be copied or moved to a new memory location. There two things to wonder about then: will std::vector reallocate by copying, or moving the objects? And what if one of A's constructors (A(A&&)) throws and the other (A(const A&)) doesn't throw?

The table below shows the situation, as guaranteed by the C++ standard. (This article assumes that A's destructor doesn't throw any exceptions.)

Constructors No copy Copy throws Copy no throw
No move - Throws (copies) No throw (copies)
Move throws Throws (moves) Throws (moves) No throw (copies)
Move no throw No throw (moves) No throw (moves) No throw (moves)

In other words: if one of copy/move is not available, std::vector picks the other one (if none are available, you can't use std::vector). If both are available and one throws but not the other, it pick the one that doesn't throw. If both throw, it uses move (since we can't do better anyway). If none throw, it uses move.

The most non-obvious part is this: when std::vector has to chose between the slower copy constructor that does not throw, and the faster move constructor that may throw, it takes the slower, but safer, copy constructor.

What can I assume when an exception is thrown (from copy/move constructor)?

What are you allowed to assume about the state of the objects inside the vector after an exception is thrown? With respect to exceptions thrown by the copy and move constructors, you only get the basic exception guarantee: no memory leaks have occurred, and all objects (A) are in a well defined state. They may not have the same value as prior to the exception, but you can at least reassign them, e.g. v.back() = A();.

How does the compiler know whether a copy/move constructor throws?

It doesn't usually. If it can prove that it won't, it will take advantage of that to offer the best possible alternative (that's what the table above shows). If it can't prove, it will assume the worst: that the copy/move constructor may throw (in the table above). You can "force" it to assume it won't throw by using noexcept:

class A {
     A(A&&) noexcept; // force compiler to always move

Formally, the compiler encodes information about "provably not throwable" in std::is_nothrow_move_constructible (which, if pedantic, should perhaps be called is_proven_nothrow_move_constructible). The logic of the table above is then encoded in the conditional move operator std::move_if_noexcept. Indeed, this means different compilers may produce different results in terms of moving vs copying (but if a compiler moves, it's always because that was provably the better thing to do).

Do compilers actually do this correctly?

Microsoft Visual Studio 2013 with CTP 1 does not. Here's code to test it with your compiler:

struct A
 A() = default;
 A(A&&) { throw 1; }
 A(const A&) = default;

int main()
 std::vector<A> v;

 try {
 } catch (int) {
  std::cerr << "BUG: should have used copy constructor instead\n";

With MSVS 2013 CTP 1, this prints "BUG: should have used copy constructor instead". Using noexcept does not help either.

Is there a way to do avoid these issues?

Yes, only use std::vector when at least one of the copy/move constructor does not throw (and use noexcept to make sure the compiler sees that). If both constructors throw, use std::list. Since it never reallocates, you will never have these issues. std::deque can also be used if you only push_front and push_back elements, as it doesn't reallocate then.

What about exceptions thrown by other member functions (not copy/move constructor)?

(Off topic, only here to clarify the difference between the basic exception guarantee above and the more commonly known strong exception guarantee). Then you get the strong exception guarantee, which basically says that everything has been rollbacked to the state just prior to the thrown exception. All objects have the values you'd expect. This is just like when performing database transactions.

Wednesday, March 19, 2014

Inductive Reasoning Visualized

Ever wondered what inductive reasoning (technically, using inductive logic programming) would look like if you could draw a picture?
Here's how:

I'll explain what you're looking at. Let us say you know the following facts:
Gaia is a parent of Cronus.
Cronus is a parent of Zeus.
Zeus is a parent of Athena.

You also know that:
Gaia is a grandparent of Zeus.
Zeus is a grandparent of Athena.
Gaia is not a grandparent of herself.
Gaia is not a grandparent of Cronus.
Cronus is not a grandparent of Gaia.
Athena is not a grandparent of Cronus.

Now you ask the computer to induce a definition of grandparenthood based on these facts.

To do this, the machine needs to try different possibilities, and these are what you see in the graph.

On top, you see:
Now the sentence "X is a grandparent of Y" is what the machine writes as "grandparent(X,Y)", and an underscore simply means "anybody". So this is the hypothesis that "anybody is a grandparent of anybody".
The machine knows this to be false because of what you see in the red square: 4. Four is the number of "problems" found with this hypothesis: namely, it predicts that "Gaia is a grandparent of herself", which we know to be false. It predicts every instance of "is not a grandparent of" above, and there are 4 of them. Thus this hypothesis is not consistent with observations (its predictions are falsified).

Next we have
grandparent(A,_) :- parent(A,_)
which in English reads "A is a grandparent of anyone if A is a parent of anyone". As you can see, the red box says 3, because it has 3 problems: it predicts that Gaia is a grandparent of herself, since Gaia is a parent (of whom does not matter, but it happens to be Zeus), which we know to be false. For the same reason, it predicts that Gaia is a grandparent of Zeus, which is also false. Finally, it predicts that Cronus is a grandparent of Gaia, since Cronus is a parent (again, of whom does not matter). The negative example "Athena is not a grandparent of Cronus" is not (incorrectly) predicted to be true, since Athena is not a parent.

This is the basic idea in Inductive Logic Programming: we construct lots of hypotheses and test them against the examples we have. There are two solutions that look promising (green boxes):
grandparent(A,B) :- parent(A,C), parent(C,B)
which states that A is a grandparent of B if A is a parent of some C and that C is a parent of B. This is indeed the correct definition, and it does cover both positive examples ("Gaia is a grandparent of Zeus", "Zeus is a grandparent of Athena"), and does not cover any of the 4 negative examples.

The other promising solution is
grandparent(A,B) :- parent(A,C), parent(C,B), parent(B,_)
which gets the grandparent relation slightly wrong: on top of requiring the correct conditions (A is a parent of some C, which is a parent of B), it also requires the grandchild B to be a parent. So according to this definition, Athena is not the grandchild of Gaia because she does not (yet) have any children, but when she does, she'll satisfy the conditions. The machine knows this definition is worse than the right one because it cannot explain the fact that Zeus is a grandparent of Athena. Hence it only explains one of the two facts (that's the 1 in the green gox).

I'll leave it as an exercise to interpret all the other hypotheses in the graph.

The picture was produced using my ILP system Atom.

Wednesday, February 26, 2014

How similar are two words? Or two formulas?

The problem

Consider the word "John". How similar is it to "Jon", "Johan", and "on"?

This of course depends on precisely what we mean by similar. But we have an intuition that "John" and "Jon" should be more similar than say "John" and "Table". In particular, "similarity" should have something to do with the occurrence of same characters, as well the order in which they occur. But can we make this idea precise?

The Solution

Yes, we can make this idea precise. The key is to consider how many edits would be necessary to transform one word into another:
To go from "John" to "Jon", simply remove the "h". (Or conversely, start with "Jon" and add an "h".)
To go from "John" to "Johan", add an "a". (Or start with "Johan" and remove "a".)
To go from "John" to "on", remove the "J" and the "h". (Or start with "on" and add "J" and "h".)

Thus it requires one edit between "John" and "Jon", and likewise for "John" to "Johan", but two edits between "John" and "on". Clearly the fewer edits the more similar to words are. In particular, a word has zero edit distance to itself, and only to itself, so it is most similar to itself.

Here we have assumed that an edit is a removal or addition. Given two words of length N and M, in the worst case we have to remove all N letters and then add all M, giving an edit distance of N+M. This happens precisely when all letters are distinct, so nothing can be reused: "John" and "Strawberry" has edit distance 4+10=14 since they share no letters in common.

We could also consider other operations, such as swapping a letter for another. Thus the edit distance from "John" to "Joan" is one edit (swap "h" for "a") instead of two (remove "h", add "a"). Then the maximum edit distance between two words of length N and M is max(N,M), since we can always swap the first min(N,M) letters and then add the remaining max(N,M)-min(N,M). For example, from "John" to "Strawberry", start by swapping "J" to "S", "o" to "t", "h" to "r", and "n" to "a", giving "John" -> "Stra", then add the remainder of the word, "wberry", for a total of 10 edits.

When the edits are specifically addition, removal, and swapping of characters, it is known as the Levenshtein distance. For the remainder of this post, this is implicitly what we'll mean by "edit distance".

Now we know the minimium (0) and maximum (max(N,M)) edit distances. But how do we compute the distance in general? From a logical point of view, we could "try everything":

Let \( s_i \) be the substring obtaining by keeping the first i characters of a string s and cutting off the rest. For example, if \( s \) = "John", then \( s_0 \) = "" and \( s_2 \) = "Jo" . Note that \( s = s_4 \).

The edit distance between two strings \( s \) and \( t \) of length N and M, respectively, can be defined using two base cases, namely when the strings run out:
\[ d(s_0, t_j) = j \]
\[ d(s_i, t_0) = i \]
and a recursive case when \( i > 0 \) and \( j > 0 \):
\[ d(s_i, t_j) = \min \{ d(s_{i-1},t_j) + 1, d(s_i,t_{j-1}) + 1, d(s_{i-1},t_{j-1}) + [s[i] \neq t[j]]  \} \]
where \( [s[i] \neq t[j]] \) is 1 when character at position i in s is the same as character at position j in t, and 0 otherwise. The computation is started using \( d(s,t) = d(s_N, t_M) \).

The edit distance is a metric (as long as all edit costs are positive real numbers), so they do define a "distance" in a fairly intuitive sense of the word. In particular, the triangle inequality holds.

Efficient Implementation

The definition above is triply recursive, and hence direct implementation has exponential complexity.

A much more efficient implementation can be obtained by observing that for the recursive case, \( d(s_i, t_j) \) only needs to know the values of \( d(s_{i-1},t_j) \), \(  d(s_i,t_{j-1}) \), and \( d(s_{i-1},t_{j-1}) \).
Thus, if we start by computing all possible base cases \( d(s_0, t_j) = j \) for all \( 0 \leq j \leq M \) and \( d(s_i, t_M) = i \) for all \( 0 \leq i \leq N \), we can then "unfold" the recursive values all the way back until we get to \( d(s_0, t_0) \).

For example, consider the case s = "John" and t = "on".

By using base cases (or trivial observation), d("","on") = 2, d("","o") = 1, and d("","") = 0. Likewise, d("John","") = 4, d("Joh","") = 3, d("Jo","") = 2, d("J","") = 1.

Then, using the recursive formula:
  1. d("J","o") = 1+d("","") = 1
  2. d("Jo",o") = min{ d("Jo","")+1, d("J","o")+1, d("J","")+0 } = min { 3, 2, 1 } = 1
  3. d("Joh","o") = min { d("Joh","")+1, d("Jo","o")+1, d("Jo","")+1 } = min { 4, 2, 3 } = 2
  4. d("John","o") = min { d("John","")+1, d("Joh","o")+1, d("Joh","")+1 } = min { 4, 3, 4 } = 3
And using this, we can compute:
  1. d("J","on") = min { d("J","o")+1, d("","on")+1, d("","o")+1 } = min { 2, 3, 2 } = 2
  2. d("Jo","on") = min { d("Jo","o")+1, d("J","on")+1, d("J","o")+1 } = min { 2, 2, 2 } = 2
  3. d("Joh","on") = min { d("Joh","o")+1, d("Jo","on")+1, d("Jo","o")+1 } = min { 3, 3, 2 } = 2
  4. d("John","on") = min { d("John","o")+1, d("Joh","on")+1, d("Joh","o")+0 } = min { 4, 3, 2 } = 2
Thus d("John","on") = 2, and if we trace back a path, d("John","on") = d("Joh","o") = d("Jo","o")+1 = d("J","")+1 = 1+1 = 2.  This corresponds to adding the "J" and "h" to "on".

We can visualize this algorithm as constructing a matrix row by row, thus the time complexity is \( O(NM) \). Only the previous row is needed in memory at all times, therefore having memory usage \( O(min(N,M)) \). However, if we wish to obtain not only the distance but also the operations needed to go from one string to another, we need to store the entire matrix, that is,  \( O(NM) \). Improvements to this algorithm also exist.

Generalization to formulas (trees)

Similarity between strings/sequences can be generalized to trees (and forests) by again considering edit distances. One can then add, remove, and swap nodes instead. Surprisingly, an algorithm with time complexity \( O(N^3) \) is known, which also reduces to \( O(N^2) \) for strings. Thus we can also (quite) efficiently compare mathematical formulas and logical expressions, although it's worth noting that these comparisons are only at a syntactic level: they carry no semantic information (for that, use a generalization order, such as subsumption). For example, the formulas "mortal(X)", "mortal(socrates)", "mortal(plato)" are all at a tree edit distance of one (simply swap X for socrates and plato), but the first formula implies the two others, whereas neither of the two latter imply each other.

Wednesday, February 5, 2014

Telescoping Sums

Consider generalizations to the closed form formula for arithmetic sums:
\sum_{k=1}^n k = 1+2+\ldots + n = \frac{n(n+1)}{2}
Here's the formula for the square sums:
\sum_{k=1}^n k^2 = 1^2+2^2+\ldots + n^2 = \frac{n(n+1)(2n+1)}{6}
And cube sums: \[
\sum_{k=1}^n k^3 = 1^3+2^3+\ldots + n^3 = \left( \frac{n(n+1)}{2} \right)^2 = \left(\sum_{k=1}^n k \right)^2
These formulas, among many other closed forms (e.g. for recurrence relations and binomial/hyperbolic sums) can be obtained using a deceptively simple technique: telescope sums. In its general form, it amounts to the following observation. For any function \( f \):
\sum_{k=1}^n f(k)-f(k-1) = f(n) - f(0)
since all intermediate terms cancel out.

Applying this simple principle to the function \( f(k)=k^2 \), we get:
\sum_{k=1}^n (k^2-(k-1)^2) = n^2
But we can also expand the square inside the sum:
\sum_{k=1}^n (k^2-(k-1)^2) = \sum_{k=1}^n (2k - 1) = 2 \sum_{k=1}^n k - n
Equating these two results, we obtain the closed form formula for arithmetic sums:
n^2 = 2 \sum_{k=1}^n k - n \iff \sum_{k=1}^n k = \frac{n(n+1)}{2}

We can take the same approach to obtain sums of exponents. For example, the square sum is obtained using the telescoping principle on \( f(k) = k^3 \):
\sum_{k=1}^n (k^3-(k-1)^3) = n^3
Expanding the cube inside the sum:
\sum_{k=1}^n (k^3-(k-1)^3) = \sum_{k=1}^n (3k^2-3k+1) = 3 \sum_{k=1}^n k^2 - 3\frac{n(n+1)}{2} + n
Equating these two results, we obtain the desired closed form formula:
n^3 = 3 \sum_{k=1}^n k^2 - 3\frac{n(n+1)}{2} + n \iff 6 \sum_{k=1}^n k^2 = 2n^3 + 3n(n+1) - 2n = n(n+1)(2n+1)

This gives a recursive method of computing \( \sum_{k=1}^n k^p \) for any positive integer \( p \).

To obtain a non-recursive method (that doesn't require computing all sums with lower exponents), note that in general, the sum with exponent \( p \) will be a polynomial of degree \( p+1 \), so another way to solve the sums is to performs an ansatz and compute the polynomial coefficients from particular values. 

Let's try the non-recursive algorithm to find the sum with \( p=4 \):
\sum_{k=1}^n k^4 = 1^4+2^4+\ldots + n^4
without first solving the sum for \( p=3 \).
For \( p=4 \), we know the closed form is a polynomial of degree \( p+1=5 \). So we set:
\sum_{k=1}^n k^4 = 1^4+2^4+\ldots + n^4 = a_5 n^5 + a_4 n^4 + a_3 n^3 + a_2 n^2 + a_1 n + a_0
We have 6 unknown coefficients, so we need (at least) 6 equations. These can be obtained by computing the quartic sum for \( n=1,2,\ldots,6 \):
n=1 \implies a_5 + a_4 + a_3 + a_2 + a_1 + a_0 = 1
n=2 \implies 2^5 a_5 + 2^4 a_4 + 2^3 a_3 + 2^2 a_2 + 2 a_1 + a_0 = 17
n=3 \implies 3^5 a_5 + 3^4 a_4 + 3^3 a_3 + 3^2 a_2 + 3^1 a_1 + a_0 = 98
and so on. 

Solving the linear system, we obtain: \( a_5=1/5, a_4=1/2, a_3=1/3,a_2=a_1=0, a_0=-1/30 \). You can either do this by hand (time consuming), or using a computer algebra system, or using this online calculator by copy-pasting the following:
a + b + c + d + t + f = 1
32*a + 16*b + 8*c + 4*d + 2*t + f = 17
243*a + 81*b + 27*c + 9*d + 3*t + f = 98
1024*a + 256*b + 64*c + 16*d + 4*t + f = 354
3125*a + 625*b + 125*c + 25*d + 5*t + f = 979
7776*a + 1296*b + 216*c + 36*d + 6*t + f = 2275

The identity is thus:
\sum_{k=1}^n k^4 = 1^4+2^4+\ldots + n^4 = \frac{n^5}{5} + \frac{n^4}{2} + \frac{n^3}{3} - \frac{1}{30}

This approach is taken far beyond sums of powers in the computer algebra development of Wilf-Zeilberger pairs.

Monday, December 9, 2013

Solving Logic Riddles using AI Programming

Consider three persons, Caesar, Brutus, and Cassius. One is a king, another is a bureaucrat, and the third is a spy. The king never lies, whereas the bureaucrat always lies. The spy either lies or tells the truth. Caesar claims that Cassius is a bureaucrat, Brutus claims that Caesar is a king, Cassius claims that he is a spy.

Who is who?

The solution, by hand

Let's start by solving this without computer assistance. Cassius cannot be a king: he is asserting that he is a spy, so if he were king, he would be lying - which is impossible for the king to do. Likewise, Brutus cannot be king, since he claims Caesar is, and so would be lying. So Caesar is the king. Brutus is then telling the truth, so he cannot be a bureaucrat (the Bureaucrat always lies). So Brutus must be a spy, and, by exclusion, Cassius is a bureaucrat (which is consistent, since Cassius is lying when asserting he is a spy). To summarize, there is a unique solution, which is:

Caesar: king.
Brutus: spy.
Cassius: bureaucrat.

The solution, AI programming

How would you encode information about this problem in such a way as to let the computer solve it? Sure, in this case the solution was easy to obtain, but if you had 100+ rules and a lot more people it would not be easy to solve the problem by hand (think of the problem as an equation with values true/false). Logic programming, as the name suggest, provides an intuitive and natural way to express logical riddles (among other things). The following in Prolog:

First, tell the computer Caesar, Brutus, and Cassius are persons:


Next, we would like to assign King, Liar (Bureaucrat is too long to type), and spy to these persons:

assign(K,L,S) :-
person(K), % king is a person
person(L), % liar is a person
person(S), % spy is a person
king(K,L,S), % constraints for king (to be defined later)
liar(K,L,S),  % constraints for liar
spy(K,L,S), % constraints for spy
different(K,L,S). % king, liar, and spy must be different people

We must thus defined the constraints imposed by virtue of being King, Liar, and Spy:

king(K,L,S) :- says(K,K,L,S). % Whatever King says is true
liar(K,L,S) :- \+ says(L,K,L,S). % Whatever Liar says is false (think of "\+" as "not true")
spy(_,_,_). % Anybody can be a spy

Now onto the claims that each of them make:

says(caesar,K,L,S) :- L = cassius. % Caesar says Cassius is Liar
says(brutus,K,L,S) :- K = caesar. % Brutus says Caesar is King
says(cassius,K,L,S) :- S = cassius. % Cassius says he himself is Spy

Finally, we define what "different" means:

different(X,Y,Z) :- X \= Y, X \= Z, Y \= Z. % Not the same: X,Y,Z 

Now we may query Prolog to obtain all solutions:

?- assign(K,L,S).
K = caesar, <--- Caesar is King
L = cassius, <--- Liar is Cassius
S = brutus ; <--- Spy is Brutus
false. <---- This means there are no more solutions

An interesting change

To see that this really works, consider what would happen if we don't require that there must be a unique King, Liar, and Spy. For example, one person could then be both King and Spy. This is effectively done by removing the requirement that they are all different: either remove different(K,L,S) from assign(K,L,S), or define it trivially as different(_,_,_), meaning there are no constraints).

Before executing the query, let's make some observations: a person cannot both the king and liar, assuming that person makes at least one claim. Since everybody makes a claim here, nobody can be both king and liar. So at least two people must be assigned the properties {king, liar, spy}. Clearly anybody can in principle be both king and spy, or liar and spy, since spy adds no new constraints. Furthermore, our original solution should still be part of the new solution set, since we've only removed one constraint.

How many solutions do you count? This problem is clearly more difficult than the previous one. Our query will reveal all solutions:

?- assign(K,L,S).
Solution 1:
K = S, S = caesar, <--- Caesar is both king and spy
L = cassius  <--- Cassius is liar

Solution 2 (This is the original solution):
K = caesar,
L = cassius,
S = brutus 

Solution 3:
K = S, S = cassius, <--- Cassius is king and spy
L = caesar <--- Caesar is Liar

Solution 4:
K = S, S = cassius, <--- As above, Cassius is king and spy
L = brutus <--- This time Brutus is the liar (instead of Caesar)

false. <--- No more solutions.

There is an asymmetry of solutions: when Cassius is both king and spy (solution 3 and 4), the liar can be Caesar or Brutus. But when Caesar is king and spy (solution 1), only Cassius can be a liar, not Brutus. This is because Brutus asserts that Caesar is king, and liars are never allowed to tell the truth.

Sunday, December 1, 2013

Which shape gives the largest area?

The Isoperimetric Inequality

Given a rope of length L, say 100 meters, what is the largest closed area you can make?
If we try a square, each side has to be L/4 (25 meters), and we'll end up with an area of (L/4)^2 = 625 square meters. If we however try a circle, the circumference is L=2πR, where R is the radius. Thus R = L/(2π), and the area is π(L/2π)^2 = L^2/(4π) = 796 square meters, which is better than the square.

But there's plenty more shapes to try, in fact, infinitely many. So which should we pick to maximize the area?
The ancients knew the answer but could not prove it: it's the circle. The proof of this is surprisingly simple using calculus. Let L be the length of the rope, and let A be the enclosed area.

Note that for a circle, the relationship between A and L is L=2πR and A=πR^2, where R is the radius, so that:

Given the belief that the circle gives the largest area, we expect that for any shape,

This is known as the isoperimetric inequality.

Hurwitz's Proof

Assume that the curve (x(t),y(t)) is parametrized so that it does one full revolution when t goes from 0 to π. The arc length is defined by s = Lt/(2π). Note that when t=0, s=0, and when t=2π, s=L.

We then have:

As for the area, we use Green's theorem:

Now, simply take the difference we want to show is always non-negative:

We want to conclude that this expression is non-negative for any curve. The first integral is clearly non-negative, since its integrand is a square. As for the second integral, the result is not obvious until one consider its Fourier series (this is called Wirtinger's inequality):

As we will see soon, for this to work, we must place the curve in coordinates such that:

This can always be done, as the placement of the curve is arbitrary: if the integral of y(t) is V, then the translated function y(t)-V/(2π) will have integral 0.

Next, differentiate y(t) to obtain:

Then, by Parseval's identity:

which concludes the proof. (The requirement a0=0 is to ensure that the last inequality holds for n=0).
The inequality thus obtained is known as the isoperimetric inequality:

Only the Circle Yields Maximum Area

As we saw in the beginning, the circle turns the inequality into an equality, thus yielding the maximum possible area. But this still leaves open the possibility that other shapes may be equally good. We can show that only the circle yields max area by considering when the inequality becomes an equality: precisely when the two integrals are zero.

Starting with the second integral, the inequality was obtained from comparing the square sums after applying Parseval's identity. To have equality, the terms in those sums must be 0 except when n=1, which means that

The first integral must also be zero, which means that

Integrating, we get

So indeed we must have a circle centered at (D,0).

The fact that the circle must be centered around y=0 is due to our previous choice of fixing the integral of y(t) over one period. The center of mass must be such that equally much is positive and negative, which for a circle clearly means fixing it around y=0.

How general is this proof?

The use of Green's theorem requires that we deal with a curve (x(t),y(t)) that has a continuous derivative except in finitely many points (piecewise smooth). Piecewise smoothness is also enough to ensure that the Fourier series can be differentiated term by term. Any area you can physically construct will certainly satisfy these conditions, as will anything you draw with a pen, since it is impossible to create or draw infinitely many edged points.