Wednesday, February 26, 2014

How similar are two words? Or two formulas?

The problem

Consider the word "John". How similar is it to "Jon", "Johan", and "on"?

This of course depends on precisely what we mean by similar. But we have an intuition that "John" and "Jon" should be more similar than, say, "John" and "Table". In particular, "similarity" should have something to do with the occurrence of the same characters, as well as the order in which they occur. But can we make this idea precise?

The Solution

Yes, we can make this idea precise. The key is to consider how many edits would be necessary to transform one word into another:
To go from "John" to "Jon", simply remove the "h". (Or conversely, start with "Jon" and add an "h".)
To go from "John" to "Johan", add an "a". (Or start with "Johan" and remove "a".)
To go from "John" to "on", remove the "J" and the "h". (Or start with "on" and add "J" and "h".)

Thus one edit suffices between "John" and "Jon", and likewise between "John" and "Johan", but two edits are needed between "John" and "on". Clearly, the fewer the edits, the more similar two words are. In particular, a word has zero edit distance to itself, and only to itself, so it is most similar to itself.

Here we have assumed that an edit is a removal or addition. Given two words of length N and M, in the worst case we have to remove all N letters and then add all M, giving an edit distance of N+M. This happens precisely when the two words have no letters in common, so nothing can be reused: "John" and "Strawberry" have edit distance 4+10=14 since they share no letters.

We could also consider other operations, such as swapping a letter for another. Thus the edit distance from "John" to "Joan" is one edit (swap "h" for "a") instead of two (remove "h", add "a"). Then the maximum edit distance between two words of length N and M is max(N,M), since we can always swap the first min(N,M) letters and then add the remaining max(N,M)-min(N,M). For example, from "John" to "Strawberry", start by swapping "J" to "S", "o" to "t", "h" to "r", and "n" to "a", giving "John" -> "Stra", then add the remainder of the word, "wberry", for a total of 10 edits.

When the edits are specifically addition, removal, and swapping of characters, it is known as the Levenshtein distance. For the remainder of this post, this is implicitly what we'll mean by "edit distance".

Now we know the minimum (0) and maximum (max(N,M)) edit distances. But how do we compute the distance in general? From a logical point of view, we could "try everything":

Let \( s_i \) be the substring obtained by keeping the first i characters of a string s and cutting off the rest. For example, if \( s \) = "John", then \( s_0 \) = "" and \( s_2 \) = "Jo". Note that \( s = s_4 \).

The edit distance between two strings \( s \) and \( t \) of length N and M, respectively, can be defined using two base cases, namely when the strings run out:
\[ d(s_0, t_j) = j \]
\[ d(s_i, t_0) = i \]
and a recursive case when \( i > 0 \) and \( j > 0 \):
\[ d(s_i, t_j) = \min \{ d(s_{i-1},t_j) + 1, d(s_i,t_{j-1}) + 1, d(s_{i-1},t_{j-1}) + [s[i] \neq t[j]]  \} \]
where \( [s[i] \neq t[j]] \) is 1 when the character at position i in s differs from the character at position j in t, and 0 otherwise. The computation is started at \( d(s,t) = d(s_N, t_M) \).
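
As a sketch, this definition can be transcribed directly into Python (the function name d simply mirrors the notation above):

def d(s, t, i=None, j=None):
    """Edit distance between s_i and t_j, by direct recursion on the definition."""
    if i is None:
        i, j = len(s), len(t)  # start the computation at d(s_N, t_M)
    if i == 0:                 # base case: s has run out
        return j
    if j == 0:                 # base case: t has run out
        return i
    return min(
        d(s, t, i - 1, j) + 1,                           # remove a character
        d(s, t, i, j - 1) + 1,                           # add a character
        d(s, t, i - 1, j - 1) + (s[i - 1] != t[j - 1]),  # swap if needed
    )

print(d("John", "Jon"), d("John", "Johan"), d("John", "on"))  # 1 1 2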

The edit distance is a metric (as long as all edit costs are positive real numbers), so it does define a "distance" in a fairly intuitive sense of the word. In particular, the triangle inequality holds.

Efficient Implementation

The definition above is triply recursive, so a direct implementation (like the sketch above) takes exponential time.

A much more efficient implementation can be obtained by observing that for the recursive case, \( d(s_i, t_j) \) only needs to know the values of \( d(s_{i-1},t_j) \), \(  d(s_i,t_{j-1}) \), and \( d(s_{i-1},t_{j-1}) \).
Thus, if we start by computing all the base cases \( d(s_0, t_j) = j \) for all \( 0 \leq j \leq M \) and \( d(s_i, t_0) = i \) for all \( 0 \leq i \leq N \), we can then "unfold" the recursive values row by row until we reach \( d(s_N, t_M) \).

For example, consider the case s = "John" and t = "on".

From the base cases, d("","on") = 2, d("","o") = 1, and d("","") = 0. Likewise, d("John","") = 4, d("Joh","") = 3, d("Jo","") = 2, and d("J","") = 1.

Then, using the recursive formula:
  1. d("J","o") = 1+d("","") = 1
  2. d("Jo",o") = min{ d("Jo","")+1, d("J","o")+1, d("J","")+0 } = min { 3, 2, 1 } = 1
  3. d("Joh","o") = min { d("Joh","")+1, d("Jo","o")+1, d("Jo","")+1 } = min { 4, 2, 3 } = 2
  4. d("John","o") = min { d("John","")+1, d("Joh","o")+1, d("Joh","")+1 } = min { 4, 3, 4 } = 3
And using this, we can compute:
  1. d("J","on") = min { d("J","o")+1, d("","on")+1, d("","o")+1 } = min { 2, 3, 2 } = 2
  2. d("Jo","on") = min { d("Jo","o")+1, d("J","on")+1, d("J","o")+1 } = min { 2, 2, 2 } = 2
  3. d("Joh","on") = min { d("Joh","o")+1, d("Jo","on")+1, d("Jo","o")+1 } = min { 3, 3, 2 } = 2
  4. d("John","on") = min { d("John","o")+1, d("Joh","on")+1, d("Joh","o")+0 } = min { 4, 3, 2 } = 2
Thus d("John","on") = 2, and if we trace back a path, d("John","on") = d("Joh","o") = d("Jo","o")+1 = d("J","")+1 = 1+1 = 2.  This corresponds to adding the "J" and "h" to "on".

We can visualize this algorithm as constructing a matrix row by row, giving time complexity \( O(NM) \). Only the previous row needs to be kept in memory, so the memory usage is \( O(\min(N,M)) \) (letting the rows run along the shorter string). However, if we wish to obtain not only the distance but also the operations needed to go from one string to the other, we need to store the entire matrix, using \( O(NM) \) memory. Further improvements to this algorithm also exist.
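
As a sketch, here is the two-row version in Python:

def edit_distance(s, t):
    """Levenshtein distance, keeping only the previous row in memory."""
    if len(t) > len(s):
        s, t = t, s  # let the rows run along the shorter string
    previous = list(range(len(t) + 1))  # base cases d(s_0, t_j) = j
    for i in range(1, len(s) + 1):
        current = [i]                   # base case d(s_i, t_0) = i
        for j in range(1, len(t) + 1):
            current.append(min(
                previous[j] + 1,        # remove a character
                current[j - 1] + 1,     # add a character
                previous[j - 1] + (s[i - 1] != t[j - 1]),  # swap if needed
            ))
        previous = current
    return previous[-1]

assert edit_distance("John", "on") == 2
assert edit_distance("John", "Joan") == 1
assert edit_distance("John", "Strawberry") == 10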

Generalization to formulas (trees)

Similarity between strings/sequences can be generalized to trees (and forests) by again considering edit distances, where one adds, removes, and swaps nodes instead. Surprisingly, an algorithm with time complexity \( O(N^3) \) is known, which also reduces to \( O(N^2) \) for strings. Thus we can also (quite) efficiently compare mathematical formulas and logical expressions, although it's worth noting that these comparisons are only at a syntactic level: they carry no semantic information (for that, use a generalization order, such as subsumption). For example, the formulas "mortal(X)", "mortal(socrates)", and "mortal(plato)" are all at a tree edit distance of one from each other (simply swap X for socrates or plato), but the first formula implies the other two, whereas neither of the latter two implies the other.


Wednesday, February 5, 2014

Telescoping Sums

Consider generalizations of the closed-form formula for arithmetic sums:
\[
\sum_{k=1}^n k = 1+2+\ldots + n = \frac{n(n+1)}{2}
\]
Here's the formula for the square sums:
\[
\sum_{k=1}^n k^2 = 1^2+2^2+\ldots + n^2 = \frac{n(n+1)(2n+1)}{6}
\]
And cube sums: \[
\sum_{k=1}^n k^3 = 1^3+2^3+\ldots + n^3 = \left( \frac{n(n+1)}{2} \right)^2 = \left(\sum_{k=1}^n k \right)^2
\]
These formulas, among many other closed forms (e.g. for recurrence relations and binomial/hypergeometric sums), can be obtained using a deceptively simple technique: telescoping sums. In its general form, it amounts to the following observation. For any function \( f \):
\[
\sum_{k=1}^n (f(k)-f(k-1)) = f(n) - f(0)
\]
since all intermediate terms cancel out.
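For instance, with \( n = 3 \):
\[
\sum_{k=1}^3 (f(k)-f(k-1)) = (f(1)-f(0)) + (f(2)-f(1)) + (f(3)-f(2)) = f(3) - f(0)
\]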

Applying this simple principle to the function \( f(k)=k^2 \), we get:
\[
\sum_{k=1}^n (k^2-(k-1)^2) = n^2
\]
But we can also expand the square inside the sum:
\[
\sum_{k=1}^n (k^2-(k-1)^2) = \sum_{k=1}^n (2k - 1) = 2 \sum_{k=1}^n k - n
\]
Equating these two results, we obtain the closed form formula for arithmetic sums:
\[
n^2 = 2 \sum_{k=1}^n k - n \iff \sum_{k=1}^n k = \frac{n(n+1)}{2}
\]

We can take the same approach to obtain sums of higher powers. For example, the square sum is obtained by applying the telescoping principle to \( f(k) = k^3 \):
\[
\sum_{k=1}^n (k^3-(k-1)^3) = n^3
\]
Expanding the cube inside the sum:
\[
\sum_{k=1}^n (k^3-(k-1)^3) = \sum_{k=1}^n (3k^2-3k+1) = 3 \sum_{k=1}^n k^2 - 3\frac{n(n+1)}{2} + n
\]
Equating these two results, we obtain the desired closed form formula:
\[
n^3 = 3 \sum_{k=1}^n k^2 - 3\frac{n(n+1)}{2} + n \iff 6 \sum_{k=1}^n k^2 = 2n^3 + 3n(n+1) - 2n = n(n+1)(2n+1)
\]

This gives a recursive method of computing \( \sum_{k=1}^n k^p \) for any positive integer \( p \).
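
As a sketch, here is one way to carry out this recursion in Python with SymPy:

import sympy as sp

n, k = sp.symbols('n k')

def power_sum(p):
    """Closed form of sum_{k=1}^n k^p via the telescoping recursion."""
    if p == 0:
        return n  # sum_{k=1}^n 1 = n
    # Telescoping with f(k) = k^(p+1) gives
    #   sum_{k=1}^n (k^(p+1) - (k-1)^(p+1)) = n^(p+1),
    # and the left-hand side expands to (p+1)*sum(k^p) plus lower-power sums.
    poly = sp.Poly(sp.expand(k**(p + 1) - (k - 1)**(p + 1)), k)
    total = n**(p + 1)
    for q in range(p):  # move the lower-power sums to the right-hand side
        total -= poly.coeff_monomial(k**q) * power_sum(q)
    return sp.factor(total / poly.coeff_monomial(k**p))  # leading coefficient is p+1

print(power_sum(2))  # n*(n + 1)*(2*n + 1)/6
print(power_sum(3))  # n**2*(n + 1)**2/4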

To obtain a non-recursive method (one that doesn't require computing all the sums with lower exponents), note that in general the sum with exponent \( p \) is a polynomial of degree \( p+1 \) in \( n \), so another way to solve the sums is to make an ansatz and compute the polynomial coefficients from particular values.

Let's try the non-recursive algorithm to find the sum with \( p=4 \):
\[
\sum_{k=1}^n k^4 = 1^4+2^4+\ldots + n^4
\]
without first solving the sum for \( p=3 \).
For \( p=4 \), we know the closed form is a polynomial of degree \( p+1=5 \). So we set:
\[
\sum_{k=1}^n k^4 = 1^4+2^4+\ldots + n^4 = a_5 n^5 + a_4 n^4 + a_3 n^3 + a_2 n^2 + a_1 n + a_0
\]
We have 6 unknown coefficients, so we need (at least) 6 equations. These can be obtained by computing the quartic sum for \( n=1,2,\ldots,6 \):
\[
n=1 \implies a_5 + a_4 + a_3 + a_2 + a_1 + a_0 = 1
\]
\[
n=2 \implies 2^5 a_5 + 2^4 a_4 + 2^3 a_3 + 2^2 a_2 + 2 a_1 + a_0 = 17
\]
\[
n=3 \implies 3^5 a_5 + 3^4 a_4 + 3^3 a_3 + 3^2 a_2 + 3^1 a_1 + a_0 = 98
\]
and so on. 

Solving the linear system, we obtain: \( a_5=1/5, a_4=1/2, a_3=1/3, a_2=0, a_1=-1/30, a_0=0 \). You can do this by hand (time consuming), use a computer algebra system, or use this online calculator by copy-pasting the following:
a + b + c + d + t + f = 1
32*a + 16*b + 8*c + 4*d + 2*t + f = 17
243*a + 81*b + 27*c + 9*d + 3*t + f = 98
1024*a + 256*b + 64*c + 16*d + 4*t + f = 354
3125*a + 625*b + 125*c + 25*d + 5*t + f = 979
7776*a + 1296*b + 216*c + 36*d + 6*t + f = 2275
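
For those who prefer code, here is a sketch of setting up and solving the same system in Python with SymPy:

import sympy as sp

coeffs = sp.symbols('a5 a4 a3 a2 a1 a0')

# One equation per sample point: the ansatz evaluated at n, minus the true quartic sum.
equations = [
    sum(c * n**e for c, e in zip(coeffs, range(5, -1, -1)))
    - sum(k**4 for k in range(1, n + 1))
    for n in range(1, 7)
]

print(sp.solve(equations, coeffs))
# {a5: 1/5, a4: 1/2, a3: 1/3, a2: 0, a1: -1/30, a0: 0}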

The identity is thus:
\[
\sum_{k=1}^n k^4 = 1^4+2^4+\ldots + n^4 = \frac{n^5}{5} + \frac{n^4}{2} + \frac{n^3}{3} - \frac{n}{30}
\]


This approach is taken far beyond sums of powers in the computer algebra development of Wilf-Zeilberger pairs.