Sunday 23 February 2014

Edit Distance

Edit Distance

One of the classic problems solvable by dynamic programming is computing the “edit distance” between two strings: the minimum number of single-character insertions, deletions, or replacements required to convert one string into another.

For example, the edit distance between “Hello” and “Jello” is 1. The edit distance between “good” and “goodbye” is 3. The edit distance between any string and itself is 0.

The edit distance can be used for such purposes as suggesting, in a spell checker, a list of plausible replacements for a misspelled word. For each word not found in the dictionary (and therefore presumably misspelled), list all words in the dictionary that are a small edit distance way from the misspelling.

Breaking the Problem Down

We can compute the edit distance between two strings by working from the ends. The three operations available to us are
  1. Add one character to the end of a string
  2. Remove one character from the end of a string
  3. Change the character at the end of a string.
Suppose, for example, that we wanted to compute the edit distance between “Zeil” and “trials” (make up your own joke here). Starting with “Zeil”, we consider what would have to be done to get “trials” if the last step taken were “add”, “remove”, or “change”, respectively:
  1. If we knew how to convert “Zeil” to “trial”, we could add “s” to get the desired word.
  2. If we knew how to convert “Zei” to “trials”, then we would actually have “trialsl” and we could remove that last character to get the desired word.
  3. If we knew how to convert “Zei” to “trial”, then we would actually have “triall” and we could change the final “l” to “s” to get the desired word.
Now, being intelligent humans that can actually look ahead intuitively at a problem, you and I may think one or two of these alternatives are obvious losers, but bear with me. We're developing a general algorithm for not-so-intelligent computers.
Notice that what we have done is to reduce the original problem to 3 “smaller” problems: convert “Zeil” to “trial”, or “Zei” to “trials”, or “Zei” to “trial”.

Breaking it Down - Stage 2

We continue, recursively, to break down these problems:
  1. Convert “Zeil” to “trial”, then add “s” to get the desired word.
To convert “Zeil” to “trial”,
Add
Convert “Zeil” to “tria”, then add “l”
Remove
Convert “Zei” to “trial”, giving “triall”, then remove.
Change
Convert “Zei” to “tria”, giving “trial”, and no change is actually needed.
  1. Convert “Zei” to “trials”, giving “trialsl”, then remove the last character.
To convert “Zei” to “trials”,
Add
Convert “Zei” to “trial”, then add “s”
Remove
Convert “Ze” to “trials”, giving “trialsi”, then remove.
Change
Convert “Ze” to “trial”, giving “triali”, and change the final character.
  1. Convert “Zei” to “trial”, giving “triall”, then change the final “l” to “s”.
To convert “Zei” to “trial”,
Add
Convert “Zei” to “tria”, then add “l”
Remove
Convert “Ze” to “trial”, giving “triali”, then remove.
Change
Convert “Ze” to “tria”, giving “triai”, and change the final character.
Now we have nine subproblems to solve, but note that the strings involved are getting shorter. Eventually we will get down to subproblems involving an empty string, such as `Convert “” to “xyz”,' which can be trivially solved by a series of “Adds”.


A Recursive Solution

Here you see the recursive implementation of the edit distance calculation.
In the main portion, we don't know, off hand, whether the cheapest way to convert one string into another involves a final add, remove, or change, so we evaluate all three possibilities and return the minimum distance from among the three.
In each case, we recursively compute the distance (number of adds, removes, and changes) required to “set up” a final add, a final remove, or a final change. We add one to the add distance and the remove distance to account for the final add or remove. For the change distance, we add one only if the final characters in the strings are different (if not, no final change is required).

int editDistance (string x, string y)
{
  if (x == "")
     return y.length(); // base case
  else if (y == "")
     return x.length(); // base case
  else
    {
      int addDistance = editDistance(x, y.substr(0,y.length()-1)) + 1;
      int removeDistance = editDistance(x.substr(0,x.length()-1), y) + 1;
      int changeDistance = editDistance(x.substr(0,x.length()-1), 
                                        y.substr(0,y.length()-1)) 
        + (x[x.length()-1] == y[y.length()-1]) ? 0 : 1;
      return min(min(addDistance, removeDistance), changeDistance);
    }
}

A Dynamic Programming Solution

We can solve the same problem via dynamic programming by, again, reversing the direction so that we work the smaller subproblems first, keeping the answers in a table.

For example, in converting Zeil to trials, we start by forming a table of the cost (edit distance) to convert  to ttrtri, etc.:

trials
0123456

In other words, we need 0 steps to convert  to , 1 to convert  to t, 2 to convert  to tr, and so on.
Next, we add a row to describe the cost of converting Z to ttr, …, trials:

trials
0123456
Z1123456

OK, clearly we need 1 step to convert Z to . How are the other entries in this row computed?

Looking Closer
Let's back up just a bit:
trials
0123456
Z1?

What's the minimum cost to convert Z to t? It's the smallest of the three values computed as
Add
1 plus the cost of converting Z to  (we get this cost by looking to the left one position).
Remove
1 plus the cost of converting  to t, giving tZ (we get this cost by looking up one position).
Change
1 (because Z and t are different characters) plus the cost of converting  to  (we get this cost by looking diagonally up and to the left one position).
The last of these yields the minimal distance: 1.

Applying the General Rule
trials
0123456
Z11?
What's the minimum cost to convert Z to tr? It's the smallest of the three values computed as
Add
1 plus the cost of converting Z to t (we get this cost by looking to the left one position).
Remove
1 plus the cost of converting  to tr, giving trZ (we get this cost by looking up one position).
Change
1 (because Z and t are different characters) plus the cost of converting  to t (we get this cost by looking diagonally up and to the left one position).
The last of these yields the minimal distance: 2.
Got the idea? Try filling in the rest of the row before reading further.
Filling In The Table
trials
0123456
Z1123456
And we add the next row, using the same technique:
trials
0123456
Z1123456
e2223456
The row after that becomes a bit more interesting. When we get this far:
trials
0123456
Z1123456
e2223456
i333?
we are looking at the cost of converting Zei to tri. It's the smallest of the three values computed as
Add
1 plus the cost of converting Zei to tr (we get this cost by looking to the left one position).
Remove
1 plus the cost of converting Ze to tri, giving trii (we get this cost by looking up one position).
Change
Zero (because i and i are the same character) plus the cost of converting Ze to tr (we get this cost by looking diagonally up and to the left one position).
The last of these yields the minimal cost of 2.
trials
0123456
Z1123456
e2223456
i3332?
and then we can fill out the rest of the row:
trials
0123456
Z1123456
e2223456
i3332345
and finally, the last row of the table:
trials
0123456
Z1123456
e2223456
i3332345
l4443334
Note that this last row, again, has a situation where the cost of a change is zero plus the subproblem cost, because the two characters involved are the same (l).
From the lower right hand corner, then, we read out the edit distance between Zeil and trials as 4.

No comments:

Post a Comment