Edit Distance
One of the classic problems solvable by dynamic programming
is computing the “edit distance” between two strings: the minimum number of
single-character insertions, deletions, or replacements required to convert one
string into another.
For example, the edit distance between “Hello” and “Jello”
is 1. The edit distance between “good” and “goodbye” is 3. The edit distance
between any string and itself is 0.
The edit distance can be used for such purposes as
suggesting, in a spell checker, a list of plausible replacements for a
misspelled word. For each word not found in the dictionary (and therefore
presumably misspelled), list all words in the dictionary that are a small edit
distance way from the misspelling.
Breaking the Problem Down
We can compute the edit distance between two strings by
working from the ends. The three operations available to us are
- Add
one character to the end of a string
- Remove
one character from the end of a string
- Change
the character at the end of a string.
Suppose, for example, that we wanted to compute the edit
distance between “Zeil” and “trials” (make up your own joke
here). Starting with “Zeil”, we consider what would have to be done to get “trials” if
the last step taken were “add”, “remove”, or “change”,
respectively:
- If we
knew how to convert “Zeil” to “trial”, we could add “s” to
get the desired word.
- If we
knew how to convert “Zei” to “trials”, then we would
actually have “trialsl” and we could remove that
last character to get the desired word.
- If we
knew how to convert “Zei” to “trial”, then we would
actually have “triall” and we could change the
final “l” to “s” to get the desired word.
Now, being intelligent humans that can actually look ahead
intuitively at a problem, you and I may think one or two of these alternatives
are obvious losers, but bear with me. We're developing a general algorithm for
not-so-intelligent computers.
Notice that what we have done is to reduce the original
problem to 3 “smaller” problems: convert “Zeil” to “trial”,
or “Zei” to “trials”, or “Zei” to “trial”.
Breaking it Down - Stage 2
We continue, recursively, to break down these problems:
- Convert “Zeil” to “trial”,
then add “s” to get the desired word.
To convert “Zeil” to “trial”,
Add
Convert “Zeil” to “tria”,
then add “l”
Remove
Convert “Zei” to “trial”,
giving “triall”, then remove.
Change
Convert “Zei” to “tria”,
giving “trial”, and no change is actually needed.
- Convert “Zei” to “trials”,
giving “trialsl”, then remove the last character.
To convert “Zei” to “trials”,
Add
Convert “Zei” to “trial”,
then add “s”
Remove
Convert “Ze” to “trials”,
giving “trialsi”, then remove.
Change
Convert “Ze” to “trial”,
giving “triali”, and change the final character.
- Convert “Zei” to “trial”,
giving “triall”, then change the final “l” to “s”.
To convert “Zei” to “trial”,
Add
Convert “Zei” to “tria”,
then add “l”
Remove
Convert “Ze” to “trial”,
giving “triali”, then remove.
Change
Convert “Ze” to “tria”,
giving “triai”, and change the final character.
Now we have nine subproblems to solve, but note that the
strings involved are getting shorter. Eventually we will get down to
subproblems involving an empty string, such as `Convert “” to “xyz”,'
which can be trivially solved by a series of “Adds”.
A Recursive Solution
Here you see the recursive implementation of the edit
distance calculation.
In the main portion, we don't know, off hand, whether the
cheapest way to convert one string into another involves a final add, remove,
or change, so we evaluate all three possibilities and return the minimum
distance from among the three.
In each case, we recursively compute the distance (number of
adds, removes, and changes) required to “set up” a final add, a final
remove, or a final change. We add one to the add distance and the remove
distance to account for the final add or remove. For the change distance, we
add one only if the final characters in the strings are different (if not, no
final change is required).
int editDistance (string x, string y) { if (x == "") return y.length(); // base case else if (y == "") return x.length(); // base case else { int addDistance = editDistance(x, y.substr(0,y.length()-1)) + 1; int removeDistance = editDistance(x.substr(0,x.length()-1), y) + 1; int changeDistance = editDistance(x.substr(0,x.length()-1), y.substr(0,y.length()-1)) + (x[x.length()-1] == y[y.length()-1]) ? 0 : 1; return min(min(addDistance, removeDistance), changeDistance); } }
A Dynamic Programming Solution
We can solve the same problem via dynamic programming by, again, reversing the direction so that we work the smaller subproblems first, keeping the answers in a table.
For example, in converting “Zeil” to “trials”, we start by forming a table of the cost (edit distance) to convert “” to “”, “t”, “tr”, “tri”, etc.:
In other words, we need 0 steps to convert “” to “”, 1 to convert “” to “t”, 2 to convert “” to “tr”, and so on.
Next, we add a row to describe the cost of converting “Z” to “”, “t”, “tr”, …, “trials”:
OK, clearly we need 1 step to convert “Z” to “”. How are the other entries in this row computed?
For example, in converting “Zeil” to “trials”, we start by forming a table of the cost (edit distance) to convert “” to “”, “t”, “tr”, “tri”, etc.:
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 |
In other words, we need 0 steps to convert “” to “”, 1 to convert “” to “t”, 2 to convert “” to “tr”, and so on.
Next, we add a row to describe the cost of converting “Z” to “”, “t”, “tr”, …, “trials”:
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
Z | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
OK, clearly we need 1 step to convert “Z” to “”. How are the other entries in this row computed?
Looking Closer
Let's back up just a bit:
What's the minimum cost to convert “Z” to “t”? It's the smallest of the three values computed as
The last of these yields the minimal distance: 1.
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
Z | 1 | ? |
What's the minimum cost to convert “Z” to “t”? It's the smallest of the three values computed as
- Add
- 1 plus the cost of converting “Z” to “” (we get this cost by looking to the left one position).
- Remove
- 1 plus the cost of converting “” to “t”, giving “tZ” (we get this cost by looking up one position).
- Change
- 1 (because “Z” and “t” are different characters) plus the cost of converting “” to “” (we get this cost by looking diagonally up and to the left one position).
Applying the General Rule
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
Z | 1 | 1 | ? |
- Add
- 1 plus the cost of converting “Z” to “t” (we get this cost by looking to the left one position).
- Remove
- 1 plus the cost of converting “” to “tr”, giving “trZ” (we get this cost by looking up one position).
- Change
- 1 (because “Z” and “t” are different characters) plus the cost of converting “” to “t” (we get this cost by looking diagonally up and to the left one position).
Got the idea? Try filling in the rest of the row before reading further.
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
Z | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
Z | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
e | 2 | 2 | 2 | 3 | 4 | 5 | 6 |
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
Z | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
e | 2 | 2 | 2 | 3 | 4 | 5 | 6 |
i | 3 | 3 | 3 | ? |
- Add
- 1 plus the cost of converting “Zei” to “tr” (we get this cost by looking to the left one position).
- Remove
- 1 plus the cost of converting “Ze” to “tri”, giving “trii” (we get this cost by looking up one position).
- Change
- Zero (because “i” and “i” are the same character) plus the cost of converting “Ze” to “tr” (we get this cost by looking diagonally up and to the left one position).
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
Z | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
e | 2 | 2 | 2 | 3 | 4 | 5 | 6 |
i | 3 | 3 | 3 | 2 | ? |
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
Z | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
e | 2 | 2 | 2 | 3 | 4 | 5 | 6 |
i | 3 | 3 | 3 | 2 | 3 | 4 | 5 |
t | r | i | a | l | s | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
Z | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
e | 2 | 2 | 2 | 3 | 4 | 5 | 6 |
i | 3 | 3 | 3 | 2 | 3 | 4 | 5 |
l | 4 | 4 | 4 | 3 | 3 | 3 | 4 |
From the lower right hand corner, then, we read out the edit distance between “Zeil” and “trials” as 4.
No comments:
Post a Comment