Approximate string matching is not a stranger problem.
I am learning and trying to understand how to solve it. I even now don't want to get too deep into it and just want to understand the brute-force way.
In its wiki page (Approximate string matching), it says
A brute-force approach would be to compute the edit distance to P (the pattern) for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(m * n^3), n is the length of T, m is the length of P
Ok. I understand this statement in the following way:
- We find out all possible substrings of T
- We compute the edit distance of each pair of strings {P, t1}, {P, t2}, ...
- We find out which substring has the shortest distance from P and this substring is the answer.
I have the following question:
a. I can use two for-loop to get all possible substrings and this requires O(n^2). So when I try to compute the edit distance of one substring and the patter, does it need O(n*m)? Why?
b. How exactly do I compute the distance of one pair (one substring and the patter)? I know I can insert, delete, substitute, but can anyone give me a algorithm that do just the calculation for one pair?
Thanks
Edit
Ok, I should use Levenshtein distance, but I don't quite understand its method.
Here is part of the code
for j from 1 to n
{
for i from 1 to m
{
if s[i] = t[j] then
d[i, j] := d[i-1, j-1] // no operation required
else
d[i, j] := minimum
(
d[i-1, j] + 1, // a deletion
d[i, j-1] + 1, // an insertion
d[i-1, j-1] + 1 // a substitution
)
}
}
So, assume I am now comparing {"suv", "svi"}.
So 'v' != 'i', then I have to see three other pairs:
{"su", "sv"}{"suv", "sv"}{"su", "svi"}
How can I understand this part? Why I need to see these 3 parts?
Does the distance between two prefixes mean that we need distance number of changes in order to make the two prefixes (or strings) equal?
So, let's take a look at {"su", "sv"}. We can see that distance of {"su", "sv"} is 1. Then how can {"su", "sv"} become {"suv", "svi"} by just adding 1? I think we need to insert 'v' into "su" and 'v' into "sv" and then substitute the last 'i' with 'v', which has 3 operations involved, right?