Concrete algorithm code for approximate string matching

Question

Approximate string matching is a well-known problem.

I am learning how to solve it. For now I don't want to go too deep; I just want to understand the brute-force approach.

In its wiki page (Approximate string matching), it says

A brute-force approach would be to compute the edit distance to P (the pattern) for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have the running time O(m * n^3), where n is the length of T and m is the length of P.

Ok. I understand this statement in the following way:

We find out all possible substrings of T
We compute the edit distance of each pair of strings {P, t1}, {P, t2}, ...
We find out which substring has the shortest distance from P and this substring is the answer.

I have the following questions:

a. I can use two nested for-loops to enumerate all possible substrings, which takes O(n^2). Does computing the edit distance between one substring and the pattern then take O(n*m)? Why?

b. How exactly do I compute the distance for one pair (one substring and the pattern)? I know I can insert, delete, and substitute, but can anyone give me an algorithm that does the calculation for just one pair?

Thanks

Edit

Ok, I should use Levenshtein distance, but I don't quite understand its method.

Here is part of the code

for j from 1 to n
{
    for i from 1 to m
    {
        if s[i] = t[j] then
            d[i, j] := d[i-1, j-1]         // no operation required
        else
            d[i, j] := minimum
                       (
                         d[i-1, j] + 1,    // a deletion
                         d[i, j-1] + 1,    // an insertion
                         d[i-1, j-1] + 1   // a substitution
                       )
    }
}

So, assume I am now comparing {"suv", "svi"}.

So 'v' != 'i', then I have to see three other pairs:

{"su", "sv"}
{"suv", "sv"}
{"su", "svi"}

How should I understand this part? Why do I need to look at these three pairs?

Does the distance between two prefixes mean the number of changes needed to make the two prefixes (or strings) equal?

So, let's take a look at {"su", "sv"}. The distance of {"su", "sv"} is 1. Then how can {"su", "sv"} become {"suv", "svi"} by adding just 1? I think we need to insert 'v' into "su" and 'v' into "sv" and then substitute the last 'i' with 'v', which involves 3 operations, right?

The most common algorithm for word pair distance is en.wikipedia.org/wiki/Levenshtein_distance
– biziclop, Commented May 28, 2012 at 22:52
@biziclop: I didn't see your comment in time. Post it as an answer instead, and I'll delete mine.
– Aasmund Eldhuset, Commented May 28, 2012 at 22:54

Aasmund Eldhuset · Accepted Answer · 2012-05-29 16:20:26Z

The standard way of measuring the edit distance between two strings is called Levenshtein distance; the Wikipedia page contains pseudocode for the algorithm.
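For reference, here is a minimal Python sketch of that table-filling approach. The function name `levenshtein` and the 0-based indexing into the strings are my own choices for illustration; the pseudocode on the Wikipedia page uses 1-based indices, and the first row and column of the table must be initialised before the loops run.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance between s and t using the full (m+1) x (n+1) table."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i characters of s
    # and the first j characters of t.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of the prefix of s
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of the prefix of t
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # no operation required
            else:
                d[i][j] = 1 + min(
                    d[i - 1][j],      # a deletion
                    d[i][j - 1],      # an insertion
                    d[i - 1][j - 1],  # a substitution
                )
    return d[m][n]


print(levenshtein("suv", "svi"))  # 2
```

Each of the (m+1) x (n+1) cells is filled once with a constant amount of work.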

As for your edit: You need to look at {"su", "sv"} because it is possible that the best way to change "suv" into "svi" is to replace the last v with i, whose cost will come on top of the cost for changing "su" to "sv". Or, it could be that the best way is to change "suv" into "sv" somehow and then add an i. Or, it could be that the best way is to first delete the v from "suv" and then change "su" into "svi". In this case the first two options tie at a total cost of 2, while the third costs 3. The edit distance is therefore indeed 2, and one optimal sequence of operations is to change the u into a v and the v into an i.
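To make the three options concrete, here is a small top-down sketch of the same recurrence. The names `dist` and `candidates` are hypothetical helpers made up for this illustration, and `candidates` assumes both strings are non-empty and end in different characters.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def dist(a: str, b: str) -> int:
    """Edit distance, written top-down so the three sub-problems are explicit."""
    if not a:
        return len(b)  # insert every character of b
    if not b:
        return len(a)  # delete every character of a
    if a[-1] == b[-1]:
        return dist(a[:-1], b[:-1])  # last characters match: no cost
    return 1 + min(
        dist(a[:-1], b),       # delete the last character of a
        dist(a, b[:-1]),       # insert the last character of b
        dist(a[:-1], b[:-1]),  # substitute the last character
    )


def candidates(a: str, b: str) -> dict:
    """Total cost of each option when the last characters of a and b differ."""
    return {
        "substitute": dist(a[:-1], b[:-1]) + 1,  # {"su", "sv"} plus v -> i
        "insert": dist(a, b[:-1]) + 1,           # {"suv", "sv"} plus add i
        "delete": dist(a[:-1], b) + 1,           # {"su", "svi"} plus drop v
    }


print(candidates("suv", "svi"))
# {'substitute': 2, 'insert': 2, 'delete': 3}
```

The distance for {"suv", "svi"} is the smallest of these three totals, which is why the algorithm has to consult all three smaller pairs.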

edited May 29, 2012 at 16:20

answered May 28, 2012 at 22:53

1 Comment

biziclop Over a year ago

And from that pseudocode you can also see why it takes n*m steps.
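Putting the pieces together, the brute-force search from the question could be sketched as follows. This is only an illustration under my own assumptions: `edit_distance` and `best_substring` are made-up names, the distance routine keeps just two rows of the table to save memory, and ties are resolved in favour of the first substring found.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance keeping only two rows: O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def best_substring(pattern: str, text: str):
    """Return (distance, substring) with the minimum distance to pattern."""
    # The empty substring is always available at a cost of len(pattern).
    best = (len(pattern), "")
    n = len(text)
    for start in range(n):                   # O(n) start positions
        for end in range(start + 1, n + 1):  # O(n) end positions
            candidate = text[start:end]
            d = edit_distance(pattern, candidate)  # O(m * n) per substring
            if d < best[0]:
                best = (d, candidate)
    return best


print(best_substring("svi", "a suv here"))
```

There are O(n^2) substrings and each distance computation costs at most O(m * n), which gives the O(m * n^3) bound quoted in the question.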
