Find the longest common subsequence algorithm - low speed

Question

I've designed an algorithm to find the longest common subsequence.

These are the steps:

Pick the first letter in the first string.
Look for it in the second string. If it is found, add that letter to common_subsequence and store its position in index. Otherwise, compare the length of common_subsequence with the length of lcs and, if it is greater, assign its value to lcs.
Return to the first string, pick the next letter, and repeat the previous step, but this time start searching from the index-th letter.
Repeat this process until there is no letter left to pick in the first string. At the end, the value of lcs is the longest common subsequence.

This is an example:

X = A, B, C, B, D, A, B  
Y = B, D, C, A, B, A

Pick A in the first string.
Look for A in Y.
Now that there is an A in the second string, append it to common_subsequence.
Return to the first string and pick the next letter that is B.
Look for B in the second string this time starting from the position of A.
There is a B after A so append B to common_subsequence.
Now pick the next letter in the first string, which is C. There is no C after B in the second string, so assign the value of common_subsequence to lcs, because its length is greater than the length of lcs.

Repeat the previous steps until reaching the end of the first string. In the end the value of lcs is the Longest Common Subsequence.

The complexity of this algorithm is \$\theta(n*m)\$.

I implemented it in two ways. The second one uses a hash table, but after implementing it I found that it is much slower than the first. I can't understand why.

The first algorithm:

import time
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if string is empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        start = 0 # start position in ystr
        for item in xstr[i:]:
            index = ystr.find(item, start) # position at the common letter
            if index != -1: # if common letter is found
                cs += item # add common letter to the cs
                start = index + 1
            if index == len(ystr) - 1: break # if reached to the end of ystr
        # updates lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

The second one using hash table:

import time
from collections import defaultdict
def lcs(xstr, ystr):
    if not (xstr and ystr): return # if strings are empty
    lcs = [''] #  longest common subsequence
    lcslen = 0 # length of longest common subsequence so far
    location = defaultdict(list) # keeps track of items in the ystr
    i = 0
    for k in ystr:
        location[k].append(i)
        i += 1
    for i in xrange(len(xstr)):
        cs = '' # common subsequence
        index = -1
        reached_index = defaultdict(int)
        for item in xstr[i:]:
            for new_index in location[item][reached_index[item]:]:
                reached_index[item] += 1
                if index < new_index:
                    cs += item # add item to the cs
                    index = new_index
                    break
            if index == len(ystr) - 1: break # if reached to the end of ystr
        # update lcs and lcslen if found better cs
        if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) 
        elif len(cs) == lcslen: lcs.append(cs)
    return lcs

file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()

start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

Winston Ewert · Accepted Answer · 2013-01-09 20:31:38Z


Firstly, your algorithm is incorrect. Try:

lcs("AAAABCC","AAAACCB"). The LCS should be "AAAACC", but your algorithm finds "AAAAB".
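The failing case can be reproduced with a short sketch. Here `greedy_cs` is a port of the question's first version (renamed for illustration, with `range` instead of `xrange`), and `lcs_dp` is the textbook dynamic-programming LCS, swapped in only as a reference; neither name comes from the original post.

```python
def greedy_cs(xstr, ystr):
    """Port of the question's first version: greedy scan from each start in xstr."""
    if not (xstr and ystr):
        return None
    best, best_len = [''], 0
    for i in range(len(xstr)):
        cs, start = '', 0
        for item in xstr[i:]:
            index = ystr.find(item, start)  # first match at or after start
            if index != -1:
                cs += item
                start = index + 1
            if index == len(ystr) - 1:      # reached the end of ystr
                break
        if len(cs) > best_len:
            best, best_len = [cs], len(cs)
        elif len(cs) == best_len:
            best.append(cs)
    return best


def lcs_dp(xstr, ystr):
    """Reference LCS via the standard O(n*m) dynamic-programming table."""
    n, m = len(xstr), len(ystr)
    # table[i][j] holds the LCS length of xstr[i:] and ystr[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if xstr[i] == ystr[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    out = []
    i = j = 0
    while i < n and j < m:
        if xstr[i] == ystr[j]:
            out.append(xstr[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return ''.join(out)


print(greedy_cs("AAAABCC", "AAAACCB"))  # ['AAAAB']
print(lcs_dp("AAAABCC", "AAAACCB"))     # AAAACC
```

The greedy scan commits to the first B it can reach, which sits at the end of the second string, so the two C's before it are no longer available.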

Secondly, your algorithm is O(n^2*m), not O(n*m). Since you don't explain why you think your algorithm is theta(n*m), I can't really guess where your analysis has gone wrong.
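One way to probe the number of searches empirically is to instrument the loop structure of the first version and count calls to `find`. This is only a sketch: `count_find_calls` is a hypothetical helper, not part of the original code, and it mirrors the two nested loops and the early exit without building any subsequence.

```python
def count_find_calls(xstr, ystr):
    """Count how many times ystr.find runs in the first version's loops."""
    calls = 0
    for i in range(len(xstr)):
        start = 0
        for item in xstr[i:]:
            calls += 1
            index = ystr.find(item, start)
            if index != -1:
                start = index + 1
            if index == len(ystr) - 1:  # same early exit as the original
                break
    return calls


# With no common letters the early exit never triggers,
# so the count is n*(n+1)/2 for n = len(xstr).
print(count_find_calls("aaaaa", "bbb"))  # 15
```

The count alone does not settle the complexity, since the cost of each `find` depends on how far it has to scan from `start`.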

Your second version attempts to optimize the process of searching through the string by using a list of pre-calculated positions. This means you don't have to scan through all the positions in the string that hold different characters. However, you lose the ability to skip all the positions before your starting index. For long strings with few distinct characters, you end up losing out.
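One possible way to keep the pre-calculated position lists and still skip everything before the current index is a binary search over each list. The sketch below is not from the original answer; `build_positions` and `next_position` are hypothetical names, and it assumes the lists are built in ascending order, as they are in the question's second version.

```python
from bisect import bisect_right
from collections import defaultdict


def build_positions(ystr):
    """Map each character to the ascending list of its indices in ystr."""
    positions = defaultdict(list)
    for i, ch in enumerate(ystr):
        positions[ch].append(i)
    return positions


def next_position(positions, ch, after):
    """Return the smallest index of ch that is greater than `after`, or -1."""
    candidates = positions.get(ch)
    if not candidates:
        return -1
    k = bisect_right(candidates, after)  # first entry strictly greater than `after`
    return candidates[k] if k < len(candidates) else -1


positions = build_positions("BDCABA")
print(next_position(positions, "A", -1))  # 3
print(next_position(positions, "B", 3))   # 4
print(next_position(positions, "C", 4))   # -1
```

Each lookup then costs a logarithmic search in one list instead of a walk over every earlier occurrence of the character.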


I made a few changes to my algorithm, and now it passes your test case. I uploaded it here: pastebin.com/030Uhpcr. The only change is that it calls the function twice: first lcs(xstr, ystr) and then lcs(ystr, xstr). But I still think its complexity is theta(n*m), because it loops through the second string n times.

– Sajad Rastegar, commented Jan 10, 2013 at 13:46

@Rastegar, your algorithm is still incorrect. Try "AAAABCCD" and "AAAADCCB". As for complexity, ystr.find is called n*(n/2) times, or O(n^2). The complexity of ystr.find is O(m), thus the cost is O(n^2*m). It doesn't loop through the second string m times, because you've got two nested for loops there, not one.

– Winston Ewert, commented Jan 10, 2013 at 14:14

OK, it seems that my program certainly fails. But its complexity was theta(n*m), because ystr.find(item, start) doesn't start searching from the beginning of the list; it starts from start, where it found the common letter in the previous search. And after reaching the end of ystr, it exits from the second loop.

– Sajad Rastegar, commented Jan 10, 2013 at 15:55

@Rastegar, OK, I missed a subtlety in your algorithm. I thought start was being reset more than it was. So yes, it appears to be theta(n*m), but that's all moot because it doesn't work.

– Winston Ewert, commented Jan 10, 2013 at 18:11
