I've designed an algorithm to find the longest common subsequence.
These are steps:
Pick the first letter in the first string.
Look for it in the second string and if its found, Add that letter to
common_subsequenceand store its position inindex, Otherwise compare the length ofcommon_subsequencewith the length oflcsand if its greater, asign its value tolcs.Return to the first string and pick the next letter and repeat the previous step again, But this time start searching from
indexth letterRepeat this process until there is no letter in the first string to pick. At the end the value of
lcsis the Longest Common Subsequence.
This is an example:
X=A, B, C, B, D, A, B
Y=B, D, C, A, B, A
- Pick
Ain the first string. - Look for
AinY. - Now that there is an
Ain the second string, append it tocommon_subsequence. - Return to the first string and pick the next letter that is
B. - Look for
Bin the second string this time starting from the position ofA. - There is a
BafterAso append B tocommon_subsequence. - Now pick the next letter in the first string that is
C. There isn't aCnext toBin the second string. So assign the value of common_subsequence tolcsbecause its length is greater than the length oflcs.
Repeat the previous steps until reaching the end of the first string. In the end the value of lcs is the Longest Common Subsequence.
The complexity of this algorithm is \$\theta(n*m)\$.
I implemented it on two methods. The second one is using a hash table, but after implementation I found it's much slower compared to the first algorithm. I can't understand why.
The first algorithm:
import time
def lcs(xstr, ystr):
if not (xstr and ystr): return # if string is empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
for i in xrange(len(xstr)):
cs = '' # common subsequence
start = 0 # start position in ystr
for item in xstr[i:]:
index = ystr.find(item, start) # position at the common letter
if index != -1: # if common letter is found
cs += item # add common letter to the cs
start = index + 1
if index == len(ystr) - 1: break # if reached to the end of ystr
# updates lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
The second one using hash table:
import time
from collections import defaultdict
def lcs(xstr, ystr):
if not (xstr and ystr): return # if strings are empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
location = defaultdict(list) # keeps track of items in the ystr
i = 0
for k in ystr:
location[k].append(i)
i += 1
for i in xrange(len(xstr)):
cs = '' # common subsequence
index = -1
reached_index = defaultdict(int)
for item in xstr[i:]:
for new_index in location[item][reached_index[item]:]:
reached_index[item] += 1
if index < new_index:
cs += item # add item to the cs
index = new_index
break
if index == len(ystr) - 1: break # if reached to the end of ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed