12

I am trying to find the longest common subsequence between two strings.

I watched this tutoial https://www.youtube.com/watch?v=NnD96abizww

and wrote:

# Longest Common Subsequence

def lcs(s1, s2):
    matrix = [ [0 for x in range(len(s2))] for x in range(len(s1)) ]
    cs = ""
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i]==s2[j]:
                if i==0 or j==0:
                    matrix[i][j] = 1
                    cs += s1[i]
                else:
                    matrix[i][j] = matrix[i-1][j-1] + 1
                    cs += s1[i]
            else:
                if i==0 or j==0:
                    matrix[i][j] = 0
                else:
                    matrix[i][j] = max(matrix[i-1][j], matrix[i][j-1])

    return matrix[len(s1)-1][len(s2)-1], cs


print(lcs("abcdaf", "acbcf"))  



I get (3, 'abccaf')

This is clearly wrong it should be 4 abcf.

Not sure what step is going wrong. One general question is how long does it take usually for programmers to "get" these kind of questions?

1
  • 1
    Try to follow the pseudocode here more carefully: en.wikipedia.org/wiki/Longest_common_subsequence_problem I think the problem may have to do with not having the zero-index of the matrix left at zero to represent the empty intersection. You may also want to study the explanation about backtracking there. Commented Feb 6, 2018 at 22:27

4 Answers 4

13

There are 2 main problems with your code that cause the algorithm to output the wrong answer.

if i == 0 or j == 0 in line 16

Just following the video shows that this line makes no sense when s1[1] != s2[j], because the longest common subsequence of "ab" and "a" has length 1 although your algorithm sets matrix[0][1] = 0 for this example. So you need to remove this if statement. While your at it you have to consider what max(matrix[i-1][j], matrix[i][j-1]) does for i == 0 or j == 0. Now there are two different approaches:

  1. The explicit one:

    max(matrix[i-1][j] if i != 0 else 0, 
        matrix[i][j-1] if j != 0 else 0)
    
  2. The implicit one:

    max(matrix[i-1][j], matrix[i][j-1])
    

    This one works, because in Python negative indices are used to get the last item of a list and these items are 0 in this case.

cs += s1[i] in line 11/14

For example if you found that the longest common subsequence of "a" and "abcd" is "a", your algorithm sets the longest common subsequence for "a" and "abcda" as "aa", which doesn't make sense. I am struggling to explain why it does not work like that, so i suggest you look at a few examples, maybe using http://pythontutor.com/visualize.html

Solution

To approach both problems you can use the matrix to store the longest common subsequence you found for the smaller problems. You end up with this:

def lcs(s1, s2):
    matrix = [["" for x in range(len(s2))] for x in range(len(s1))]
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] == s2[j]:
                if i == 0 or j == 0:
                    matrix[i][j] = s1[i]
                else:
                    matrix[i][j] = matrix[i-1][j-1] + s1[i]
            else:
                matrix[i][j] = max(matrix[i-1][j], matrix[i][j-1], key=len)

    cs = matrix[-1][-1]

    return len(cs), cs

print(lcs("abcdaf", "acbcf"))  

This specific implementation only returns one possible result. You can try to implement an algorithm that gives all of the longest common sequences as an exercise. Maybe have a look at the Wikipedia page as suggested by גלעד ברקן

How long does it take to "get" why your code does not work?

There is obviously no clear answer. It always helps to think about examples and in the case of algorithms Wikipedia often has a good pseudocode, which you can base your implementations on. When you are familiar with the concepts and datastructures involved in the algorithm, you should be able to implement it within a day, I would say (but I am definitely no expert). In general searching for logical bugs in your code, can take multiple days, depending on the size of the code. To practice this kind of structured, algorithmic and mathematical thinking I can highly recommend projecteuler.net.

Sign up to request clarification or add additional context in comments.

5 Comments

This code appears to produce wrong answer for me. I have tested it with named tuples.
@ikamen Can you provide a short example for your problem? The code assumes that s1 and s2 are strings. Using collections.namedtuples as parameters is not intended, so it is entirely possible that this code breaks.
Well, if it is not intended to work on tuples, then there is nothing to fix there. My test was just passing two list with permutations of 5 named tuples, and comparing with right answer.
Doesn't always work if there are duplicates in input strings: "abcadaf" and "acabcf"
@EL What exactly does not work? I get 'abcf' as the result for your example. That is a common subsequence of length 4, there are of course others of length 4, but did you find any of length > 4?
12

For those who look for a built-in solution:

from difflib import SequenceMatcher

str_a = "xBCDxFGxxxKLMx"
str_b = "aBCDeFGhijKLMn"
s = SequenceMatcher(None, str_a, str_b)

lcs = ''.join([str_a[block.a:(block.a + block.size)] for block in s.get_matching_blocks()])
# lcs = 'BCDFGKLM'

3 Comments

Note: this does NOT actually return the longest common subsequence, but return a 'human readable' string diff. See docs for details: docs.python.org/3/library/difflib.html
Thank you for pointing that out, I had no idea. What if we use the find_longest_match on the SequenceMatcher instance, like in this example? --> tutorialspoint.com/…
This works for strings shorter than 200 characters. SequenceMatcher doesn't work for longer strings. I had to use BurningKarl's solution.
0

By just getting the length of LCS using following code segment, you will find the max length is 14, so BurningKarl's solution works for me.

def longestCommonSequence(s1, s2, s1Index, s2Index, arr):
    if s1Index ==-1 or s2Index == -1:
        return 0
    if(arr[s1Index][s2Index] != None):
        return arr[s1Index][s2Index]

    if s1[s1Index] == s2 [s2Index]:
        result = 1+ longestCommonSequence(s1, s2, s1Index -1, s2Index -1, arr)
    else:
        result= max(longestCommonSequence(s1, s2, s1Index -1, s2Index, arr), longestCommonSequence(s1, s2, s1Index, s2Index -1, arr))
    arr[s1Index][s2Index] = result
    return result   

s1="AAACCGTGAGTTATTCGTTCTAGAA" 
s2="CACCCCTAAGGTACCTTTGGTTC" 

a = [[None for i in range(len(s2))] for j in range(len(s1))]
print(longestCommonSequence(s1, s2, len(s1)-1, len(s2)-1, a))

Comments

-1
def subrange(i, strr) :
    a = i[len(i) - 1]
    for j in range(len(i) - 1, len(strr)) :
       if a == strr[j] :
          return j
    return 
def subseq(strr) :
    prev = [i for i in strr]
    nex1 = []
    nex = []
    a = set()
    for i in prev :
       a.add(i)
    while( len(prev[0])< len(strr)):
       for i in prev :
           for j in range(subrange(i,strr) + 1, len(strr)) :
              nex.append(i + strr[j])
       prev =  nex
       nex = []
       for i in prev :
           a.add(i)
    return a
a1 = []
a2 = []
sol = ""
maxx = 0
str1 = input()
str2 = "sai"
if(len(str1) == 0 or len(str2) == 0) :
   sol = "NULL"
else :
   a1 = list(subseq(str1))
   a2 = list(subseq(str2))
   for i in range(0, len(a1)) :
       for j in range(0, len(a2)) :
          if a1[i] == a2[j] :
             if len(a1[i]) > maxx :
                sol = a1[i]
             maxx = len(sol)
print(sol)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.