1

Would like to discuss algorithms, no code.

Problem: Let S and T be two sequences of elements. Find the common subsequences between them where the order of the elements is preserved.

It should have O(n + m) running time where n is the length of S, and m is the length of T. I would also like to make the assumption that for the most part the two sequences will be similar.

An optimal solution?: After some research, one solution that appears to be optimal is to first build a generalised suffix tree for the two sequences. Then find the longest common substring and consider this subsequence to be part of the solution. Then either remove this subsequence from the tree or build a new suffix tree with this subsequence removed from the two original sequences to form S' and T'. Then find the longest common substring between S' and T', and so on.

To analyze the running time, building the tree takes O(n) and you can find the lengths and starting positions of the longest common substrings of S and T in O(n + m).

Are there other (more) practical solutions that someone knows of or can link to? Any published papers considering the same or related problem you all know about? Input and constructive criticism about the above solution? Thanks for all your time!

4
  • 1
    You may want to ask this in cstheory.stackexchange.com Commented Sep 26, 2011 at 5:25
  • @SaeedAmiri Firstly, when you say LCS, are you referring to the longest common substring as my problem discusses, or longest common sequence which is a similar yet different problem? Second, the longest common substring problem can be solved in O(nm) using dynamic programming but by using a suffix tree as I stated in my solution, it is solvable in O(n+m) and in fact can be improved to be solvable in linear time! Commented Sep 26, 2011 at 5:58
  • @mcorley, Yeah you are right I think you mean longest common subsequence (as your problem title) and I didn't read it carefully. Commented Sep 26, 2011 at 6:03
  • assume strings like aaaaabbbbcccdde and eddcccbbbbaaaaa in such a way your construction takes sqrt(n) * n not n to find all substrings. (assume alphabet has at least n variable, in other case you can find similar contradiction.) Commented Sep 26, 2011 at 6:25

1 Answer 1

1

My first thought was the use of a suffix tree, and relating it to the LCS problem. But I am not sure what a better solution would be off the top of my head. I did a quick search and came across a few papers and projects that might be useful, but no guarantees.

http://dl.acm.org/citation.cfm?id=1625377 (direct link here I believe: http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-101.pdf)

http://code.google.com/p/all-common-subsequences/

Sorry, it has been a long day and I am not quite awake enough to attempt a better solution myself.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.