Algorithm: Find all common substrings between two strings where order is preserved

Question

Would like to discuss algorithms, no code.

Problem: Let S and T be two sequences of elements. Find the common subsequences between them where the order of the elements is preserved.

It should have O(n + m) running time where n is the length of S, and m is the length of T. I would also like to make the assumption that for the most part the two sequences will be similar.

An optimal solution?: After some research, one solution that appears to be optimal is to first build a generalised suffix tree for the two sequences. Then find the longest common substring and consider this subsequence to be part of the solution. Then either remove this subsequence from the tree or build a new suffix tree with this subsequence removed from the two original sequences to form S' and T'. Then find the longest common substring between S' and T', and so on.

To analyze the running time, building the tree takes O(n) and you can find the lengths and starting positions of the longest common substrings of S and T in O(n + m).

Are there other (more) practical solutions that someone knows of or can link to? Any published papers considering the same or related problem you all know about? Input and constructive criticism about the above solution? Thanks for all your time!

@SaeedAmiri Firstly, when you say LCS, are you referring to the longest common substring as my problem discusses, or longest common sequence which is a similar yet different problem? Second, the longest common substring problem can be solved in O(nm) using dynamic programming but by using a suffix tree as I stated in my solution, it is solvable in O(n+m) and in fact can be improved to be solvable in linear time! — mcorley
– mcorley, Commented Sep 26, 2011 at 5:58
@mcorley, Yeah you are right I think you mean longest common subsequence (as your problem title) and I didn't read it carefully. — Saeed Amiri
– Saeed Amiri, Commented Sep 26, 2011 at 6:03
assume strings like aaaaabbbbcccdde and eddcccbbbbaaaaa in such a way your construction takes sqrt(n) * n not n to find all substrings. (assume alphabet has at least n variable, in other case you can find similar contradiction.) — Saeed Amiri
– Saeed Amiri, Commented Sep 26, 2011 at 6:25

Travis · Accepted Answer · 2011-09-26 05:33:58Z

1

My first thought was the use of a suffix tree, and relating it to the LCS problem. But I am not sure what a better solution would be off the top of my head. I did a quick search and came across a few papers and projects that might be useful, but no guarantees.

http://dl.acm.org/citation.cfm?id=1625377 (direct link here I believe: http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-101.pdf)

http://code.google.com/p/all-common-subsequences/

Sorry, it has been a long day and I am not quite awake enough to attempt a better solution myself.

answered Sep 26, 2011 at 5:33

Travis

6791 gold badge13 silver badges29 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Algorithm: Find all common substrings between two strings where order is preserved

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related