Algorithm to find common substring across N strings

Question

I'm familiar with LCS algorithms for 2 strings. Looking for suggestions for finding common substrings in 2..N strings. There may be multiple common substrings in each pair. There can be different common substrings in subsets of the strings.

strings: (ABCDEFGHIJKL) (DEF) (ABCDEF) (BIJKL) (FGH)

common strings:

1/2 (DEF)
1/3 (ABCDEF)
1/4 (IJKL)
1/5 (FGH)
2/3 (DEF)

longest common strings:

1/3 (ABCDEF)

most common strings:

1/2/3 (DEF)

Is this an ACM contest problem that requires an algorithm with a certain performance?
– Roman, Commented Mar 10, 2010 at 16:23
Wouldn't the substring 'F' be the most common, as it appears in four strings?
– interjay, Commented Mar 10, 2010 at 16:24
It would be a good idea to tell us why you need this, so we can understand where we can compromise and where we can't.
– amit kumar, Commented Mar 10, 2010 at 16:27
Roman - I'm not a student and this isn't for a contest :-). The application is to find common elements in a PDF content stream. interjay - I was ignoring single-character substrings.
– Dwight Kelly, Commented Mar 10, 2010 at 16:48

Rex Kerr · Accepted Answer · 2010-03-10 17:15:35Z

8

This sort of thing is done all the time in DNA sequence analysis. You can find a variety of algorithms for it. One reasonable collection is listed here.

There's also the brute-force approach of making tables of every substring (if you're interested only in short ones): form a tree that branches on each possible character at every level (26-way for letters, 256-way for ASCII), and store a histogram count at every node. If you prune off little-used nodes (to keep the memory requirements reasonable), you end up with an algorithm that finds all subsequences of length up to M in something like N*M^2*log(M) time for input of length N. If you instead split the input into K separate strings, you can build the tree structure and just read off the answer(s) in a single pass through the tree.
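As a rough illustration of the tree-of-substrings idea, here is a minimal Python sketch. It is not the answer's exact method: instead of a histogram count, each node keeps the set of 1-based string indices that contain its substring, and pruning happens when the tree is read rather than while it is built. The names `Node`, `build_trie`, `shared_substrings`, and the parameters `min_len`, `max_len`, and `min_owners` are all made up for this example.

```python
class Node:
    """One trie node; the path from the root spells a substring."""
    __slots__ = ("children", "owners")

    def __init__(self):
        self.children = {}   # next character -> Node
        self.owners = set()  # 1-based indices of strings containing this substring


def build_trie(strings, max_len):
    """Insert every substring of length <= max_len of every string."""
    root = Node()
    for idx, s in enumerate(strings, start=1):
        for start in range(len(s)):
            node = root
            for ch in s[start:start + max_len]:
                node = node.children.setdefault(ch, Node())
                node.owners.add(idx)
    return root


def shared_substrings(strings, min_len=2, max_len=12, min_owners=2):
    """Map each substring found in at least min_owners strings to its owners."""
    root = build_trie(strings, max_len)
    found = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node.children.items():
            if len(child.owners) < min_owners:
                # Prune: a longer substring can never have more owners
                # than its own prefix.
                continue
            text = prefix + ch
            if len(text) >= min_len:
                found[text] = sorted(child.owners)
            stack.append((child, text))
    return found


strings = ["ABCDEFGHIJKL", "DEF", "ABCDEF", "BIJKL", "FGH"]
found = shared_substrings(strings)
print(found["DEF"])         # [1, 2, 3]
print(max(found, key=len))  # ABCDEF
```

With the strings from the question, this reports `DEF` for strings 1/2/3 and `ABCDEF` for 1/3. Memory grows with the total input length times `max_len`, which is why the approach only suits short substrings.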


1 Comment

Larry Over a year ago

Came to say pretty much this: it's used in computational biology all the time. However, the definition of "substring/subsequence" is often ambiguous (unintentionally so, for non-algorithmists), and I think in this case his problem requires them to be contiguous.

luispedro · Accepted Answer · 2010-03-10 17:31:48Z

2

Suffix trees are the answer unless you have really large strings where memory becomes a problem. Expect 10 to 30 bytes of memory usage per character in the string for a good implementation. There are a couple of open-source implementations too, which make your job easier.

There are other, more succinct algorithms too, but they are harder to implement (look for "compressed suffix trees").
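A real suffix tree is too long to show here, so the sketch below swaps in a simpler relative: a sorted list of all suffixes, scanned with a sliding window. It is only meant to show how suffix-based indexing answers "longest substring shared by at least k strings". It stores every suffix as its own Python string, so memory is quadratic in the input length, unlike a proper suffix tree or suffix array. The function name `longest_shared_substring` and the parameter `min_strings` are invented for this example.

```python
from collections import Counter
from os.path import commonprefix  # character-wise common prefix of strings


def longest_shared_substring(strings, min_strings):
    """Longest substring that occurs in at least min_strings of the inputs."""
    # Every suffix of every string, tagged with the index of its source.
    suffixes = sorted(
        (s[i:], owner)
        for owner, s in enumerate(strings)
        for i in range(len(s))
    )
    best = ""
    counts = Counter()  # owner -> number of its suffixes inside the window
    left = 0
    for right, (_, owner) in enumerate(suffixes):
        counts[owner] += 1
        # While the window still covers enough distinct strings, the common
        # prefix of its first and last suffix is shared by all of them.
        while len(counts) >= min_strings:
            candidate = commonprefix([suffixes[left][0], suffixes[right][0]])
            if len(candidate) > len(best):
                best = candidate
            left_owner = suffixes[left][1]
            counts[left_owner] -= 1
            if counts[left_owner] == 0:
                del counts[left_owner]
            left += 1
    return best


strings = ["ABCDEFGHIJKL", "DEF", "ABCDEF", "BIJKL", "FGH"]
print(longest_shared_substring(strings, 2))  # ABCDEF
print(longest_shared_substring(strings, 3))  # DEF
```

Because suffixes with a common prefix sit next to each other once sorted, a window that spans suffixes from k different strings identifies a substring common to those k strings. A suffix tree exposes the same information through its internal nodes without the sorting cost.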
