I have 300K strings stored in a list, and the length of each string is between 10 and 400 characters. I want to remove every string that is a substring of another string (shorter strings are more likely to be substrings of others).
Currently, I first sort the 300K strings by length and then use the method below:
sorted_string = sorted(string_list, key=len, reverse=True)
result = []
for item in sorted_string:  # longest strings first
    # keep item only if no already-kept (longer) string contains it
    if not any(item in kept for kept in result):
        result.append(item)
This method performs O(n^2) substring checks, and each check itself takes time proportional to the string lengths. Since I have 300K strings, I am not satisfied with its running time.
I have tried to divide the sorted strings into chunks and use multiprocessing to filter each chunk in parallel (see the sketch below). My first thought was to put the first 10K strings into the first chunk, the next 10K into the second chunk, and so on. But with this split, the strings in each chunk have similar lengths, so they are unlikely to be substrings of one another. So this is not a good divide strategy.
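For reference, a minimal sketch of that chunked attempt, assuming a hypothetical helper filter_chunk and the sorted_string list from the code above; note that it only tests containment within a chunk, which is exactly the weakness described:

from multiprocessing import Pool

def filter_chunk(chunk):
    # hypothetical per-chunk filter: drop strings contained in a longer
    # string from the same chunk; matches across chunks are never found
    chunk = sorted(chunk, key=len, reverse=True)
    kept = []
    for s in chunk:
        if not any(s in t for t in kept):
            kept.append(s)
    return kept

if __name__ == "__main__":
    chunks = [sorted_string[i:i + 10000]
              for i in range(0, len(sorted_string), 10000)]
    with Pool() as pool:
        partial = pool.map(filter_chunk, chunks)
    survivors = [s for part in partial for s in part]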
Any good ideas?
Edit: these strings represent DNA sequences and contain only the characters 'g', 'c', 't' and 'a'.
Update:
I have tried building a suffix tree using the code from https://github.com/kvh/Python-Suffix-Tree, which builds the suffix tree with Ukkonen's algorithm.
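The input I feed to the builder is the concatenation of all sequences, joined with a separator character that cannot occur inside a DNA string, so that no match can span two sequences (using '#' here is my own choice, not something the library requires):

# deduplicate, then join with a character outside the g/c/t/a alphabet
text = "#".join(set(string_list))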
The total length of the concatenated string is about 90,000,000 characters, which is very large. The program has been running for half an hour and has processed only about 3,000,000 characters (1/30 of the total). I am not satisfied with this program.
Is there any other suffix tree construction algorithm or implementation that can handle a string this large?
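For concreteness, here is the check I ultimately need, sketched with a suffix automaton rather than a suffix tree. It is a related linear-size index, not what the library above builds, and the class below is my own minimal sketch. After deduplication, a string is a proper substring of another exactly when it occurs in the joined text more than once, since its own standalone copy accounts for one occurrence:

class SuffixAutomaton:
    """Minimal sketch of a suffix automaton with occurrence counts."""

    def __init__(self, text):
        self.nxt = [{}]     # outgoing transitions of each state
        self.link = [-1]    # suffix links
        self.length = [0]   # length of the longest string in each state
        self.cnt = [0]      # occurrence counts, finalized after construction
        last = 0
        for ch in text:
            cur = self._new_state(self.length[last] + 1, 1)
            p = last
            while p != -1 and ch not in self.nxt[p]:
                self.nxt[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.nxt[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    clone = self._new_state(self.length[p] + 1, 0)
                    self.nxt[clone] = dict(self.nxt[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.nxt[p].get(ch) == q:
                        self.nxt[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur
        # propagate occurrence counts along suffix links, longest states first
        for s in sorted(range(len(self.length)),
                        key=lambda i: self.length[i], reverse=True):
            if self.link[s] >= 0:
                self.cnt[self.link[s]] += self.cnt[s]

    def _new_state(self, length, cnt):
        self.nxt.append({})
        self.link.append(-1)
        self.length.append(length)
        self.cnt.append(cnt)
        return len(self.length) - 1

    def occurrences(self, pattern):
        # number of positions where pattern occurs in the indexed text
        s = 0
        for ch in pattern:
            if ch not in self.nxt[s]:
                return 0
            s = self.nxt[s][ch]
        return self.cnt[s]

unique = set(string_list)          # string_list as defined above
sa = SuffixAutomaton("#".join(unique))
# one occurrence = only its own standalone copy; two or more means the
# string also appears inside another sequence ('#' blocks cross matches)
result = [s for s in unique if sa.occurrences(s) == 1]

This runs in time linear in the total text length, but the per-state dictionaries make pure Python memory-hungry at 90,000,000 characters, so I mean it as a sketch of the computation rather than a final tool.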