I have 300K strings stored in a list, and the length of each string is between 10 and 400 characters. I want to remove every string that is a substring of another string (shorter strings are more likely to be substrings of others).
Currently, I first sort the 300K strings by length and then use the method below:
sorted_string = sorted(string_list, key=len, reverse=True)
result = []
for item in sorted_string:  # longest strings first
    # keep item only if no already-kept (longer) string contains it
    if not any(item in kept for kept in result):
        result.append(item)
This method performs O(n^2) substring checks, and each check itself takes time proportional to the string lengths. Since I have 300K strings, I am not satisfied with its running time.
I have tried to divide the sorted strings into chunks and use multiprocessing to filter each chunk in parallel (see the sketch below). My first thought was to put the first 10K strings into the first chunk, the next 10K into the second chunk, and so on. But with this split, the strings in each chunk have similar lengths, so they are unlikely to be substrings of one another. So this is not a good divide strategy.
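For reference, a minimal sketch of that chunked attempt, assuming a hypothetical helper filter_chunk and the sorted_string list from the code above; note that it only tests containment within a chunk, which is exactly the weakness described:

from multiprocessing import Pool

def filter_chunk(chunk):
    # hypothetical per-chunk filter: drop strings contained in a longer
    # string from the same chunk; matches across chunks are never found
    chunk = sorted(chunk, key=len, reverse=True)
    kept = []
    for s in chunk:
        if not any(s in t for t in kept):
            kept.append(s)
    return kept

if __name__ == "__main__":
    chunks = [sorted_string[i:i + 10000]
              for i in range(0, len(sorted_string), 10000)]
    with Pool() as pool:
        partial = pool.map(filter_chunk, chunks)
    survivors = [s for part in partial for s in part]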
Any good ideas?
Edit: these strings represent DNA sequences and contain only the characters 'g', 'c', 't' and 'a'.
Update:
I have tried building a suffix tree using the code from https://github.com/kvh/Python-Suffix-Tree, which builds the suffix tree with Ukkonen's algorithm.
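The input I feed to the builder is the concatenation of all sequences, joined with a separator character that cannot occur inside a DNA string, so that no match can span two sequences (using '#' here is my own choice, not something the library requires):

# deduplicate, then join with a character outside the g/c/t/a alphabet
text = "#".join(set(string_list))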
The total length of the concatenated string is about 90,000,000 characters, which is very large. The program has been running for half an hour and has processed only about 3,000,000 characters (1/30 of the total). I am not satisfied with this program.
Is there any other suffix tree construction algorithm or implementation that can handle a string this large?
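For concreteness, here is the check I ultimately need, sketched with a suffix automaton rather than a suffix tree. It is a related linear-size index, not what the library above builds, and the class below is my own minimal sketch. After deduplication, a string is a proper substring of another exactly when it occurs in the joined text more than once, since its own standalone copy accounts for one occurrence:

class SuffixAutomaton:
    """Minimal sketch of a suffix automaton with occurrence counts."""

    def __init__(self, text):
        self.nxt = [{}]     # outgoing transitions of each state
        self.link = [-1]    # suffix links
        self.length = [0]   # length of the longest string in each state
        self.cnt = [0]      # occurrence counts, finalized after construction
        last = 0
        for ch in text:
            cur = self._new_state(self.length[last] + 1, 1)
            p = last
            while p != -1 and ch not in self.nxt[p]:
                self.nxt[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.nxt[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    clone = self._new_state(self.length[p] + 1, 0)
                    self.nxt[clone] = dict(self.nxt[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.nxt[p].get(ch) == q:
                        self.nxt[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur
        # propagate occurrence counts along suffix links, longest states first
        for s in sorted(range(len(self.length)),
                        key=lambda i: self.length[i], reverse=True):
            if self.link[s] >= 0:
                self.cnt[self.link[s]] += self.cnt[s]

    def _new_state(self, length, cnt):
        self.nxt.append({})
        self.link.append(-1)
        self.length.append(length)
        self.cnt.append(cnt)
        return len(self.length) - 1

    def occurrences(self, pattern):
        # number of positions where pattern occurs in the indexed text
        s = 0
        for ch in pattern:
            if ch not in self.nxt[s]:
                return 0
            s = self.nxt[s][ch]
        return self.cnt[s]

unique = set(string_list)          # string_list as defined above
sa = SuffixAutomaton("#".join(unique))
# one occurrence = only its own standalone copy; two or more means the
# string also appears inside another sequence ('#' blocks cross matches)
result = [s for s in unique if sa.occurrences(s) == 1]

This runs in time linear in the total text length, but the per-state dictionaries make pure Python memory-hungry at 90,000,000 characters, so I mean it as a sketch of the computation rather than a final tool.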