Note: This question isn't specifically related to Python. I just used it as a substitute for pseudo-code here.
Given an array A containing N strings with an average length of M, I want to create a new array B that only contains those strings in A that are not a substring (or an identical copy) of any other string in A. Here is an example:
A = [ 'foo', 'bar', 'foobar', 'foobar' ]
B = [ 'foobar' ]
I'm specifically looking for the most efficient way of doing this in terms of time complexity. The naive approach would look like this:
B = []
for i in range(len(A)):
    noSubstring = True
    for j in range(len(A)):
        # A[i] is dropped if it occurs inside some other element;
        # for identical copies, only the last occurrence is kept.
        if A[i] in A[j] and (A[i] != A[j] or j > i):
            noSubstring = False
            break
    if noSubstring:
        B.append(A[i])
and has a time complexity of O(N^2 * M^2). Is there anything I could do to speed this up?
I've been thinking about using a dedicated data structure for efficiently encoding and reusing the string sequences. For instance, if I only wanted to remove strings that are a prefix of another string in the array, I could build a trie / prefix tree (O(N*M)) and then collect all the leaf elements (another O(N*M)). So far, however, I have failed to adapt this approach to the more general substring problem.
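To make this concrete, here is a rough sketch of the prefix-only variant I have in mind (a plain dict-based trie; keep_non_prefixes and the END sentinel are just illustrative names):

def keep_non_prefixes(strings):
    END = object()                        # sentinel marking "a string terminates at this node"
    root = {}
    for s in strings:                     # build the trie in O(N*M)
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    kept = []
    for s in dict.fromkeys(strings):      # second O(N*M) pass; dict.fromkeys also drops duplicates
        node = root
        for ch in s:
            node = node[ch]
        if all(key is END for key in node):   # no other string extends s
            kept.append(s)
    return kept

For the example above this would return ['bar', 'foobar'], because 'bar' is a substring but not a prefix of 'foobar'; that is exactly the gap I haven't been able to close.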
Use a set for A, since you don't want duplicates anyway.
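If the order of the surviving strings doesn't matter, that pre-deduplication step could be as simple as:

A = list(set(A))   # drop identical copies up front; note that set() does not preserve order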