12

Assume I have a set of strings S and a query string q. I want to know if any member of S is a substring of q. (For the purpose of this question substring includes equality, e.g. "foo" is a substring of "foo".) For example assume the function that does what I want is called anySubstring:

S = ["foo", "baz"]
q = "foobar"
assert anySubstring(S, q)  # "foo" is a substring of "foobar"

S = ["waldo", "baz"]
assert not anySubstring(S, q)

Is there any easy-to-implement algorithm to do this with time complexity sublinear in len(S)? It's ok if S has to be processed into some clever data structure first because I will be querying each S with a lot of q strings, so the amortized cost of this preprocessing might be reasonable.

EDIT: To clarify, I don't care which member of S is a substring of q, only whether at least one is. In other words, I only care about a boolean answer.

1
  • 2
    Are the strings in S short? How does the length of the longest string in S compare with the length of S? Commented Mar 9, 2012 at 16:04

3 Answers 3

15

I think Aho-Corasick algorithm does what you want. I think there is another solution which is very simple to implement, it's Karp-Rabin algorithm.

Sign up to request clarification or add additional context in comments.

2 Comments

I cannot believe I am the only person to upvote the right answer.
And there's a nice #Python library for Aho-Corasick pypi.org/project/pyahocorasick ! :-D very useful answer!
2

So if the length of S is way less then the sum of the lengths of the potential substrings your best option would be to build a suffix tree from S and then do a search in it. This is linear with respect to the length of S plus the summar length of the candidate substrings. Of course there can not be an algorithm with better complexity as you have to pass through all the input at least. If the case is opposite i.e. the length of s is more then the summar length of the substrings your best option would be aho-corasick.

Hope this helps.

4 Comments

It fits for the other way around - querying many times with a single q, but in here - the S is constant, and q is changing...
@amit I think Aho-Corasick does what OP wants.. It operates much like your solution except that the trie has failure transitions that take you to the right position in the automaton.
@aelguindy: I was speaking only about the suffix tree approach, I am afraid I am not familiar with the mentioned algorithm, but I will read how it works later today.
@amit Yes, you are right regarding your comment about the suffix tree, I though you're commenting on the entire answer, sorry :-).
2

Create a regular expression .*(S1|S2|...|Sn).* and construct its minimal DFA.

Run your query string through the DFA.

3 Comments

@SaeedAmiri If OP does not care at all about the efficiency of the preprocessing, then once the DFA is created, all queries run in time linear in the length of q, independent of any aspect of S.
@aelguindy, suffix tree works in similar Order with less constant factor and less memory, in fact you should convert your DFA finally to something like trie.
@SaeedAmiri How do you check that your query string has the suffix tree string as a substring? Suffix trees should be built when the text is fixed not the patterns..

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.