3

If I have a collection of strings, is there a data structure or function that could improve the speed of checking whether any of the elements of the collection are substrings of my main string?

Right now I'm looping through my array of strings and using the in operator. Is there a faster way?

import timing  # custom timing helper (not part of the standard library), used for the measurements below

## string match in first do_not_scan
## 0:00:00.029332

## string not in do_not_scan
## 0:00:00.035179
def check_if_substring():
    for x in do_not_scan:
        if x in string:
            return True
    return False

## string match in first do_not_scan
## 0:00:00.046530

## string not in do_not_scan
## 0:00:00.067439
def index_of():
    for x in do_not_scan:
        try:
            string.index(x)
            return True
        except ValueError:
            # keep checking the remaining patterns instead of giving up early
            continue
    return False

## string match in first do_not_scan
## 0:00:00.047654

## string not in do_not_scan
## 0:00:00.070596
def find_def():
    for x in do_not_scan:
        if string.find(x) != -1:
            return True
    return False

string = '/usr/documents/apps/components/login'
do_not_scan = ['node_modules','bower_components']

for x in range(100000):
    find_def()
    index_of()
    check_if_substring()
4 Comments
  • Is it possible that you pasted something wrong here, or is string = 'a' just a sample? Because node_modules will never be in string. With that said, can you use a map where the keys are the items of do_not_scan? Then the search is O(1). Commented Mar 4, 2016 at 18:13
  • Just a sample to demonstrate that string may not contain any elements of do_not_scan. I've never used a map before; how would you go about doing that? (A sketch of the idea follows these comments.) Commented Mar 4, 2016 at 18:19
  • Do you want the analog of grep -l -Ff collections_of_strings main_string, where the collections_of_strings file contains a collection of strings (one per line) and the main_string file contains the main string (as is)? Commented Mar 5, 2016 at 16:04
  • I will edit; I meant: is there a better data structure, e.g. putting things in a set instead of an array, that would make the search faster? Commented Mar 6, 2016 at 2:50
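For illustration, here is a minimal sketch of the set idea raised in the comments above. The caveat: a set only gives O(1) exact-membership lookups, so this helps when the forbidden names line up with whole path segments; it is not a general substring search. The function name any_segment_blocked and the test paths are made up for the example.

# Hypothetical sketch of the commenters' set idea (exact segment matches only).
do_not_scan = {'node_modules', 'bower_components'}   # a set instead of a list

def any_segment_blocked(path):
    # split the path into segments and test each one against the set in O(1)
    return any(segment in do_not_scan for segment in path.split('/'))

print(any_segment_blocked('/usr/apps/node_modules/login'))           # True
print(any_segment_blocked('/usr/documents/apps/components/login'))   # False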

4 Answers

3

No, there is no faster built-in way.

If you have a large number of strings to test for, you may be better off using a third-party Aho-Corasick package, as J.F. Sebastian's answer shows.


Using the built-in methods, the worst case is when there is no match: you will have tested every item in the list against nearly every offset of the main string.

Fortunately, the in operator is very quick (at least in CPython); in my tests it was roughly two and a half to three times faster than the any(), str.find() and str.index() variants, and about eight times faster than the re-based version:

0.3364804992452264  # substring()
0.867534976452589   # any_substring()
0.8401796016842127  # find_def()
0.9342398950830102  # index_of()
2.7920695478096604  # re implementation

Here is the script I used for testing:

from timeit import timeit
import re

def substring():
    for x in do_not_scan:
        if x in string:
            return True
    return False

def any_substring():
    return any(x in string for x in do_not_scan)

def find_def():
    for x in do_not_scan:
        if string.find(x) != -1:
            return True
    return False

def index_of():
    for x in do_not_scan:
        try:
            string.index(x)
            return True
        except ValueError:
            # keep checking the remaining patterns instead of giving up early
            continue
    return False

def re_match():
    for x in do_not_scan:
        # re.search(pattern, text): search for x inside string, not the other way around
        if re.search(x, string):
            return True
    return False

string = 'a'
do_not_scan = ['node_modules','bower_components']

print(timeit('substring()', setup='from __main__ import substring'))
print(timeit('any_substring()', setup='from __main__ import any_substring'))
print(timeit('find_def()', setup='from __main__ import find_def'))
print(timeit('index_of()', setup='from __main__ import index_of'))
print(timeit('re_match()', setup='from __main__ import re_match'))

1 Comment

It is incorrect: you can do better than O(n*m); e.g., the Aho–Corasick algorithm is O(n + m) in time. grep may use it for fixed strings.
2

I don't have a large dataset to try, but maybe something like this will work (Python 3):

from builtins import any
import timeit

do_not_scan = ['node_modules', 'bower_components']
string = 'a'


def check_if_substring():
    return any(string in x for x in do_not_scan)


result = timeit.Timer("check_if_substring()", "from __main__ import check_if_substring")
count = 10000
print(result.timeit(count)/count)

Or the other way around:

def check_if_substring():
    return any(x in string for x in do_not_scan)

My results: 6.48119201650843e-07

3 Comments

Just curious -- why are you renaming any and why that way?
It was a copy and paste from old code; it doesn't make sense in this context. I'll fix it.
any is as slow as find_def and index_of.
2
def check():
    if any(w in string for w in do_not_scan):
        return True
    else:
        return False

Or simpler:

def check():
    return any(w in string for w in do_not_scan)

as mentioned by @Two-Bit Alchemist in the comments.

3 Comments

string match in first do_not_scan = 0:00:00.085493 | string not in do_not_scan = 0:00:00.074540
Simpler: return any(w in string for w in do_not_scan)
any is as slow as find_def and index_of.
2

Yes, there is a faster way to perform found = any(s in main_string for s in collection_of_strings): the Aho-Corasick algorithm improves the any()-based O(n*m*k) approach to an O(n + m*k) operation in time, where n is len(main_string), m is len(collection_of_strings), and k represents the individual lengths of the strings in the collection.

#!/usr/bin/env python
import noaho # $ pip install noaho

trie = noaho.NoAho()
for s in collection_of_strings:
    trie.add(s)
found = trie.find_short(main_string)[0] is not None
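
If the noaho package is not an option, a similar one-pass check could be written with the pyahocorasick package; this is only a sketch under the assumption that it fits the use case, with sample values standing in for collection_of_strings and main_string:

import ahocorasick  # $ pip install pyahocorasick

collection_of_strings = ['node_modules', 'bower_components']   # sample data
main_string = '/usr/documents/apps/components/login'           # sample data

automaton = ahocorasick.Automaton()
for s in collection_of_strings:
    automaton.add_word(s, s)        # store the pattern itself as the payload
automaton.make_automaton()          # build the Aho-Corasick automaton

# automaton.iter() yields (end_index, payload) for each match in a single pass
found = next(automaton.iter(main_string), None) is not None
print(found)  # False for this sample path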

Note: there is no point to measure the time performance on tiny strings such as string = 'a' if you are interested in Big-O behavior. Either use a more representative sample for the benchmark or you don't need a faster (asymptotically) algorithm in your case.
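
As a rough illustration of that note, one could time the any()-based approach on a larger generated sample along these lines (the alphabet, sizes and repeat count are arbitrary assumptions, and only the plain-Python variant is shown):

import random
import timeit

random.seed(0)  # reproducible sample
main_string = ''.join(random.choice('abcdefghij') for _ in range(100000))
collection_of_strings = [''.join(random.choice('abcdefghij') for _ in range(10))
                         for _ in range(1000)]

def any_based():
    # the O(n*m*k) approach from the question and the other answers
    return any(s in main_string for s in collection_of_strings)

print(timeit.timeit(any_based, number=10))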

2 Comments

Can you offer any guidance on where the cut-off is to use an Aho-Corasick algorithm over in?
Only your profiler knows. Constant factors depend on the quality of the implementation; e.g., str.translate() in Python 3.5+ on ASCII-only input may be 50 times faster than the same code on previous Python 3 versions.
