Efficient group substring search in Python?

Question

Let's say I've loaded some information from a file into a Python 3 dict, and the result looks like this:

d = {
    'hello' : ['hello', 'hi', 'greetings'],
    'goodbye': ['bye', 'goodbye', 'adios'],
    'lolwut': ['++$(@$(@%$(@#*', 'ASDF #!@# TOW']
}

Let's say I'm going to analyze a bunch, I mean an absolute ton, of strings. If a string contains any of the values for a given key of d, then I want to categorize it as being in that key.

For example...

'My name is DDP, greetings' => 'hello'

Obviously I can loop through the keys and values like this...

def classify(s, d):
    for k, v in d.items():
        if any([x in s for x in v]):
            return k

    return ''

But I want to know if there's a more efficient algorithm for this kind of bulk searching; more efficient than my naive loop. Is anyone aware of such an algorithm?

This question is kind of opinion-based, but the most efficient approach would be to presort them. Then just use the fastest algorithm for searching a sorted list.
– DontBe3Greedy, Commented Feb 19, 2020 at 20:25
Presort what? If I were looking for an exact match, I could presort the dictionary's values, but I'm checking whether any of them are substrings.
– Raven, Commented Feb 19, 2020 at 20:41
Presort the dictionary so that searching it would be faster, but I guess that is irrelevant since Python has the in operator.
– DontBe3Greedy, Commented Feb 19, 2020 at 21:08

Kasravnd · Accepted Answer · 2020-02-19 20:52:35Z

You can use a regex to avoid extra operations. All you need is to join the words with a pipe character (|) and pass the result to re.search(). Since neither the order nor the exact word matters to you, this tells you whether any of those values occurs in the given string.

import re

def classify(s, d):
    for k, v in d.items():
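        # Escape each value on its own so that '|' still acts as alternation;
        # escaping the joined string would turn '|' into a literal character.
        regex = re.compile('|'.join(re.escape(x) for x in v))
        if regex.search(s):
            return k

    return ''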

Also note that instead of returning k you can yield it to get an iterator over all matching keys, or use a dictionary to store them, etc.
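As a rough sketch of the yield variant mentioned above: the patterns can be compiled once up front instead of on every call, which should help when classifying a very large number of strings. The names build_patterns and classify_all are hypothetical, chosen only for this illustration, and each value is escaped individually before joining.

```python
import re

def build_patterns(d):
    """Compile one regex per key; each value is escaped on its own."""
    return {
        k: re.compile('|'.join(re.escape(x) for x in v))
        for k, v in d.items()
        if v  # skip empty lists: an empty pattern would match every string
    }

def classify_all(s, patterns):
    """Yield every key whose pattern occurs somewhere in s."""
    for k, regex in patterns.items():
        if regex.search(s):
            yield k

d = {
    'hello': ['hello', 'hi', 'greetings'],
    'goodbye': ['bye', 'goodbye', 'adios'],
    'lolwut': ['++$(@$(@%$(@#*', 'ASDF #!@# TOW'],
}

# Build the patterns once, then reuse them for every string.
patterns = build_patterns(d)
print(list(classify_all('My name is DDP, greetings and goodbye', patterns)))
```

Whether this is actually faster than the plain in loop depends on the data, so it is worth timing both on a realistic sample.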

edited Feb 19, 2020 at 20:52

answered Feb 19, 2020 at 20:27


4 Comments

Raven Over a year ago

I like the principle of your idea, but that specific implementation doesn't appear to handle nasty-looking strings, e.g. d['lolwut'] = ['123!@#%%^&*)()))'] will tell me that my regex has unbalanced parentheses. I don't need regex, I'm just looking for substrings.

Kasravnd Over a year ago

@DeepDeadpool It doesn't make sense to have that string given the example you presented, lol. However, you can use re.escape() to escape special characters. Check out the update.

Raven Over a year ago

Neat - I'll check it out

Raven Over a year ago

Seems to work - thanks for the update. I'll wait a few days to see if any other people offer other solutions.
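For reference, a small check of the escaping issue discussed in these comments, using the string from the first comment. This is only an illustration; the variable names are made up for the example.

```python
import re

nasty = '123!@#%%^&*)()))'

# Compiling the raw string as a pattern fails because of the unmatched ')'.
try:
    re.compile(nasty)
    raw_ok = True
except re.error:
    raw_ok = False

# After re.escape(), every special character is matched literally.
escaped = re.compile(re.escape(nasty))
found = escaped.search('prefix ' + nasty + ' suffix') is not None

print(raw_ok, found)
```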
