1

Lets say I've loaded some information from a file into a Python3 dict and the result looks like this.

d = {
    'hello' : ['hello', 'hi', 'greetings'],
    'goodbye': ['bye', 'goodbye', 'adios'],
    'lolwut': ['++$(@$(@%$(@#*', 'ASDF #!@# TOW']
}

Let's say I'm going to analyze a bunch, I mean an absolute ton, of strings. If a string contains any of the values for a given key of d, then I want to categorize it as being in that key.

For example...

'My name is DDP, greetings' => 'hello'

Obviously I can loop through the keys and values like this...

def classify(s, d):
    for k, v in d.items():
        if any([x in s for x in v]):
            return k

    return ''

But I want to know if there's a more efficient algorithm for this kind of bulk searching; more efficient than my naive loop. Is anyone aware of such an algorithm?

4
  • This question is kind of opinion based, but the most effecient would be to presort them. Then just use the fastest algorithm for search a sorted list Commented Feb 19, 2020 at 20:25
  • Presort what? If I was looking for an exact match, I could presort the dictionary's values, but I'm checking if any of them are substrings. Commented Feb 19, 2020 at 20:41
  • presort the dictionary so the search in it would be faster, but i guess that is irrelavant since python has the in command Commented Feb 19, 2020 at 21:08
  • just forget what i said. Commented Feb 19, 2020 at 21:08

1 Answer 1

1

You can use regex to avoid extra operations. Here all you need is to join the words with a pip character and pass it to re.search(). Since the order or the exact word is not important to you this way you can find out if there's any intersection between any of those values and the given string.

import re

def classify(s, d):
    for k, v in d.items():
        regex = re.compile(re.escape(r'|'.join(v)))
        if regex.search(s):
            return k

Also note that you can, instead of returning k yield it to get an iterator of all occurrences or use a dictionary to store them, etc.

Sign up to request clarification or add additional context in comments.

4 Comments

I like your idea's principle, but that specific implementation doesn't appear to handle nasty looking strings. eg. d['lolwut'] = ['123!@#%%^&*)()))'] will tell me that my regex has unbalanced parenthesis. I don't need regex, I'm just looking for substring.
@DeepDeadpool It doesn't make sense to have that string given the example you presented lol however you can use re.escape() to escape special characters. Check out the update.
Neat - I'll check it out
Seems to work - thanks for the update. I'll wait a few days to see if any other people offer other solutions.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.