
I have a dictionary of words with their frequencies as follows.

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

I have a set of strings as follows.

recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam and fresh milk."

In the above string I have "biscuit pudding", "yummy tim tam" and "fresh milk" from the dictionary.

I am currently tokenizing the string to identify the words in the dictionary as follows.

words = recipes_book.split()
for word in words:
    if word in mydictionary:
        print("Match Found!")

However, this only works for single-word dictionary keys. I am therefore looking for the fastest way (my real recipes are very large texts) to identify dictionary keys that contain more than one word. Please help me.

    Maybe re.findall is what you are looking for. Or maybe some other function in regex library. Commented Oct 3, 2017 at 6:54

3 Answers

2

Build up your regex and compile it.

import re

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

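# Escape each key so regex metacharacters match literally, then join with "|" (alternation)
searcher = re.compile("|".join(map(re.escape, mydictionary)), flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    # re.I can match text in a different case, so normalise to the lowercase dictionary keys
    mydictionary[match.lower()] += 1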

Output after this

{'yummy tim tam': 4, 'biscuit pudding': 4, 'chocolates': 5, 'fresh milk': 3}
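
To make the searcher line more concrete, here is a minimal sketch of what the joined pattern does. It assumes the recipes_book string from the question; the use of re.escape and the variable names pattern and matches are my additions for illustration, not part of the answer above.

```python
import re

mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

# "|" means "or" in a regex, so joining the keys gives one pattern that matches any key.
# re.escape (an addition here) keeps keys containing characters such as "." or "+" literal.
pattern = "|".join(re.escape(key) for key in mydictionary)
searcher = re.compile(pattern, flags=re.I)

# findall scans the text once, left to right, and returns every non-overlapping match.
matches = searcher.findall(recipes_book)
print(matches)  # ['biscuit pudding', 'yummy tim tam', 'fresh milk']
```

Compiling once and scanning once is the point: the text is traversed a single time no matter how many keys there are.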

3 Comments

Thank you for the great answer. Can you please explain what exactly happens in the searcher line? It is not very clear to me.
Can just use '|'.join(mydictionary) there - the .format and .keys seems unnecessary...
@JonClements yep. Started with format because I envisioned using a named group but opted out and never simplified.
1

According to some tests, the "in" operator is faster than the "re" module:

What's a faster operation, re.match/search or str.find?

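Spaces are not a problem here, because "in" does a plain substring search on the whole text. Assuming mydictionary is static (predefined), I think you should invert the loop: iterate over the dictionary keys and check whether each one occurs in the text: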

for key in mydictionary.iterkeys():
    if key in recipes_book:
        print("Match Found!")
        mydictionary[key] += 1

In Python 2, iterkeys() gives you an iterator rather than a list, which is good practice. In Python 3, iterkeys() no longer exists, so iterate over the dict directly (for key in mydictionary:).
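
For Python 3, the same idea could look roughly like the sketch below. It assumes the recipes_book string from the question; the name found and the use of str.count to get the number of occurrences are illustrative additions, not something the tests linked above measured.

```python
mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

# Iterating over a dict yields its keys, so this runs on Python 3 without iterkeys().
# str.count returns the number of non-overlapping occurrences of the key in the text.
found = {key: recipes_book.count(key) for key in mydictionary if key in recipes_book}

for key, occurrences in found.items():
    mydictionary[key] += occurrences

print(found)
```

Note that this is a pure substring search: it is case-sensitive and ignores word boundaries, so a key such as 'milk' would also be found inside 'milkshake'.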

1 Comment

Just a heads up: in Python 3, dict objects no longer have an iterkeys method.
0

Try it the other way around: search for each piece of text you want to find in the large chunk of str data.

import re
for item in mydictionary:
    # re.escape guards against keys that contain regex metacharacters
    match = re.search(re.escape(item), recipes_book, flags=re.I | re.S)
    if match:
        start, end = match.span()
        print("Match found for %s between %d and %d character span" % (match.group(0), start, end))

6 Comments

Why would you want to run multiple regexes though, when you can search for any of the keys at once?
Yes, you are right!!! But what if you want to find all the patterns that match. "|" will give only one match at a time.
Well... re.findall('|'.join(mydictionary), recipes_book) seems to work for me... So if one had recipes_book as a Counter, you could then just do an .update on it...
Cool. You can get the same output through multiple solutions, and which one fits depends on the use case. In my case, I also wanted to find where exactly each match is (from which character to which character). re.findall doesn't give that; it just returns the list of all matches. So basically, you cannot say that one solution is better than another without knowing the use case. If re.findall works for you, it may be what others are looking for too... But I agree that re.findall is also a good solution if you only want to find the matches.
Sure - I was just going by what the person who asked the question was after... seems they were after occurrences rather than positions...
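
For what it's worth, both needs raised in these comments (every occurrence and its position) can be covered in a single pass with re.finditer. The sketch below is one possible way to do it, assuming the recipes_book string from the question; turning mydictionary into a Counter follows the suggestion above, and the name spans is hypothetical.

```python
import re
from collections import Counter

mydictionary = Counter({'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3})
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

# One alternation pattern for all keys; matching is case-sensitive here so that
# every matched string is guaranteed to be an existing key.
searcher = re.compile("|".join(re.escape(key) for key in mydictionary))

# finditer yields a match object per occurrence, so text and span are both available.
spans = [(m.group(0), m.start(), m.end()) for m in searcher.finditer(recipes_book)]

# Counter.update adds one to the count of each matched key.
mydictionary.update(text for text, _, _ in spans)

for text, start, end in spans:
    print("%s found between characters %d and %d" % (text, start, end))
```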