Fastest way to compare large strings in Python

Question

I have a dictionary of words with their frequencies as follows.

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

I have a set of strings as follows.

recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam and fresh milk."

In the above string I have "biscuit pudding", "yummy tim tam" and "fresh milk" from the dictionary.

I am currently tokenizing the string to identify the words in the dictionary as follows.

words = recipes_book.split()
for word in words:
    if word in mydictionary:
        print("Match Found!")

However, it only works for single-word dictionary keys. Hence, I am interested in the fastest way (my real recipes are very large texts) to identify the dictionary keys that contain more than one word. Please help me.

Maybe re.findall is what you are looking for. Or maybe some other function in the re module.
– Rockybilly, Commented Oct 3, 2017 at 6:54

sberry · Accepted Answer · 2017-10-03 14:56:38Z

2

Build up your regex and compile it.

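import re

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam and fresh milk."

# Join all keys into one alternation pattern; re.escape keeps regex metacharacters in a key literal.
searcher = re.compile("|".join(map(re.escape, mydictionary.keys())), flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    # re.I can return text in a different case than the key, so normalise before the lookup.
    mydictionary[match.lower()] += 1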

Output after this

{'yummy tim tam': 4, 'biscuit pudding': 4, 'chocolates': 5, 'fresh milk': 3}
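
To unpack the searcher line: the join glues the keys into one alternation pattern (key1|key2|...), so a single scan of the text tries every key at each position. The following is a minimal sketch rather than part of the original answer. The word-boundary anchors (\b), the longest-first sort, and the names keys, pattern and found are assumptions about what a slightly hardened version might look like.

```python
import re

mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

# Longest keys first, so a longer phrase is tried before a shorter key it starts with.
keys = sorted(mydictionary, key=len, reverse=True)

# re.escape makes each key literal; \b stops a key from matching inside a longer word.
pattern = r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b"
searcher = re.compile(pattern, flags=re.IGNORECASE)

# findall returns the matched text in the order it appears in recipes_book.
found = [m.lower() for m in searcher.findall(recipes_book)]
print(found)  # ['biscuit pudding', 'yummy tim tam', 'fresh milk']
```

Drop the \b anchors if matches inside longer words are actually wanted.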

edited Oct 3, 2017 at 14:56

answered Oct 3, 2017 at 7:19

sberry


3 Comments

user8566323 Over a year ago

Thank you for the great answer. Can you please tell me what exactly happens in the searcher line? It is not very clear to me.

Jon Clements Over a year ago

Can just use '|'.join(mydictionary) there - the .format and .keys seems unnecessary...

sberry Over a year ago

@JonClements yep. Started with format because I envisioned using a named group but opted out and never simplified.

J_Zar · Accepted Answer · 2017-10-03 07:26:43Z

1

According to some tests, the "in" keyword is faster than the "re" module:

What's a faster operation, re.match/search or str.find?

Spaces are not a problem here. Assuming mydictionary is static (predefined), I think you should probably do the inverse and look for each key in the text:

for key in mydictionary:  # iterating a dict yields its keys in Python 2 and 3
    if key in recipes_book:
        print("Match Found!")
        mydictionary[key] += 1

In Python 2, mydictionary.iterkeys() gives the same lazy iteration over the keys. That method was removed in Python 3, where you iterate over the dict directly as above.
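
Note that "key in recipes_book" only reports whether a key is present, so the loop above adds at most 1 per key. If every occurrence should be counted, str.count is one option. This is a minimal sketch, not the answerer's code: the variable name occurrences is my own, and the comparison is case-sensitive and will also match inside longer words.

```python
mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

for key in mydictionary:
    # str.count returns the number of non-overlapping occurrences of key in the text.
    occurrences = recipes_book.count(key)
    if occurrences:
        mydictionary[key] += occurrences

print(mydictionary)
```

Each key triggers its own pass over the text, so this stays cheap for a handful of keys but scales with the number of keys.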

edited Oct 3, 2017 at 7:26

answered Oct 3, 2017 at 7:03

J_Zar


1 Comment

theBuzzyCoder Over a year ago

Just a heads up: in Python 3, dict no longer has an iterkeys method.

theBuzzyCoder · Accepted Answer · 2017-10-03 07:01:52Z

0

Try it the other way around: search for each key in the large chunk of str data.

import re

for item in mydictionary:
    # Escape the key so any regex metacharacters in it are matched literally.
    match = re.search(re.escape(item), recipes_book, flags=re.I | re.S)
    if match:
        start, end = match.span()
        print("Match found for %s between %d and %d character span" % (match.group(0), start, end))
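
If positions are needed for every occurrence rather than only the first one per key, re.finditer on a single combined pattern may be worth a try. The sketch below is an illustration under that assumption, not the answerer's code; the names searcher and spans are my own.

```python
import re

mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

# One alternation over all keys, so the text is scanned once.
searcher = re.compile("|".join(re.escape(key) for key in mydictionary), flags=re.I)

spans = []
for match in searcher.finditer(recipes_book):
    # Each match object carries both the matched text and its character span.
    start, end = match.span()
    spans.append((match.group(0), start, end))
    print("Match found for %s between %d and %d character span" % (match.group(0), start, end))
```

Unlike re.search, finditer keeps going after the first hit, so repeated keys are reported once per occurrence.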

answered Oct 3, 2017 at 7:01

theBuzzyCoder


6 Comments

Jon Clements Over a year ago

Why would you want to run multiple regexes though, when you can search for any of the keys at once?

theBuzzyCoder Over a year ago

Yes, you are right!!! But what if you want to find all the patterns that match? "|" will give only one match at a time.

Jon Clements Over a year ago

Well... re.findall('|'.join(mydictionary), recipes_book) seems to work for me... So if one had mydictionary as a Counter, you could then just do an .update on it...

theBuzzyCoder Over a year ago

Cool. You can get the same output through multiple solutions, and the right one depends on the use case. In my case, I also wanted to find exactly where the match is (from which character to which character). re.findall doesn't give that; it just returns the list of all matches. So basically, you cannot say that one solution is better than another without knowing the use case. If re.findall works for you, it may be what others are looking for... But I agree that re.findall is also a good solution if you only want the matches.

Jon Clements Over a year ago

Sure - I was just going by what the person who asked the question was after... seems they were after occurrences rather than positions...
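
For anyone who only needs occurrence counts, the Counter idea from the comments could look roughly like this. It is a minimal sketch under the assumption that the frequencies are kept in a collections.Counter; the names counts and matches are hypothetical, and the match is case-sensitive so that every hit is an existing key.

```python
import re
from collections import Counter

counts = Counter({'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3})
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

# One pass over the text; each element of matches is exactly one of the keys.
matches = re.findall("|".join(re.escape(key) for key in counts), recipes_book)

# Counter.update with an iterable adds 1 for every element it contains.
counts.update(matches)
print(counts)
```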
