
I have a large set of long text documents with punctuation. Three short examples are provided here:

doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]

and I have sets of words like the following:

wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])

I want to search all text documents that meet the following condition:

(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)

The or not condition in each parenthesis is needed as the three lists could be empty. Please notice that before applying the condition I also need to clean text from punctuation, transform it to lowercase, and split it into a set of words, which requires additional time.

This process would match the first text in doc but not the second and the third. Indeed, the second would not match as it contains the word "dogs" and the third because it does not include the word "seaside".
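The pre-processing and condition described above can be sketched as one function (the name `matches` and the punctuation-stripping approach are my own illustrative choices, not from the question):

```python
import string

def matches(text, wAND, wOR, wNOT):
    # Strip punctuation, lowercase, and split into a set of words
    words = set(
        text.translate(str.maketrans("", "", string.punctuation)).lower().split()
    )
    return (
        (bool(words & wOR) or not wOR)          # at least one OR word, if any
        and (wAND <= words or not wAND)         # all AND words, if any
        and (not (words & wNOT) or not wNOT)    # no NOT words
    )
```

With the three example documents this returns True, False, False, matching the expected behaviour; empty word sets fall through the `or not` branches.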

I am wondering if this general problem (with words in the wOR, wAND and wNOT lists changing) can be solved in a faster way, avoiding text pre-processing for cleaning. Maybe with a fast regex solution, that perhaps uses Trie(). Is that possible? or any other suggestion?

    Depending on your actual task sizes, it might be faster still to turn the sentences into bags (sets) of words; then the comparison becomes something like (w & wAND) == wAND and (wOR < w) and (not w & wNOT)... Commented May 18, 2022 at 11:22
  • Will the words in wAND, etc. always consist of letters/digits/underscores? Commented May 18, 2022 at 11:24
  • Thanks a lot @AKX, I tried that but the performance improvement was not super. Any other solution faster? Thanks for the suggestion so far! Commented May 18, 2022 at 11:25
  • @Wiktor yes, they will be letters, digits and underscores. Not including punctuation or anything stranger. Commented May 18, 2022 at 11:27
  • @Forinstance You should probably tell us the approaches you've tried, and your measured speeds for all of them. Commented May 18, 2022 at 11:28

2 Answers


Your solution appears to be linear in the length of the document - you won't be able to get any better than this without sorting, as the words you're looking for could be anywhere in the document. You could try using one loop over the entire doc:

def match(doc, wAND, wOR, wNOT):
    wAND = set(wAND)  # copy, since the loop consumes it
    or_satisfied = False
    for word in doc:  # doc: an iterable of cleaned words
        if word in wAND: wAND.remove(word)
        if not or_satisfied and word in wOR: or_satisfied = True
        if word in wNOT: return False
    return or_satisfied and not wAND

5 Comments

Thanks I will try this! Would this work even without cleaning text? Or should I work on clean pre-processed word sets? One problem of not cleaning the text and splitting it is that if a search word is part of another (like "sea" is part of "seaside"), it was wrongly matching it.
This one will still need the text to be cleaned.
Thanks. Will this work also in the case one or more of the 3 lists are empty?
It will if you add a check for wOR being empty. You can do this at initialisation - or_satisfied = len(wOR) != 0
Thanks! And no problems for empty wAND and wNOT right?
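Putting the answer and the comment thread's fix together, a complete sketch (the names `clean_doc` and `match_stream`, and the tokenizer itself, are illustrative assumptions):

```python
import string

def clean_doc(text):
    # Hypothetical tokenizer: strip punctuation, lowercase, split into words
    return text.translate(str.maketrans("", "", string.punctuation)).lower().split()

def match_stream(words, wAND, wOR, wNOT):
    pending = set(wAND)     # copy: the loop consumes it
    or_satisfied = not wOR  # an empty wOR is trivially satisfied
    for word in words:
        if word in wNOT:
            return False
        pending.discard(word)
        if not or_satisfied and word in wOR:
            or_satisfied = True
    return or_satisfied and not pending
```

Empty wAND and wNOT indeed need no special handling: an empty `pending` stays empty, and no word can ever be in an empty wNOT.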

You can build regexps for the word bags you have, and use them:

import re

def make_re(word_set):
    return re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
        flags=re.I,
    )


wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)

def re_match(doc):
    if not wOR_re.search(doc):
        return False
    if wNOT_re.search(doc):
        return False
    found = set()
    expected = len(wAND)
    for match in re.finditer(r'\w+', doc):
        word = match.group().lower()
        if word in wAND:
            found.add(word)
            if len(found) == expected:
                break
    return len(found) == expected

A quick timing test suggests this is about 89% faster than the original (and it passes the original "test suite"), likely because:

  • documents don't need to be cleaned (since the \bs limit matches to words and re.I deals with case normalization)
  • regexps are run in native code, which tends to always be faster than Python
name='original'      iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match'      iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match'     iters=10000 time=0.203 iters_per_sec=49363.58

where bag_match is my original comment suggestion of using set intersections:

def bag_match(doc):
    bag = set(clean_doc(doc))
    return (
        (bag.intersection(wOR) or not wOR) and
        (bag.issuperset(wAND) or not wAND) and
        (not bag.intersection(wNOT) or not wNOT)
    )
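For completeness: `bag_match` calls a `clean_doc` helper that isn't shown in the answer. A minimal version consistent with the question's pre-processing (my assumption, not the answerer's code) might be:

```python
import re

def clean_doc(doc):
    # Lowercase the text and pull out runs of word characters,
    # which drops punctuation like "!" and "#" along the way
    return re.findall(r'\w+', doc.lower())
```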

If you have already cleaned the documents into an iterable of words (here I just slapped @lru_cache on clean_doc, which you probably wouldn't do in real life, since your documents are likely all unique and caching wouldn't help), then bag_match is much faster:

name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60

3 Comments

Thanks much for all the explanation and tests, this really helps! I did some additional tests and indeed, among your solutions, bag_match seems to be the best. This has also the advantage to allow me more complex cleaning. One last question, as I have difficulties with time testing (sorry I am new :) ). Do you think that the bag_match solution is faster than the one proposed by autumnalblake?
bag_match (added my implementation above) seems to be a bit faster than theirs (53769 iter/s vs 48286 iter/s) with this data.
Thanks a lot for testing this further. Very last question: I suppose there is no performance difference between your first implementation of bag_match (that of your first comment) with that added to this question right? I am now using the solution in your first comment, not this very last one. That seems to work even without the "or not" statements.
