
I have a large set of long text documents with punctuation. Three short examples are provided here:

doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]

and I have sets of words like the following:

wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])

I want to search all text documents that meet the following condition:

(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)

The or not condition in each parenthesis is needed as the three lists could be empty. Please notice that before applying the condition I also need to clean text from punctuation, transform it to lowercase, and split it into a set of words, which requires additional time.

This process would match the first text in doc but not the second and the third. Indeed, the second would not match as it contains the word "dogs" and the third because it does not include the word "seaside".
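The pre-processing and condition described above can be sketched as one function (the name `matches` and the punctuation-stripping approach are my own illustrative choices, not from the question):

```python
import string

def matches(text, wAND, wOR, wNOT):
    # Strip punctuation, lowercase, and split into a set of words
    words = set(
        text.translate(str.maketrans("", "", string.punctuation)).lower().split()
    )
    return (
        (bool(words & wOR) or not wOR)          # at least one OR word, if any
        and (wAND <= words or not wAND)         # all AND words, if any
        and (not (words & wNOT) or not wNOT)    # no NOT words
    )
```

With the three example documents this returns True, False, False, matching the expected behaviour; empty word sets fall through the `or not` branches.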

I am wondering if this general problem (with words in the wOR, wAND and wNOT lists changing) can be solved in a faster way, avoiding text pre-processing for cleaning. Maybe with a fast regex solution, that perhaps uses Trie(). Is that possible? or any other suggestion?

    Depending on your actual task sizes, it might be faster still to turn the sentences into bags (sets) of words; then the comparison becomes something like (w & wAND) == wAND and (wOR < w) and (not w & wNOT)... Commented May 18, 2022 at 11:22
  • Will the words in wAND, etc. always consist of letters/digits/underscores? Commented May 18, 2022 at 11:24
  • Thanks a lot @AKX, I tried that but the performance improvement was not super. Any other solution faster? Thanks for the suggestion so far! Commented May 18, 2022 at 11:25
  • @Wiktor yes, they will be letters, digits and underscores. Not including punctuation or anything stranger. Commented May 18, 2022 at 11:27
  • @Forinstance You should probably tell us the approaches you've tried, and your measured speeds for all of them. Commented May 18, 2022 at 11:28

2 Answers


Your solution appears to be linear in the length of the document - you won't be able to get any better than this without sorting, as the words you're looking for could be anywhere in the document. You could try using one loop over the entire doc:

def match(doc, wAND, wOR, wNOT):
    wAND = set(wAND)  # copy, since the loop consumes it
    or_satisfied = False
    for word in doc:  # doc: an iterable of cleaned words
        if word in wAND: wAND.remove(word)
        if not or_satisfied and word in wOR: or_satisfied = True
        if word in wNOT: return False
    return or_satisfied and not wAND

5 Comments

Thanks I will try this! Would this work even without cleaning text? Or should I work on clean pre-processed word sets? One problem of not cleaning the text and splitting it is that if a search word is part of another (like "sea" is part of "seaside"), it was wrongly matching it.
This one will still need the text to be cleaned.
Thanks. Will this work also in the case one or more of the 3 lists are empty?
It will if you add a check for wOR being empty. You can do this at initialisation - or_satisfied = len(wOR) != 0
Thanks! And no problems for empty wAND and wNOT right?
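Putting the answer and the comment thread's fix together, a complete sketch (the names `clean_doc` and `match_stream`, and the tokenizer itself, are illustrative assumptions):

```python
import string

def clean_doc(text):
    # Hypothetical tokenizer: strip punctuation, lowercase, split into words
    return text.translate(str.maketrans("", "", string.punctuation)).lower().split()

def match_stream(words, wAND, wOR, wNOT):
    pending = set(wAND)     # copy: the loop consumes it
    or_satisfied = not wOR  # an empty wOR is trivially satisfied
    for word in words:
        if word in wNOT:
            return False
        pending.discard(word)
        if not or_satisfied and word in wOR:
            or_satisfied = True
    return or_satisfied and not pending
```

Empty wAND and wNOT indeed need no special handling: an empty `pending` stays empty, and no word can ever be in an empty wNOT.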

You can build regexps for the word bags you have, and use them:

import re

def make_re(word_set):
    return re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
        flags=re.I,
    )


wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)

def re_match(doc):
    if not wOR_re.search(doc):
        return False
    if wNOT_re.search(doc):
        return False
    found = set()
    expected = len(wAND)
    for match in re.finditer(r'\w+', doc):
        word = match.group().lower()
        if word in wAND:
            found.add(word)
            if len(found) == expected:
                break
    return len(found) == expected

A quick timing test suggests this is about 89% faster than the original (and it passes the original "test suite"), likely because:

  • documents don't need to be cleaned (since the \bs limit matches to words and re.I deals with case normalization)
  • regexps are run in native code, which tends to always be faster than Python
name='original'      iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match'      iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match'     iters=10000 time=0.203 iters_per_sec=49363.58

where bag_match is my original comment suggestion of using set intersections:

def bag_match(doc):
    bag = set(clean_doc(doc))
    return (
        (bag.intersection(wOR) or not wOR) and
        (bag.issuperset(wAND) or not wAND) and
        (not bag.intersection(wNOT) or not wNOT)
    )
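For completeness: `bag_match` calls a `clean_doc` helper that isn't shown in the answer. A minimal version consistent with the question's pre-processing (my assumption, not the answerer's code) might be:

```python
import re

def clean_doc(doc):
    # Lowercase the text and pull out runs of word characters,
    # which drops punctuation like "!" and "#" along the way
    return re.findall(r'\w+', doc.lower())
```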

If you have already cleaned the documents into an iterable of words (here I just slapped @lru_cache on clean_doc, which you probably wouldn't do in real life, since your documents are likely all unique and caching wouldn't help), then bag_match is much faster:

name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60

3 Comments

Thanks much for all the explanation and tests, this really helps! I did some additional tests and indeed, among your solutions, bag_match seems to be the best. This has also the advantage to allow me more complex cleaning. One last question, as I have difficulties with time testing (sorry I am new :) ). Do you think that the bag_match solution is faster than the one proposed by autumnalblake?
bag_match (added my implementation above) seems to be a bit faster than theirs (53769 iter/s vs 48286 iter/s) with this data.
Thanks a lot for testing this further. Very last question: I suppose there is no performance difference between your first implementation of bag_match (that of your first comment) with that added to this question right? I am now using the solution in your first comment, not this very last one. That seems to work even without the "or not" statements.
