I have this simple piece of code that tells me if a word in a given list appears in an article:

if not any(word in article.text for word in keywords):
    print("Skipping article as there is no matching keyword\n")

What I need is to check whether at least 3 words from the "keywords" list appear in the article; if they don't, the article should be skipped.

Is there an easy way to do this? I can't seem to find anything.

  • sum(word in article.text for word in keywords) >= 3 (write an explicit loop if you want to break earlier). Commented Jan 5, 2016 at 6:06
  • Thanks Sebastian - it worked! Commented Jan 5, 2016 at 7:53

3 Answers

You can count the number of items that satisfy a condition using this pattern:

sum(1 for x in xs if c(x))

Here you would do:

if sum(1 for word in keywords if word in article.text) >= 3:
    # 
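
For a self-contained illustration, here is a minimal sketch of that pattern wrapped in a function. The name has_enough_keywords, the minimum parameter and the sample strings are made up for the example, and text stands in for article.text:

```python
def has_enough_keywords(text, keywords, minimum=3):
    """Return True if at least `minimum` of the keywords occur in text.

    Hypothetical helper; `text` plays the role of article.text.
    `word in text` is a substring test, so "cat" also matches
    inside "concatenate".
    """
    return sum(1 for word in keywords if word in text) >= minimum


# Made-up sample data for the illustration.
sample_text = "python makes counting keyword matches in a string easy"
sample_keywords = ["python", "string", "keyword", "java"]

print(has_enough_keywords(sample_text, sample_keywords))     # True (3 of 4 match)
print(has_enough_keywords(sample_text, ["java", "python"]))  # False (only 1 match)
```

Note that this always checks every keyword, even after the threshold has been reached.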

2 Comments

or sum(word in article.text for word in keywords) (True and False are equivalent to 1 and 0)
@kindall: you're right! I don't know why I've always done it using the explicit 1.

If the set of keywords is large and the string being searched is long, it's often worth short-circuiting. This variation on the other approaches stops as soon as three hits are found (much like any stops at the first hit):

from itertools import islice

if sum(islice((1 for word in keywords if word in article.text), 3)) == 3:

Once you get three hits, it immediately stops iterating the keywords and the test passes.
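
The same short-circuiting can also be written as an explicit loop, as suggested in the comments on the question. This is only a sketch: at_least_n_keywords and the tracked helper are hypothetical names, and tracked exists purely to show which keywords actually get checked:

```python
def at_least_n_keywords(text, keywords, n=3):
    """Return True as soon as n keywords have been found in text."""
    if n <= 0:
        return True
    hits = 0
    for word in keywords:
        if word in text:
            hits += 1
            if hits >= n:
                return True  # stop scanning the remaining keywords
    return False


checked = []  # records every keyword the loop actually looked at


def tracked(words):
    """Hypothetical helper: yield words while logging each one."""
    for w in words:
        checked.append(w)
        yield w


text = "the quick brown fox jumps over the lazy dog"
words = ["quick", "cat", "fox", "dog", "lazy", "over"]

print(at_least_n_keywords(text, tracked(words)))  # True
print(checked)  # ['quick', 'cat', 'fox', 'dog']
```

The last two keywords are never tested against the text, which is where the saving comes from when the text is long.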

4 Comments

My text and lists are pretty long so you provided a better solution - thanks!
@Del: if the text is large and there are many keywords then you could use the Aho-Corasick algorithm (like grep -Ff keywords.txt text.txt).
I wouldn't have a clue how to do that, Sebastian. Any possibility of some Python code as an example? I assume the grep is a Linux bash command, or am I mistaken? My keywords are pulled from a db and the article is pulled down from a URL.
@Del: There are a couple of existing third-party packages for performing Aho-Corasick efficiently in Python; you'd want to check them for info. The basic idea is that you build a searcher from your keywords once, and you can then find all the words of your searcher in a single pass through the text (and you can process the text iteratively, so again, you can stop when you find three unique hits).

My text and lists are pretty long

If the text is large and there are many keywords, you could use the Aho-Corasick algorithm (like grep -Ff keywords.txt text.txt). For example, if you want to find non-overlapping occurrences, you could use the noaho package (not tested):

#!/usr/bin/env python
from itertools import islice
from noaho import NoAho  # $ pip install noaho

trie = NoAho()
for word in keywords:
    trie.add(word)
found_words = trie.findall_long(article.text)
if len(list(islice(found_words, 3))) == 3:
    print('at least 3 words in the "keywords" list appear in the article')
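
If installing a third-party package is not an option, the "build the searcher once, scan the text once, stop at three unique hits" idea can be roughly approximated with the standard library. To be clear, the sketch below is not Aho-Corasick: it swaps in a compiled regular-expression alternation, and the name has_n_distinct_keywords and the sample text are made up for the example. Matches are non-overlapping, so a keyword that only occurs inside or across another match can be missed:

```python
import re


def has_n_distinct_keywords(text, keywords, n=3):
    """Scan text once and stop after n distinct keywords are seen.

    Stand-in for Aho-Corasick using a regex alternation (hypothetical
    helper, standard library only).
    """
    if n <= 0:
        return True
    # Longest first so "windows 2000" is preferred over "windows".
    unique = sorted({w for w in keywords if w}, key=len, reverse=True)
    if not unique:
        return False
    pattern = re.compile("|".join(re.escape(w) for w in unique))
    seen = set()
    for match in pattern.finditer(text):
        seen.add(match.group(0))
        if len(seen) >= n:
            return True  # stop reading the rest of the text
    return False


# Made-up sample data for the illustration.
article_text = "ms windows 2000 ships with a new kernel and a new shell"

print(has_n_distinct_keywords(article_text, ["kernel", "shell", "windows 2000"]))  # True
print(has_n_distinct_keywords(article_text, ["kernel", "linux", "shell"]))         # False
```

Unlike the sum-based versions, this counts distinct keywords, so repeated occurrences of the same word are only counted once.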
