2

I have a huge string like:

The Dormouse's story. Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well....badword...

and I have a list of around 400 bad words:

bad_words = ["badword", "badword1", ....]

What is the most efficient way to check whether the text contains a bad word from the bad_words list?

I could loop over both the text and the list like:

for word in huge_string.split():
    for bw in bad_words_list:
        if bw in word:
            print("bad word is inside text")

but this feels like something from the '90s.

Update: bad words are single words.

5 Comments

  • So it can be a substring or actual words? If words, use sets. Commented Dec 23, 2014 at 12:41
  • @PadraicCunningham Actual words for now. Commented Dec 23, 2014 at 12:44
  • Did you try set intersection? Commented Dec 23, 2014 at 12:44
  • Do you just want to know if any bad words are found inside the input string? Or do you want to know which specific ones are found? Commented Dec 23, 2014 at 12:45
  • @DaanTimmer I want to know if any word from the bad word list is in the input string. Commented Dec 23, 2014 at 12:45

9 Answers

4

Turning your text into a set of words and computing its intersection with the set of bad words gives you average constant-time lookups per word, so the whole check is roughly linear in the length of the text:

text  = "The Dormouse's story. Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well....badword..."

badwords = set(["badword", "badword1", ....])

textwords = set(text.split())
for badword in badwords.intersection(textwords):
    print("The bad word '{}' was found in the text".format(badword))
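One caveat: `text.split()` only splits on whitespace, so a token such as `well....badword...` in the sample text stays glued together and the intersection misses it. A minimal sketch of one way around that, assuming a "word" is a run of letters, digits, underscores or apostrophes and that matching should be case-insensitive (both are assumptions, not something the question states); `find_bad_words` is a hypothetical helper name:

```python
import re

def find_bad_words(text, bad_words):
    """Return the set of bad words that occur as whole words in text."""
    # Assumption: a "word" is a run of letters, digits, underscores or apostrophes.
    tokens = set(re.findall(r"[\w']+", text.lower()))
    return tokens & {w.lower() for w in bad_words}

text = "they lived at the bottom of a well....badword..."
print(find_bad_words(text, ["badword", "badword1"]))  # {'badword'}
```

The tokenization pattern is the part most likely to need adjusting for real text.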

2 Comments

I like this solution; it should be more efficient than a for loop with a nested `word in text`. P.S.: you forgot an `in` in your for loop.
Perfect! I need exactly that amortized speed. Thanks.
2

No need to get all the words of the text, you can directly check if a string is in another string, e.g.:

In [1]: 'bad word' in 'do not say bad words!'
Out[1]: True

So you can just do:

for bad_word in bad_words_list:
    if bad_word in huge_string:
        print("BAD!!")

1

You can use any:

To test whether any bad word occurs as a substring of the text:

>>> bad_words = ["badword", "badword1"]
>>> text ="some text with badwords or not"
>>> any(i in text for i in bad_words)
True
>>> text ="some text with words or not"
>>> any(i in text for i in bad_words)
False

This checks whether any item of bad_words occurs anywhere in text, i.e. as a substring.

To test exact matches:

>>> text ="some text with badwords or not"
>>> any(i in text.split() for i in bad_words)
False
>>> text ="some text with badword or not"
>>> any(i in text.split() for i in bad_words)
True

This checks whether any item of bad_words is in text.split(), that is, whether it exactly matches a whitespace-separated word.

1

Here s is the long string. Use the & operator or the set.intersection method.

In [123]: set(s.split()) & set(bad_words)
Out[123]: {'badword'}

In [124]: bool(set(s.split()) & set(bad_words))
Out[124]: True

Or, even better, use set.isdisjoint, which short-circuits as soon as a match is found.

In [127]: bad_words = set(bad_words)

In [128]: not bad_words.isdisjoint(s.split())
Out[128]: True

In [129]: not bad_words.isdisjoint('for bar spam'.split())
Out[129]: False
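Since `isdisjoint` accepts any iterable, the tokens can also be produced lazily, so the scan can stop at the first hit without building the full word list first. A rough sketch, assuming regex tokenization with `\w+` and case-insensitive matching (assumptions, not part of the answer above); `has_bad_word` is a hypothetical name:

```python
import re

def has_bad_word(text, bad_words):
    """Return True if any whole word of text is in bad_words."""
    bad = {w.lower() for w in bad_words}
    # Generator: tokens are produced one at a time while isdisjoint consumes them.
    tokens = (m.group().lower() for m in re.finditer(r"\w+", text))
    return not bad.isdisjoint(tokens)

print(has_bad_word("some text with badword or not", ["badword", "badword1"]))  # True
print(has_bad_word("for bar spam", ["badword", "badword1"]))                   # False
```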

1

On top of all the excellent answers, the "actual words for now" clause in your comment points in the direction of regular expressions.

You may want to build a composed expression like bad|otherbad|yetanother:

import re

r = re.compile("|".join(map(re.escape, badwords)))  # escape regex metacharacters
r.search(text)  # returns a match object, or None if no bad word occurs
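A plain alternation also matches inside longer words, so `bad` would hit `badge`. If whole-word matching is what is wanted, one possible sketch wraps the alternation in `\b` word boundaries and escapes each word with `re.escape`. Case-insensitive matching is an assumption here, and `build_pattern` is a hypothetical helper:

```python
import re

def build_pattern(bad_words):
    """Compile one regex that matches any bad word as a whole word."""
    alternation = "|".join(re.escape(w) for w in bad_words)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

pattern = build_pattern(["badword", "badword1"])
print(bool(pattern.search("at the bottom of a well....badword...")))  # True
print(bool(pattern.search("some text with badwords or not")))         # False
```

Compiling the pattern once and reusing it for every text avoids rebuilding the alternation on each call.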

1

Something like:

st = set(s.split())

bad_words = ["badword", "badword1"]
any(bad in st for bad in bad_words)

Or if you want the words:

st = set(s.split())

bad_words = {"badword", "badword1"}
print(st.intersection(bad_words))

If the sentence ends in a bad word followed by punctuation, such as badword. or badword!, the set method will fail. In that case you have to go over each word in the string and check whether any bad word equals it or is a substring of it:

st = s.split()
any(bad in word for word in st for bad in bad_words)
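If substring matching is too broad (it would also flag `badwords`), punctuation can instead be stripped from each token before the set lookup. A minimal sketch, assuming the ASCII characters in `string.punctuation` are all that needs stripping; `words_without_punctuation` is a hypothetical helper:

```python
import string

def words_without_punctuation(s):
    """Split on whitespace and strip leading/trailing ASCII punctuation."""
    return {word.strip(string.punctuation) for word in s.split()}

bad_words = {"badword", "badword1"}
s = "this sentence ends in a badword!"
print(bad_words & words_without_punctuation(s))  # {'badword'}
```

This only trims the ends of each token; words glued together by punctuation with no whitespace between them are still treated as one token.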

0

I would use a filter function:

filter(lambda s : s in bad_words_list, huge_string.split())
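Note that in Python 3 `filter` returns a lazy iterator, and that iterator object is always truthy, so it cannot be tested directly in an `if`. A small sketch of two ways to consume it, reusing the names from the line above with made-up sample values:

```python
bad_words_list = ["badword", "badword1"]      # sample values for illustration
huge_string = "some text with badword or not"

bad_set = set(bad_words_list)                 # set lookup instead of a list scan

# Materialize every match as a list.
matches = list(filter(lambda s: s in bad_set, huge_string.split()))
print(matches)  # ['badword']

# Or take only the first match, with None as the default when nothing matches.
first = next(filter(lambda s: s in bad_set, huge_string.split()), None)
print(first)  # badword
```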

0

There is already a library for that, better_profanity:

from better_profanity import profanity
print(profanity.censor("YOUR_TEXT", "#"))

-1
s = " a string with bad word"
text = s.split()

if any(bad_word in text for bad_word in ('bad', 'bad2')):
    print("bad word found")

1 Comment

That will only print the last bad_word? any just returns True or False depending on whether "any" of the elements in the iterable is truthy.
