python - efficient way of checking if part of string is in the list

Question

I have a huge string like:

The Dormouse's story. Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well....badword...

and I have a list of around 400 bad words:

bad_words = ["badword", "badword1", ....]

What is the most efficient way to check whether the text contains a bad word from the bad_words list?

I could loop over both text and list like:

for word in huge_string.split():
    for bw in bad_words_list:
        if bw in word:
            print("bad word is inside text")

but this looks like something from the '90s.

Update: bad words are single words.

So can it be a substring, or only whole words? If whole words, use sets.
– Padraic Cunningham, Commented Dec 23, 2014 at 12:41
Do you just want to know whether any bad words are found in the input string, or do you want to know which specific ones are found?
– Daan Timmer, Commented Dec 23, 2014 at 12:45
@DaanTimmer I want to know if any word from the bad-word list is in the input string.
– doniyor, Commented Dec 23, 2014 at 12:45

inspectorG4dget · Accepted Answer · 2014-12-23 12:58:27Z

4

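Turning your text into a set of words and computing its intersection with the set of bad words gives you average-case constant-time membership tests, so the whole check is roughly linear in the number of words rather than a nested loop: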

text  = "The Dormouse's story. Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well....badword..."

badwords = set(["badword", "badword1", ....])

textwords = set(text.split())
for badword in badwords.intersection(textwords):
    print("The bad word '{}' was found in the text".format(badword))
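One caveat with the snippet above: text.split() only splits on whitespace, so a token such as "well....badword..." keeps its punctuation and never equals "badword". A minimal sketch of one way around that, assuming it is acceptable to treat every run of letters, digits and underscores as a word and to ignore case (the helper name find_bad_words is hypothetical, not part of the original answer):

```python
import re

def find_bad_words(text, bad_words):
    """Return the set of bad words that occur as whole words in text."""
    # \w+ extracts runs of letters, digits and underscores, so trailing
    # punctuation such as "badword..." does not hide a match.
    words = set(re.findall(r"\w+", text.lower()))
    # Set intersection: only words present in both sets survive.
    return words & {w.lower() for w in bad_words}

text = "they lived at the bottom of a well....badword..."
found = find_bad_words(text, ["badword", "badword1"])
print(found)  # {'badword'}
```

Note that this tokenization also splits "Dormouse's" into "dormouse" and "s", which may or may not be what you want.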

edited Dec 23, 2014 at 12:58

answered Dec 23, 2014 at 12:46


2 Comments

LeartS Over a year ago

I like this solution; it should be more efficient than a for loop with a nested "word in text" check. P.S.: you forgot an "in" in your for loop.

doniyor Over a year ago

perfect! i need exactly that amortized speed. thanks

LeartS · Accepted Answer · 2014-12-23 12:50:07Z

2

There is no need to split the text into words; you can check directly whether one string occurs inside another, e.g.:

In [1]: 'bad word' in 'do not say bad words!'
Out[1]: True

So you can just do:

for bad_word in bad_words_list:
    if bad_word in huge_string:
        print("BAD!!")

edited Dec 23, 2014 at 12:50

answered Dec 23, 2014 at 12:44


fredtantini · Accepted Answer · 2014-12-23 12:46:41Z

1

You can use any:

To test whether any bad word occurs as a substring of the text (for example as the prefix of a longer word):

>>> bad_words = ["badword", "badword1"]
>>> text = "some text with badwords or not"
>>> any(i in text for i in bad_words)
True
>>> text = "some text with words or not"
>>> any(i in text for i in bad_words)
False

This checks whether any item of bad_words occurs anywhere in text, using substring matching.

To test exact matches:

>>> text = "some text with badwords or not"
>>> any(i in text.split() for i in bad_words)
False
>>> text = "some text with badword or not"
>>> any(i in text.split() for i in bad_words)
True

This checks whether any item of bad_words is in text.split(), that is, whether it matches a whitespace-separated word exactly.

answered Dec 23, 2014 at 12:46


Vishnu Upadhyay · Accepted Answer · 2014-12-23 12:55:46Z

1

Here s is the long string. Use the & operator or the set.intersection method.

In [123]: set(s.split()) & set(bad_words)
Out[123]: {'badword'}

In [124]: bool(set(s.split()) & set(bad_words))
Out[124]: True

Or, even better, use set.isdisjoint, which short-circuits as soon as a match is found.

In [127]: bad_words = set(bad_words)

In [128]: not bad_words.isdisjoint(s.split())
Out[128]: True

In [129]: not bad_words.isdisjoint('for bar spam'.split())
Out[129]: False
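Because isdisjoint accepts any iterable, the words can be normalized lazily with a generator, so nothing past the first match needs to be processed. A minimal sketch, assuming matches should ignore case and punctuation attached to the start or end of a word (the helper name has_bad_word is hypothetical):

```python
import string

def has_bad_word(text, bad_words):
    """True if any whitespace-separated token of text is a bad word."""
    bad = {w.lower() for w in bad_words}
    # Generator expression: each token is stripped and lowercased only
    # when isdisjoint asks for it, so work stops at the first match.
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return not bad.isdisjoint(tokens)

print(has_bad_word("Do not say badword!", ["badword", "badword1"]))  # True
print(has_bad_word("for bar spam", ["badword", "badword1"]))         # False
```

str.strip only removes punctuation at the ends of a token, so a token like "well....badword" with punctuation in the middle still would not match; a regular expression is the better tool for that case.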

edited Dec 23, 2014 at 12:55

answered Dec 23, 2014 at 12:47


xtofl · Accepted Answer · 2014-12-23 12:56:56Z

1

On top of all the excellent answers, the "for now, whole words" clause in your comment points in the direction of regular expressions.

You may want to build a combined expression like bad|otherbad|yetanother:

import re
# re.escape keeps any regex metacharacters in a bad word literal
r = re.compile("|".join(re.escape(w) for w in badwords))
r.search(text)
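A plain alternation matches anywhere in the text, including inside longer words. If only whole words should match, one option is to wrap the alternation in \b word-boundary anchors. A minimal sketch, assuming a non-empty word list and case-insensitive matching (the helper name compile_bad_word_pattern is hypothetical):

```python
import re

def compile_bad_word_pattern(bad_words):
    """Compile one regex that matches any bad word as a whole word."""
    # re.escape keeps characters such as "." or "+" in a word literal.
    alternatives = "|".join(re.escape(w) for w in bad_words)
    # \b anchors the match at word boundaries on both sides.
    # Assumes bad_words is non-empty; an empty alternation would match
    # at every word boundary.
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

pattern = compile_bad_word_pattern(["badword", "badword1"])
print(bool(pattern.search("a well....badword...")))  # True
print(bool(pattern.search("some badwords here")))    # False
```

Compiling once and reusing the pattern keeps the cost of building the expression out of the per-text check.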

answered Dec 23, 2014 at 12:56


Padraic Cunningham · Accepted Answer · 2014-12-23 12:58:56Z

1

Something like:

st = set(s.split())

bad_words = ["badword", "badword1"]
any(bad in st for bad in bad_words)

Or if you want the words:

st = set(s.split())

bad_words = {"badword", "badword1"}
print(st.intersection(bad_words))

If a bad word is followed by punctuation, such as a sentence ending in badword. or badword!, the set method will fail. In that case you have to go over each word in the string and check whether any bad word equals the word or is a substring of it.

st = s.split()
any(bad in word for word in st for bad in bad_words)

edited Dec 23, 2014 at 12:58

answered Dec 23, 2014 at 12:46


Riccardo · Accepted Answer · 2014-12-23 13:10:07Z

0

I would use the filter function:

list(filter(lambda word: word in bad_words_list, huge_string.split()))

answered Dec 23, 2014 at 13:10


user16435384 · Accepted Answer · 2022-09-25 20:32:04Z

0

There is already a library for that, better_profanity:

from better_profanity import profanity
print(profanity.censor("YOUR_TEXT", "#"))

answered Sep 25, 2022 at 20:32


A.Kareem · Accepted Answer · 2014-12-23 13:23:16Z

-1

s = " a string with bad word"
text = s.split()

if any(bad_word in text for bad_word in ('bad', 'bad2')):
    print("bad word found")

edited Dec 23, 2014 at 13:23

answered Dec 23, 2014 at 12:51


1 Comment

Daan Timmer Over a year ago

That will only print the last bad_word? any just returns True or False depending on whether any of the elements in the iterable is true (truthy).
