Python loop has extremely long runtime

Question

I'm trying to make a vocab list from a set of strings and then remove all words that aren't repeated in at least 30 strings in the set. There are about 300,000 words in total in the set. For some reason the code that checks if a word has been repeated throughout 30 times has runtime of at least over 5 minutes and I was wondering how I could make this code more efficient so it has a reasonable runtime. Thanks!

word_list = []
for item in ex_set:
    word_list += (list(dict.fromkeys(item.split()))) #remove unique words

vocab_list = []
for word in word_list: #where it runs forever
    if word_list.count(word) >= 30:
        vocab_list.append(word)

Are you planning to later use that data for a model? Specifically, are you planning to use something like CountVectorizer or TfidfVectorizer? — Juan C
– Juan C, Commented Jul 30, 2019 at 19:53
Don't use set as a variable name, it's the name of a built-in function. — Barmar
– Barmar, Commented Jul 30, 2019 at 19:54
collections.Counter will be faster than list.count in your loop — Chris_Rands
– Chris_Rands, Commented Jul 30, 2019 at 19:56
Also just one brief comment about the logic of it: if you have a word that is repeated in word_list more than 30 times, you are visiting it redundantly (i.e. after you append it to vocab_list, you don't need to visit it again, but you do...). So try adding one line to remove that repeated word from word_list to ensure you only loop over unique words — manesioz
– manesioz, Commented Jul 30, 2019 at 19:58

Jmonsky · Accepted Answer · 2019-07-30 20:02:18Z

4

If you are trying to get all the words in a list of words that appear at least 30 times, you can count them first using collections.Counter, then find all the ones where they appear more than 30 times.

from collections import Counter
word_counts = Counter(ex_set)

vocab_list = [word for word, count in words.items() if count >= 30]

Just another note, do not use the word set as a variable name as it is a keyword

edited Jul 30, 2019 at 20:02

answered Jul 30, 2019 at 19:59

Jmonsky

1,5291 gold badge9 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Acccumulation Over a year ago

link to documentation: docs.python.org/2/library/collections.html#collections.Counter

Sydney Over a year ago

Thanks, that was a lot more concise than what I had haha

JacobIRR · Accepted Answer · 2019-07-30 20:02:51Z

0

Here is another way of thinking about the problem:

Every single call to count loops over the entire list again (quadratic time).

If you build a dict of word counts, this is a smaller data structure to check on the second iteration:

from collections import defaultdict

counter_dict = defaultdict(int)
for word in word_list:
    counter_dict[word] += 1

vocab_list = []
for word, count in counter_dict.items()
    if count >= 30:
        vocab_list.append(word)

Having seen Jmonsky's answer, if that works, then it should be accepted.

answered Jul 30, 2019 at 20:02

JacobIRR

9,0468 gold badges47 silver badges73 bronze badges

Collectives™ on Stack Overflow

Python loop has extremely long runtime

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related