I'm trying to build a vocab list from a set of strings and then remove every word that doesn't appear in at least 30 of the strings in the set. There are about 300,000 words in total in the set. For some reason, the code that checks whether a word appears at least 30 times takes over 5 minutes to run. How can I make this code more efficient so it has a reasonable runtime? Thanks!
word_list = []
for item in ex_set:
    # keep each word once per string, so count() below counts strings, not occurrences
    word_list += list(dict.fromkeys(item.split()))
vocab_list = []
for word in word_list:  # where it runs forever
    if word_list.count(word) >= 30:
        vocab_list.append(word)
Have you considered CountVectorizer or TfidfVectorizer?
Avoid using set as a variable name, it's the name of a built-in function.
collections.Counter will be faster than list.count in your loop
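The collections.Counter suggestion could look roughly like the sketch below. The slow part of the original loop is that word_list.count(word) rescans the whole list for every word, which is quadratic in the list length; a Counter tallies everything in a single pass. The function name build_vocab, the min_docs parameter, and the tiny sample ex_set are made up for illustration and are not from the question.

```python
from collections import Counter

def build_vocab(strings, min_docs=30):
    """Return the words that appear in at least `min_docs` of the strings."""
    doc_freq = Counter()
    for item in strings:
        # set() makes each word count once per string, like dict.fromkeys above
        doc_freq.update(set(item.split()))
    # one pass over the tallies instead of one list.count() scan per word
    return [word for word, n in doc_freq.items() if n >= min_docs]

# Hypothetical sample data; the real ex_set comes from the question.
ex_set = ["the cat sat", "the dog sat", "the bird flew"]
vocab_list = build_vocab(ex_set, min_docs=2)
print(sorted(vocab_list))
```

Unlike the original loop, this returns each qualifying word once rather than once per string it appears in, and the order of the result is not guaranteed, so sort it if order matters.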