1

I'm trying to make a vocab list from a set of strings and then remove all words that aren't repeated in at least 30 strings in the set. There are about 300,000 words in total in the set. For some reason the code that checks if a word has been repeated throughout 30 times has runtime of at least over 5 minutes and I was wondering how I could make this code more efficient so it has a reasonable runtime. Thanks!

word_list = []
for item in ex_set:
    word_list += (list(dict.fromkeys(item.split()))) #remove unique words

vocab_list = []
for word in word_list: #where it runs forever
    if word_list.count(word) >= 30:
        vocab_list.append(word)
4
  • Are you planning to later use that data for a model? Specifically, are you planning to use something like CountVectorizer or TfidfVectorizer? Commented Jul 30, 2019 at 19:53
  • 7
    Don't use set as a variable name, it's the name of a built-in function. Commented Jul 30, 2019 at 19:54
  • 2
    collections.Counter will be faster than list.count in your loop Commented Jul 30, 2019 at 19:56
  • 1
    Also just one brief comment about the logic of it: if you have a word that is repeated in word_list more than 30 times, you are visiting it redundantly (i.e. after you append it to vocab_list, you don't need to visit it again, but you do...). So try adding one line to remove that repeated word from word_list to ensure you only loop over unique words Commented Jul 30, 2019 at 19:58

2 Answers 2

4

If you are trying to get all the words in a list of words that appear at least 30 times, you can count them first using collections.Counter, then find all the ones where they appear more than 30 times.

from collections import Counter
word_counts = Counter(ex_set)

vocab_list = [word for word, count in words.items() if count >= 30]

Just another note, do not use the word set as a variable name as it is a keyword

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks, that was a lot more concise than what I had haha
0

Here is another way of thinking about the problem:

Every single call to count loops over the entire list again (quadratic time).

If you build a dict of word counts, this is a smaller data structure to check on the second iteration:

from collections import defaultdict

counter_dict = defaultdict(int)
for word in word_list:
    counter_dict[word] += 1

vocab_list = []
for word, count in counter_dict.items()
    if count >= 30:
        vocab_list.append(word)

Having seen Jmonsky's answer, if that works, then it should be accepted.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.