I'm having a problem building a vocabulary of words in Python. My code goes through every word in a document of about 2.3 MB and checks whether the word is already in the vocabulary; if it is not, it appends the word to the list.

The problem is that it is taking way too long (I haven't even gotten it to finish yet). How can I solve this?

Code:

words = [("_", "hello"), ("hello", "world"), ("world", "."), (".", "_")] # List of a ton of tuples of words
vocab = []
for w in words:
    if w not in vocab:
        vocab.append(w)

3 Comments

  • How many words have you got there? And why not use set() instead of a list? Commented Dec 27, 2016 at 0:17
  • Can you provide a copy of the words you are checking against? Commented Dec 27, 2016 at 0:18
  • words is actually a list of tuples (n-grams) Commented Dec 27, 2016 at 0:27

2 Answers

Unless you need vocab to have a particular order, you can just do:

vocab = set(words)
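
If the order of first appearance does matter, one possible sketch is below. It assumes Python 3.7 or later, where a dict keeps its keys in insertion order. The helper name `unique_in_order` and the extra duplicate tuple in `words` are made up for illustration and are not from the question.

```python
def unique_in_order(items):
    """Return the distinct items in order of first appearance."""
    # dict keys are unique and (Python 3.7+) keep insertion order,
    # so this drops duplicates without shuffling the rest.
    return list(dict.fromkeys(items))


# Same shape as the question's data, plus one duplicate n-gram at the end.
words = [("_", "hello"), ("hello", "world"), ("world", "."), (".", "_"), ("hello", "world")]
vocab = unique_in_order(words)
print(vocab)
```

Like set(), this needs the items to be hashable, which tuples of strings are.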

3 Comments

But what if a word appears more than once in the words list? I don't want any duplicates in my vocabulary. @AlexHall
@N.Chalifour yup, sets don't have duplicates.
Thanks! It worked like a charm.

The following test compares the execution time of the for loop and set():

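import random
import string
import time


# 1000 random 5-letter words, repeated 10 times to create duplicates.
# string.ascii_letters replaces string.letters, which Python 3 removed.
words = [''.join(random.sample(string.ascii_letters, 5)) for i in range(1000)] * 10

# Approach 1: list with a linear membership test on every iteration
vocab1 = []

t1 = time.time()
for w in words:
    if w not in vocab1:
        vocab1.append(w)
t2 = time.time()

# Approach 2: let set() drop the duplicates
t3 = time.time()
vocab2 = set(words)
t4 = time.time()

# print() with parentheses runs on both Python 2 and Python 3
print(t2 - t1)
print(t4 - t3)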

Output:

0.0880000591278  # Using for loop
0.000999927520752  # Using set()
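
The gap comes from the membership test: `w not in vocab1` scans the list item by item, while a set looks items up by hash, which takes roughly constant time on average. If the loop itself has to stay, a rough sketch is to keep a set next to the list purely for lookups. The names `build_vocab` and `seen` are illustrative and not part of the test above.

```python
def build_vocab(words):
    """Collect distinct words in first-seen order, using a set for lookups."""
    seen = set()   # hash-based membership checks
    vocab = []     # result list, in order of first appearance
    for w in words:
        if w not in seen:
            seen.add(w)
            vocab.append(w)
    return vocab


sample = ["a", "b", "a", "c", "b"]
print(build_vocab(sample))  # ['a', 'b', 'c']
```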
