I have a list of lists in which each inner list is a tokenized text, so its length is the number of words in that text.

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

Now, I want to create a set that contains all unique tokens from the corpus. For the above example, the desired output would be:

{'this', 'is', 'text', 'one', 'two'}

Currently, I have:

from itertools import chain
all_texts_list = list(chain(*corpus))
vocabulary = set(all_texts_list)

But this seems memory-inefficient, since it builds a flattened list of every token in the corpus before creating the set.

Is there a more efficient way to obtain this set?


I found this link. However, the question there asks for the set of unique lists, not the set of unique elements across the lists.


1 Answer


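You can use a simple for loop with the set's update method. This adds each inner list's tokens to the set directly, so no flattened copy of the whole corpus is ever created.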

vocabulary = set()

for tokens in corpus:
    vocabulary.update(tokens)

Output (element order may vary, since sets are unordered):

{'this', 'one', 'text', 'two', 'is'}
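
If you would rather not write the loop, two one-line alternatives should give the same result without materializing a flattened list. This is a minimal sketch, assuming corpus is the list of token lists from the question; the names vocab_chain, vocab_union and vocab_loop are only illustrative.

```python
from itertools import chain

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

# Option 1: pass a lazy iterator straight to set().
# chain.from_iterable yields tokens one at a time, so no flattened list is built.
vocab_chain = set(chain.from_iterable(corpus))

# Option 2: union all inner lists into an empty set in a single call.
vocab_union = set().union(*corpus)

# Option 3: the explicit loop from this answer, for comparison.
vocab_loop = set()
for tokens in corpus:
    vocab_loop.update(tokens)

print(vocab_chain == vocab_union == vocab_loop)  # True
print(sorted(vocab_chain))  # ['is', 'one', 'text', 'this', 'two']
```

In all three variants the only large object kept in memory is the resulting set. The memory cost in the original code comes from the list(...) call, not from chain itself.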