Memory efficient way to create a set from a list of lists in Python

Question

I have a list of lists in which each inner list is a tokenized text, so its length is the number of words in that text.

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

Now, I want to create a set that contains all unique tokens from the corpus. For the above example, the desired output would be:

{'this', 'is', 'text', 'one', 'two'}

Currently, I have:

from itertools import chain
all_texts_list = list(chain(*corpus))
vocabulary = set(all_texts_list)

But this seems a memory-inefficient way of doing it.

Is there a more efficient way to obtain this set?

I found this link. However, that question asks for the set of unique lists, not the set of unique elements across the lists.

Vishal Singh · Accepted Answer · 2021-03-18 15:30:11Z


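You can use a simple for loop with the set's update method. Each inner list's tokens are added directly to the set, so no intermediate flattened list is created.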

vocabulary = set()

for tokens in corpus:
    vocabulary.update(tokens)

Output:

{'this', 'one', 'text', 'two', 'is'}
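
For comparison, here is a minimal sketch of two one-line alternatives that also skip the intermediate list. Neither is part of the original answer; both use only the standard library (itertools.chain.from_iterable and set.union), and the name vocabulary_union is just an illustrative choice. Note that set().union(*corpus) unpacks the outer list into arguments, which is fine for a modest number of texts but is something to keep in mind for a very large corpus.

```python
from itertools import chain

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

# chain.from_iterable yields tokens lazily, one inner list at a time,
# so set() consumes them without a flattened list ever being built.
vocabulary = set(chain.from_iterable(corpus))

# set.union accepts any number of iterables; unpacking passes each
# inner list as a separate argument.
vocabulary_union = set().union(*corpus)

# Both approaches produce the same set of unique tokens.
print(vocabulary == vocabulary_union)
```

In all of these variants the peak memory is dominated by the resulting set itself rather than by a temporary copy of every token.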

answered Mar 18, 2021 at 15:24
