Creating binary matrix of each unique value from list of lists

Question

This question is conceptually similar to the question here: Python Pandas: How to create a binary matrix from column of lists?, but due to the size of my data, I do not want to convert into a pandas data frame.

I have a list of lists like the following,

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]

And I would like a binary matrix with each unique value as a column, and each sublist as a row.

How could this be done efficiently on over 100000 sublists with around 1000 items each?

Edit:

Example output is similar to the output in the question linked above, where the list could essentially be considered as:

list_ = [["a", "b"], ["c"], ["d"], ["e"]]

   a  b  c  d  e
0  1  1  0  0  0
1  0  0  1  0  0
2  0  0  0  1  0
3  0  0  0  0  1

You have a ragged list here. Can you explain what your output should look like?
– cs95, Commented Jun 5, 2018 at 14:50
How many unique values are there in total? In the worst case, there will be 10**8 unique values, leading to 10**13 entries in the matrix, so you had better have a few terabytes of memory to fit the matrix in. More to the point, why are you transforming your data to a less memory-efficient representation in the first place? Please provide more context about the problem you are solving.
– Sven Marnach, Commented Jun 5, 2018 at 14:56
@SvenMarnach I want to do a Fisher's exact test on each feature (number) and use it as a feature selection method. I have another list with a categorical assignment for each sublist. Perhaps it would be better to iterate through. If you could provide some insight on this, that would be appreciated.
– Jack Arnestad, Commented Jun 5, 2018 at 14:57

phi · Accepted Answer · 2018-06-05 15:10:23Z

2

Using sklearn's CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(tokenizer=lambda x: x, lowercase=False)
m = cv.fit_transform(list_)

# To convert to a dense matrix
m.todense()

# To get the value corresponding to each column
# (on scikit-learn releases before 1.0, use cv.get_feature_names() instead)
cv.get_feature_names_out()

# If you need dummy (0/1) columns rather than counts
m = (m > 0)

You may want to keep it as a sparse matrix to save memory.
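
If scikit-learn is not an option, a comparable sparse result can be built directly with SciPy. The following is only a minimal sketch, assuming SciPy is installed and that the values in each sublist are hashable and sortable; `binary_csr` is a hypothetical helper name, not part of any library.

```python
import numpy as np
from scipy.sparse import csr_matrix


def binary_csr(rows):
    """Build a sparse 0/1 matrix: one row per sublist, one column per unique value."""
    # Map each unique value to a column index (sorted for a stable column order).
    columns = sorted({value for row in rows for value in row})
    col_index = {value: j for j, value in enumerate(columns)}

    indices = []  # column index of every stored entry
    indptr = [0]  # indices[indptr[i]:indptr[i + 1]] are the columns set in row i
    for row in rows:
        # set() drops repeats within a sublist so each cell is stored once.
        indices.extend(sorted(col_index[value] for value in set(row)))
        indptr.append(len(indices))

    data = np.ones(len(indices), dtype=np.uint8)
    matrix = csr_matrix((data, indices, indptr), shape=(len(rows), len(columns)))
    return matrix, columns


list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]
matrix, columns = binary_csr(list_)
print(columns)           # value represented by each column
print(matrix.toarray())  # dense view, only sensible for small inputs
```

Only the nonzero cells are stored, so memory grows with the number of distinct values per sublist rather than with rows times columns.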

edited Jun 5, 2018 at 15:10

answered Jun 5, 2018 at 14:59


Mohamed El Amine Douad · Accepted Answer · 2022-06-29 11:09:01Z

0

Each value in a sublist (row) is used as the column index that is set to 1 (True); every other column in that row stays 0 (False):

import numpy as np

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]

# Convert to a binary matrix.
# The number of columns is the largest value in list_ plus one,
# so this assumes the values are non-negative integers.
nbr_of_columns = max(map(max, list_)) + 1

Mat = np.zeros((len(list_), nbr_of_columns), dtype=bool)
for i, row in enumerate(list_):
    for value in row:
        Mat[i, value] = True

print(Mat)
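
Note that indexing by value allocates a column for every integer from 0 up to the maximum, including values that never occur (0 and 4 in this example). A possible variant, sketched here with `np.unique` under the assumption that every sublist is non-empty and all values share one comparable type, keeps one column per value that is actually present. The names `row_ids`, `col_ids`, and `columns` are illustrative only.

```python
import numpy as np

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]

# Flatten all sublists and remember which row each value came from.
lengths = [len(row) for row in list_]
flat = np.concatenate([np.asarray(row) for row in list_])
row_ids = np.repeat(np.arange(len(list_)), lengths)

# columns holds the sorted unique values;
# col_ids gives the column index of each flattened value.
columns, col_ids = np.unique(flat, return_inverse=True)

mat = np.zeros((len(list_), len(columns)), dtype=bool)
mat[row_ids, col_ids] = True  # repeated (row, column) pairs just set True again

print(columns)
print(mat)
```

This is still a dense array, so it only suits cases where the number of unique values is small enough for rows times columns to fit in memory.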


answered Jun 29, 2022 at 11:09
