This question is conceptually similar to the question here: Python Pandas: How to create a binary matrix from column of lists?, but due to the size of my data, I do not want to convert into a pandas data frame.

I have a list of lists like the following,

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]

And I would like a binary matrix with each unique value as a column, and each sublist as a row.

How could this be done efficiently on over 100000 sublists with around 1000 items each?

Edit:

Example output is similar to the output in the question linked above, where the list could essentially be considered as:

list_ = [["a", "b"], ["c"], ["d"], ["e"]]

   a  b  c  d  e
0  1  1  0  0  0
1  0  0  1  0  0
2  0  0  0  1  0
3  0  0  0  0  1
  • You have a ragged list here. Can you explain what your output should look like? Commented Jun 5, 2018 at 14:50
  • How many unique values are there in total? In the worst case, there will be 10**8 unique values, leading to 10**13 entries in the matrix, so you better have a few terabytes of memory to fit the matrix in. More to the point, why are you transforming your data to a less memory-efficient representation in the first place? Please provide more context about the problem you are solving. Commented Jun 5, 2018 at 14:56
  • @SvenMarnach I want to do a Fisher's exact test on each feature (number) and use it as a feature selection method. I have another list with a categorical assignment for each sublist. Perhaps it would be better to iterate through. If you could provide some insight on this that would be appreciated. Commented Jun 5, 2018 at 14:57
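
The "iterate through" idea from the last comment can be sketched without building the matrix at all. This is a minimal sketch, not the asker's method: the function name `contingency_tables` and the `groups` labels are hypothetical, and it assumes exactly two classes. It counts, per feature, how many sublists in each class contain it, which gives the 2x2 table a Fisher's exact test needs.

```python
from collections import Counter

def contingency_tables(rows, groups, positive):
    """Return {feature: [[a, b], [c, d]]} for a two-class grouping.

    a / c = sublists containing the feature in the positive / other class,
    b / d = sublists lacking the feature in the positive / other class.
    """
    n_pos = sum(1 for g in groups if g == positive)
    n_neg = len(groups) - n_pos
    present_pos = Counter()
    present_neg = Counter()
    for row, g in zip(rows, groups):
        target = present_pos if g == positive else present_neg
        target.update(set(row))  # count each feature once per sublist
    tables = {}
    for feature in set(present_pos) | set(present_neg):
        a, c = present_pos[feature], present_neg[feature]
        tables[feature] = [[a, n_pos - a], [c, n_neg - c]]
    return tables

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]
groups = ["x", "y", "x"]  # hypothetical class labels, one per sublist
tables = contingency_tables(list_, groups, positive="x")
```

Each table could then be passed to `scipy.stats.fisher_exact`. Memory use grows with the number of distinct features rather than with rows times columns.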

2 Answers

Using sklearn's CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(tokenizer=lambda x: x, lowercase=False)
m = cv.fit_transform(list_)

# To transform to dense matrix
m.todense()

# To get the value corresponding to each column
# (get_feature_names() in scikit-learn versions before 1.0)
cv.get_feature_names_out()

# If you need binary indicators rather than counts
m = (m > 0)

You may want to keep it as a sparse matrix for memory reasons.
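
If scikit-learn is not available, a sparse binary matrix can also be built directly with SciPy. The following is a minimal sketch under that assumption, not part of the answer above; the helper name `binary_csr` is hypothetical. It maps each unique value to a column and fills the CSR `indices`/`indptr` arrays row by row.

```python
import numpy as np
from scipy.sparse import csr_matrix

def binary_csr(rows):
    """Build a sparse 0/1 matrix; returns (matrix, sorted column labels)."""
    labels = sorted({v for row in rows for v in row})
    col_of = {v: j for j, v in enumerate(labels)}
    indices = []
    indptr = [0]
    for row in rows:
        # De-duplicate within the row so each (row, column) pair is stored once.
        indices.extend(sorted(col_of[v] for v in set(row)))
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.uint8)
    matrix = csr_matrix((data, indices, indptr),
                        shape=(len(rows), len(labels)))
    return matrix, labels

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]
m, labels = binary_csr(list_)
```

Only the nonzero entries are stored, so memory scales with the total number of distinct (row, value) pairs rather than with rows times columns.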


Each value in a sublist (row) is used as the column index of a 1 (True); every other column in that row stays 0 (False):

import numpy as np

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]

##################################
# convert to binary matrix
##################################
# Number of columns = largest value in list_ plus one. Values are used
# directly as column indices, so they must be non-negative integers.
nbr_of_columns = max(map(max, list_)) + 1

Mat = np.zeros((len(list_), nbr_of_columns), dtype=bool)
for i, row in enumerate(list_):
    for value in row:
        Mat[i, value] = True

print(Mat)
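
The loop above allocates a column for every integer up to the maximum, including values that never occur. A possible variant, sketched here as an assumption rather than as part of the original answer, uses `np.unique` to allocate one column per distinct value and fills the matrix with a single vectorized assignment. The names `flat`, `labels`, `cols`, and `rows` are illustrative.

```python
import numpy as np

list_ = [[5, 3, 5, 2], [6, 3, 2, 1, 3], [5, 3, 2, 5, 2]]

# Flatten the ragged list and remember which row each item came from.
lengths = [len(row) for row in list_]
flat = np.concatenate([np.asarray(row) for row in list_])
rows = np.repeat(np.arange(len(list_)), lengths)

# labels: sorted unique values; cols: column index of each flattened item.
labels, cols = np.unique(flat, return_inverse=True)

mat = np.zeros((len(list_), len(labels)), dtype=bool)
mat[rows, cols] = True  # repeated (row, col) pairs simply set True again
```

Here `labels[j]` gives the value represented by column `j`. The result is still a dense array, so it only suits data where the number of distinct values is modest.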
