10

I want to read a sparse matrix. When I build n-grams with scikit-learn, transform() returns its output as a sparse matrix. I want to read that matrix without calling todense().

Code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = ['john guy','nice guy']
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(document)
transformer = vectorizer.transform(document)
print(transformer)

Output :

  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (1, 0)    1
  (1, 3)    1
  (1, 4)    1

How can I read this output to get its values? I need the value at (0,0), (0,1), and so on, saved into a list.

3 Answers

14

The documentation for this transform method says it returns a sparse matrix, but doesn't specify the kind. Different kinds let you access the data in different ways, but it is easy to convert one to another. Your print display is the typical str for a sparse matrix.

An equivalent matrix can be generated with:

import numpy as np
from scipy import sparse
i=[0,0,0,1,1,1]
j=[0,1,2,0,3,4]
A=sparse.csr_matrix((np.ones_like(j),(i,j)))
print(A)

producing:

  (0, 0)        1
  (0, 1)        1
  (0, 2)        1
  (1, 0)        1
  (1, 3)        1
  (1, 4)        1

A csr type can be indexed like a dense matrix:

In [32]: A[0,0]
Out[32]: 1    
In [33]: A[0,3]
Out[33]: 0

Internally the csr matrix stores its data in data, indices, indptr, which is convenient for calculation, but a bit obscure. Convert it to coo format to get data that looks just like your input:

In [34]: A.tocoo().row
Out[34]: array([0, 0, 0, 1, 1, 1], dtype=int32)

In [35]: A.tocoo().col
Out[35]: array([0, 1, 2, 0, 3, 4], dtype=int32)
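To save the values into a list, as the question asks, the row, col and data arrays of the coo form can be zipped together. This is a minimal sketch on the same example matrix; the names C and triples are only illustrative, and the int() casts are there just to get plain Python integers instead of NumPy scalars.

```python
import numpy as np
from scipy import sparse

# Rebuild the example matrix used above.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

# Convert once to coo, then walk the three parallel arrays.
C = A.tocoo()
triples = [(int(r), int(c), int(v)) for r, c, v in zip(C.row, C.col, C.data)]
print(triples)
```

Only the stored (nonzero) entries are visited, so no dense array is built along the way.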

Or you can convert it to a dok type, and access that data like a dictionary:

A.todok().keys()
#  dict_keys([(0, 1), (0, 0), (1, 3), (1, 0), (0, 2), (1, 4)])
A.todok().items()

produces: (Python3 here)

dict_items([((0, 1), 1), 
            ((0, 0), 1), 
            ((1, 3), 1), 
            ((1, 0), 1), 
            ((0, 2), 1), 
            ((1, 4), 1)])

The lil format stores the data as two lists of lists with one entry per row: one holds the values (all 1s in this example), and the other holds the column indices of those values within each row.
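A short sketch of reading the lil form row by row, again on the example matrix. The rows and data attributes are the two per-row lists described above; per_row is just an illustrative name.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(
    (np.ones(6, dtype=int), ([0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 3, 4]))
)
L = A.tolil()

# L.rows[k] lists the column indices stored in row k,
# and L.data[k] holds the matching values.
per_row = [
    [(int(c), int(v)) for c, v in zip(cols, vals)]
    for cols, vals in zip(L.rows, L.data)
]
print(per_row)
```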

Or do you want to 'read' the data in some other way?

4

This is a SciPy CSR matrix. To convert this to (row, col, value) triples, the easiest option is to convert to COO format, then get the triples from that:

>>> from scipy.sparse import rand
>>> X = rand(100, 100, format='csr')
>>> X
<100x100 sparse matrix of type '<type 'numpy.float64'>'
    with 100 stored elements in Compressed Sparse Row format>
>>> C = X.tocoo()  # a CSR matrix has no .row/.col attributes; the COO form does
>>> list(zip(C.row, C.col, C.data))[:10]
[(1, 78, 0.73843533223380842),
 (1, 91, 0.30943772717074158),
 (2, 35, 0.52635078317400608),
 (4, 75, 0.34667509458006551),
 (5, 30, 0.86482318943934389),
 (7, 74, 0.46260571098933323),
 (8, 75, 0.74193890941716234),
 (9, 72, 0.50095749482583696),
 (9, 80, 0.85906284644174613),
 (11, 66, 0.83072142899400137)]

(Note that the output is sorted.)
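As an alternative sketch, scipy.sparse.find returns the row indices, column indices and values of the nonzero entries in one call, without an explicit format conversion. The small matrix here is the one from the question's output, not the random one above, and the result is sorted explicitly because this sketch does not rely on the order in which find returns entries.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(
    (np.ones(6, dtype=int), ([0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 3, 4]))
)

# find() gives three parallel arrays for the nonzero entries.
rows, cols, vals = sparse.find(A)
triples = sorted((int(r), int(c), int(v)) for r, c, v in zip(rows, cols, vals))
print(triples)
```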

2

You can use toarray() and data as:

>>> indices=transformer.toarray()
>>> indices
array([[1, 1, 1, 0, 0],
       [1, 0, 0, 1, 1]])
>>> values=transformer.data
>>> values
array([1, 1, 1, 1, 1, 1])
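If a dense array is to be avoided, the positions can be read from the CSR arrays directly. The following is a sketch that assumes the matrix is in CSR format (the question's output is rebuilt by hand here rather than taken from the vectorizer); by_row is an illustrative name.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(
    (np.ones(6, dtype=int), ([0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 3, 4]))
)

by_row = []
for row in range(A.shape[0]):
    # indptr[row]:indptr[row + 1] is the slice of indices/data for this row.
    start, end = A.indptr[row], A.indptr[row + 1]
    cols = A.indices[start:end].tolist()   # column positions of stored values
    vals = A.data[start:end].tolist()      # the stored values themselves
    by_row.append((row, cols, vals))
print(by_row)
```

Here data holds only the stored values, indices their column positions, and indptr the row boundaries, so nothing is expanded to full size.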

2 Comments

Sorry, I mentioned in the question that I don't want to use todense() to convert the sparse matrix into an ordinary matrix. Is there any other solution?
@Nikkz, sorry, I didn't read the question carefully. I've provided some methods of reading the data. Hope it helps.
