
I'm trying to convert a TF-IDF sparse matrix to JSON format. Converting it to a pandas DataFrame (via toarray() or todense()) causes a memory error, so I would like to avoid those approaches. Is there another way to convert it to JSON?

Below is my approach to getting the sparse matrix, and my preferred JSON outcome.

Thanks for helping me out ... !


TF-IDF matrix

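from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

pip = Pipeline([('hash', HashingVectorizer(ngram_range=(1, 1), non_negative=True)), ('tfidf', TfidfTransformer())])
result_uni_gram = pip.fit_transform(df_news_noun['content_nouns'])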

return matrix

result_uni_gram

<112537x1048576 sparse matrix of type '<class 'numpy.float64'>'
    with 12605888 stored elements in Compressed Sparse Row format>



print(result_uni_gram)

(0, 1041232)    0.03397010691200069
(0, 1035546)    0.042603425242006505
(0, 1031141)    0.05579563771771019
(0, 1029045)    0.03985981185871279
(0, 1028867)    0.14591155976555212
(0, 1017328)    0.03827279930970525
:   :
(112536, 9046)  0.04444360144902461
(112536, 4920)  0.07335227778871069
(112536, 4301)  0.06667794684006756

Expected outcome

output_json = {
                0: {1041232 : 0.03397, 1035546 : 0.04260, 1031141 : 0.055795 ... }, 
                ...
                ... 112536: {9046 : 0.04444, 4920 : 0.07335, 4301 : 0.06667}
               }


2 Answers


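I managed to do it like this, given that 'test_samples' is your 'scipy.sparse.csr.csr_matrix'. Note that toarray() builds a dense array, so this only works when the dense matrix fits in memory.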

import json

np_test_samples = test_samples.toarray()
jason_test_samples = json.dumps({"data": np_test_samples.tolist()})

2 Comments

This would be a lot more useful if you included instructions to load the serialized data back into Python objects.
@rjurney He is converting it to a dense array, then to a Python list. You would use json.loads(jason_test_samples) to get back the list and use this solution to convert it to sparse.
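
The round trip described in the comment above can be sketched as follows. The small `test_samples` matrix is made-up data for illustration, and because this route goes through a dense array it is only practical when that array fits in memory.

```python
import json
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the answer's test_samples matrix.
test_samples = csr_matrix(np.array([[0.0, 0.5], [1.5, 0.0]]))

# Serialize: densify, convert to nested lists, dump to a JSON string.
jason_test_samples = json.dumps({"data": test_samples.toarray().tolist()})

# Deserialize: parse the JSON, rebuild the dense array, convert back to CSR.
loaded = json.loads(jason_test_samples)
restored = csr_matrix(np.array(loaded["data"]))
```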

The script below does not produce your 'preferred' JSON format, but hopefully it helps anyone else who is trying to convert a sparse matrix to JSON and back. Since an ndarray is not JSON serializable, I converted the arrays to lists and built a custom JSON object from them. This is more memory-efficient than mat.toarray().tolist(), which creates a dense array.

import json
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 1])
col = np.array([2, 0])
data = np.array([2, 3])
mat = csr_matrix((data, (row, col)), shape=(2, 3))

# mat is:
#[[0 0 2]
# [3 0 0]]

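# Store the matrix as COO triplets (row, col, value) plus its shape
coo = mat.tocoo()
json_str = json.dumps({"data": coo.data.tolist(), "row": coo.row.tolist(),
 "col": coo.col.tolist(), "shape": list(mat.shape)})

obj = json.loads(json_str)

# Pass the shape explicitly so trailing empty rows/columns are preserved
mat2 = csr_matrix((obj['data'], (obj['row'], obj['col'])), shape=tuple(obj['shape']))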

print((mat != mat2).nnz==0)

print(mat)
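
For the nested format the question asks for, one possible approach is to walk the CSR arrays row by row, so the matrix is never densified. This is only a sketch: the helper name `csr_to_nested_dict` and its `ndigits` rounding parameter are made up for illustration, and the tiny matrix stands in for the real TF-IDF result.

```python
import json
import numpy as np
from scipy.sparse import csr_matrix

def csr_to_nested_dict(mat, ndigits=5):
    """Build {row: {col: value}} from a sparse matrix without densifying it.

    Hypothetical helper, not part of scipy or scikit-learn.
    """
    mat = mat.tocsr()
    out = {}
    for i in range(mat.shape[0]):
        # indptr[i]:indptr[i + 1] is the slice holding row i's stored entries
        start, end = mat.indptr[i], mat.indptr[i + 1]
        cols = mat.indices[start:end].tolist()
        vals = mat.data[start:end].tolist()
        out[i] = {c: round(v, ndigits) for c, v in zip(cols, vals)}
    return out

# Made-up 3x4 example; row 1 has no stored entries.
row = np.array([0, 0, 2])
col = np.array([3, 1, 0])
data = np.array([0.5, 0.25, 0.125])
example = csr_matrix((data, (row, col)), shape=(3, 4))

nested = csr_to_nested_dict(example)
output_json = json.dumps(nested)
```

Note that JSON object keys are always strings, so the integer row and column indices come back as "0", "1", ... after json.loads. With millions of stored elements the full dictionary may itself be large, in which case writing one row per line to a file could be an alternative.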
