
Now I have a dict object where the key is a unique hashed id and the value is a sparse list of length > 100. I'd like to store this in plain text (e.g. csv/tsv/whatever, as long as it is not pickle.dump). Is there any good way to store this kind of sparse list? For example:

d = {"a": [0,0,0, ..., 1,0], "b": [0.5,0,0, ...,0.5,0], "c":...}

The length of each list is exactly the same. I was wondering whether it's worth storing this kind of sparse list as index-value pairs, but I'm not sure whether there is any package that does this.

  • Welcome to SO. Unfortunately this isn't a discussion forum. Please take the time to read How to Ask and the links it contains. Commented Oct 15, 2017 at 15:32
  • Hi @wwii, is there anything hard to understand about the question? Commented Oct 15, 2017 at 15:36
  • ... is not a python object). In Python everything is an object. Commented Oct 15, 2017 at 15:36
  • Here I mean I do not want to use pickle.dump; instead, I'd like to find a method that stores the sparse list as a readable file. Sorry for the confusion; it should be updated now. Commented Oct 15, 2017 at 15:37
  • Also please let me know if you have any idea how to do that. Thanks! Commented Oct 15, 2017 at 15:39

2 Answers


Rather than saving the 0s, you should transform the sparse list into a dictionary of the non-zero values. For example,

{'a':[0,0,0,1,0,0,0,2,0,0,0,3]}

could become

{'a':{3:1, 7:2, 11:3}}

You could transform the lists easily enough with a dictionary comprehension:

compressed_data = {
    hashed_id: {
        index: value for index, value in enumerate(values) if value != 0
    } for hashed_id, values in original_data.items()
}

Then you can simply save that dictionary to a plain-text file (a JSON sketch follows after the decompression code below). After you load the compressed data back from the file, rebuild the full lists:

# DATA_LENGTH is the common length of the original lists
decompressed_data = {}
for hashed_id, values in loaded_data.items():
    decompressed_values = [0] * DATA_LENGTH
    for index, value in values.items():
        decompressed_values[index] = value
    decompressed_data[hashed_id] = decompressed_values
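
As a concrete example of the save/load step, here is a minimal sketch that uses JSON as the plain-text format (an assumption on my part; the answer doesn't prescribe a format). Note that JSON turns the integer indices into string keys, so they are converted back with int() on load; DATA_LENGTH and the file name are hypothetical:

import json

DATA_LENGTH = 100  # hypothetical common length of the original lists

# Save the compressed dictionary as plain text ('sparse_data.json' is an arbitrary name)
with open('sparse_data.json', 'w') as f:
    json.dump(compressed_data, f)

# Load it back, converting the stringified indices back to integers
with open('sparse_data.json') as f:
    loaded_data = {
        hashed_id: {int(index): value for index, value in values.items()}
        for hashed_id, values in json.load(f).items()
    }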

3 Comments

This is exactly what I'm looking for!
I'd argue it's not – it does fulfill your spec of a plain-text sparse-matrix format, but it still doesn't make sense to care about storage space and plain-textness at the same time. They are conflicting targets, and I think your problem is ill-posed in the first place. There have been binary sparse matrix formats at least as long as there has been FORTRAN. So, that's more than 40 years now. They are still used, for a reason, unlike sparse plain-text files, which make little sense – sparsity only matters for large (in computer terms, large) matrices, and for these having plain text doesn't…
This should be the accepted answer! As an alternative to {index: value for index, value in enumerate(values) if value != 0} you can also write dict(filter(itemgetter(1), enumerate(values))), depending on your preferred style.
import numpy as np
from scipy.sparse import csr_matrix, save_npz, load_npz

a = {'a': [0, 0, 1, 0], 'b': [1, 0, 0, 0], 'c': [1, 1, 0, 0]}

# Build a CSR sparse matrix from the values (you can use lil_matrix as well)
sparse1 = csr_matrix(np.array(list(a.values())))
print(sparse1)
print(sparse1.toarray())

# Save the values as a sparse matrix and the keys as a numpy array
save_npz('values.npz', sparse1)
np.save('keys.npy', np.array(list(a.keys())))

# Load both back and rebuild the dictionary (key -> sparse row)
sparse3 = load_npz('values.npz')
print(sparse3)
print(sparse3.toarray())
keys = np.load('keys.npy')
print(keys)

print(dict(zip(keys, sparse3)))
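
If the file itself must be plain text, as the question asks, one option (an add-on sketch, not part of this answer) is to write the non-zero entries of the sparse matrix as tab-separated (row, column, value) triplets:

# Hypothetical plain-text export: dump the non-zero entries of sparse1
# as (row, col, value) triplets; 'values.tsv' is an arbitrary file name.
coo = sparse1.tocoo()
with open('values.tsv', 'w') as f:
    f.write(f"# shape: {coo.shape[0]} {coo.shape[1]}\n")
    for r, c, v in zip(coo.row, coo.col, coo.data):
        f.write(f"{r}\t{c}\t{v}\n")

The keys could then be written out in the same row order (e.g. one per line) so that each row index maps back to its hashed id.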
