I have a huge 2D numpy array that's supposed to work as a co-occurrence matrix. I've tried to use scipy.sparse as my data structure, but dok_matrix indexing is incredibly slow (about 4 times slower than dense indexing).
# Impossible: a dense (N, N) uint32 array needs ~4 TB of memory
import numpy as np
N = 1000000  # 1 million
coo = np.zeros((N, N), dtype=np.uint32)
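A quick back-of-the-envelope check of why this allocation fails (uint32 is 4 bytes per cell):

```python
N = 1000000
bytes_needed = N * N * 4        # 4 bytes per uint32 cell
print(bytes_needed / 2**40)     # roughly 3.6 TiB, i.e. ~4 TB
```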
I want to persist this array.
After searching for ways to save it, I tried PyTables and h5py, but I couldn't find a way to save it without running out of memory.
with open(name, 'wb') as _file:  # must be binary mode for np.save
    np.save(_file, coo)
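As an aside (not part of the original attempts), np.memmap keeps a dense array file-backed instead of in RAM, so it can be updated and persisted incrementally. A minimal sketch with a small illustrative N and a hypothetical filename `coo.dat` (the real N = 1000000 would still need ~4 TB of disk for a dense array):

```python
import numpy as np

N = 1000  # illustrative; scaled down from the real N = 1000000
coo = np.memmap('coo.dat', dtype=np.uint32, mode='w+', shape=(N, N))
coo[42, 7] += 1   # updates go straight to the file-backed buffer
coo.flush()
del coo           # close the memmap

# Reopen later; only the pages you touch are read from disk
coo2 = np.memmap('coo.dat', dtype=np.uint32, mode='r', shape=(N, N))
```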
For instance, using PyTables:
import tables

_file = tables.openFile(
    name,
    mode='w',
    title='Co-occurrence matrix')
atom = tables.Atom.from_dtype(coo.dtype)
_filters = tables.Filters(complib='blosc', complevel=5)
ds = _file.createEArray(
    _file.root,
    'coo_matrix',
    atom,
    shape=(0, coo.shape[-1]),
    expectedrows=coo.shape[-1],
    filters=_filters)

# ds[:] = coo => not an option
for _index, _data in enumerate(coo):
    ds.append(coo[_index][np.newaxis, :])
_file.close()
And using h5py:
import h5py

h5f = h5py.File(name, 'w')
h5f.create_dataset('dataset_1', data=coo)
h5f.close()
Both methods keep increasing memory usage until I have to kill the process. So, is there any way to do it incrementally? If it's not possible to do it can you recommend another way for persisting this matrix?
EDIT
I'm creating this co-occurrence matrix like this:
coo = np.zeros((N, N), dtype=np.uint32)
for doc_id, doc in enumerate(self.w.get_docs()):
    for w1, w2 in combinations(doc, 2):
        if w1 != w2:
            coo[w1, w2] += 1
I want to save coo (a 2D numpy array) so I can retrieve it from disk later and look up co-occurrence values, e.g. coo[w1, w2].
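Since the vast majority of the 10**12 cells will be zero, a sketch of the sparse route (assuming scipy is available; `docs` here is an illustrative stand-in for `self.w.get_docs()`): count pairs in a plain dict, build a CSR matrix, and persist it with scipy.sparse.save_npz. The loaded matrix still supports coo[w1, w2] lookups:

```python
from collections import Counter
from itertools import combinations

import numpy as np
import scipy.sparse as sp

docs = [[0, 1, 2], [1, 2, 2, 3]]  # illustrative documents of token ids
N = 4                             # illustrative vocabulary size

# Dict increments are fast, unlike dok_matrix indexing
counts = Counter()
for doc in docs:
    for w1, w2 in combinations(doc, 2):
        if w1 != w2:
            counts[(w1, w2)] += 1

# Build a sparse matrix from (row, col, value) triplets in one shot
rows, cols = zip(*counts.keys())
data = np.fromiter(counts.values(), dtype=np.uint32)
coo = sp.coo_matrix((data, (rows, cols)), shape=(N, N)).tocsr()

sp.save_npz('coo.npz', coo)     # compact on-disk format
coo2 = sp.load_npz('coo.npz')   # supports lookups like coo2[w1, w2]
```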
Comments:
There's the np.savez_compressed option, which is very fast and compact for moving data around. Is coo dense or sparse? I can't create, let alone store, a 10**6 x 10**6 uint32 matrix on my system (requires 4 TB of memory, if my math is right).