
I have a huge 2D NumPy array that's supposed to work as a co-occurrence matrix. I've tried scipy.sparse as my data structure, but dok_matrix indexing is incredibly slow (roughly 4x slower).

    # Impossible: N * N * 4 bytes of contiguous memory
    import numpy as np

    N = 1000000  # 1 million
    coo = np.zeros((N, N), dtype=np.uint32)

I want to persist this array.

After searching for ways to save it I tried PyTables and h5py, but I couldn't find a way to write the matrix without running out of memory. Simply using np.save:

    with open(name, 'wb') as _file:
        np.save(_file, coo)

For instance, using PyTables:

    import tables

    _file = tables.open_file(
        name,
        mode='w',
        title='Co-occurrence matrix')
    atom = tables.Atom.from_dtype(coo.dtype)
    _filters = tables.Filters(complib='blosc', complevel=5)
    ds = _file.create_earray(
        _file.root,
        'coo_matrix',
        atom,
        shape=(0, coo.shape[-1]),   # extendable along the first axis
        expectedrows=coo.shape[0],
        filters=_filters)
    # ds[:] = coo => not an option
    for row in coo:                 # append one row at a time
        ds.append(row[np.newaxis, :])
    _file.close()

And using h5py:

    import h5py

    h5f = h5py.File(name, 'w')
    h5f.create_dataset('dataset_1', data=coo)
    h5f.close()

Both methods keep increasing memory usage until I have to kill the process. So, is there any way to do this incrementally? If that's not possible, can you recommend another way to persist this matrix?

EDIT

I'm creating this co-occurrence matrix like this:

    from itertools import combinations

    coo = np.zeros((N, N), dtype=np.uint32)
    for doc_id, doc in enumerate(self.w.get_docs()):
        for w1, w2 in combinations(doc, 2):
            if w1 != w2:
                coo[w1, w2] += 1

I want to save coo (a 2D NumPy array) so that I can later retrieve it from disk and look up co-occurrence values, e.g. coo[w1, w2].

  • Just to satisfy my own curiosity: what is posterior loading? Commented Oct 16, 2015 at 22:35
  • Besides storing what are you trying to do with this array? Change individual values, access them, access slices, math? Commented Oct 17, 2015 at 0:48
  • There is the np.savez_compressed option, which is very fast and compact to move data around... Commented Oct 17, 2015 at 0:56
  • @SaulloCastro, same problem, try in your ipython: np.savez_compressed('testing.npz', coo=np.zeros((N, N), dtype=np.uint32)) Commented Oct 17, 2015 at 4:19
  • Is coo dense or sparse? I can't create, let alone store, a 10**6 x 10**6 uint32 matrix on my system (that requires 4 TB of memory, if my math is right). Commented Oct 17, 2015 at 6:21

1 Answer


np.save is a fast, efficient way of saving a dense array. All it does is write a small header and then the data buffer of the array.

But for a large array, that data buffer holds N*N*4 bytes (for your uint32 dtype) in one contiguous memory block - with N = 1,000,000 that is 4 TB. That design is also what makes element access fast - the code knows exactly where the i,j element is located.

Beware that np.zeros((N, N)) does not allocate all the necessary memory at once; the operating system maps the pages lazily, so memory use may keep growing while the array is used (including while saving it).

np.savez does not help with data storage. It does a save for each variable and collects the resulting files in a zip archive (compressed, if you use np.savez_compressed).

tables and h5py can save and load chunks, but that doesn't help if you have to have the whole array in memory at some point - for creation or use.
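That said, if you can produce the matrix one row at a time, h5py can at least write it incrementally into a chunked, compressed dataset. A minimal sketch, assuming a hypothetical row_generator() that yields rows of length N:

    import h5py
    import numpy as np

    N = 1000000
    with h5py.File('coo.h5', 'w') as f:
        # chunked + compressed: only one row is in RAM at a time
        dset = f.create_dataset('coo', shape=(N, N), dtype=np.uint32,
                                chunks=(1, N), compression='gzip')
        for i, row in enumerate(row_generator()):  # hypothetical row source
            dset[i, :] = row

Note that all N*N*4 bytes still stream through the compressor, so this is only practical for much smaller N.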

Since your array will be very sparse, a scipy sparse matrix could save memory, since it only stores the nonzero elements. But it also has to store each element's coordinates, so storage per nonzero element isn't as compact. There are a number of formats, each with its pros and cons.

dok uses a Python dictionary to store the data, keyed by (i, j) tuples. It is one of the better formats for incrementally building a sparse matrix. I found in other SO questions that element access on a dok is slower than on a plain dictionary - it is faster to build a regular dictionary first and then update the dok, as in the sketch below.
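A minimal sketch, assuming docs is an iterable of token-id sequences and N is as in your question:

    # Count pairs in a plain dict first, then load the result into a dok.
    from collections import defaultdict
    from itertools import combinations
    import numpy as np
    from scipy import sparse

    counts = defaultdict(int)
    for doc in docs:                      # docs: assumed iterable of token ids
        for w1, w2 in combinations(doc, 2):
            if w1 != w2:
                counts[w1, w2] += 1

    m = sparse.dok_matrix((N, N), dtype=np.uint32)
    for key, val in counts.items():
        m[key] = val   # item by item; dok.update() is disabled in newer scipy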

lil is another good format for incremental builds. It stores the data in 2 lists of lists.

coo is convenient for building a matrix, once you have a full set of i,j,data arrays.
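For example, the counts dict from the sketch above can go straight to coo:

    # Build COO directly from the counts dict (keys are (w1, w2) pairs).
    rows, cols = zip(*counts)
    vals = list(counts.values())
    m = sparse.coo_matrix((vals, (rows, cols)), shape=(N, N), dtype=np.uint32)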

csr and csc are good for computation (especially linear-algebra kinds) and for element access. But they are no good for changing sparsity (adding nonzero elements).

But you can build a matrix in one format, and readily convert it to another for use, or storage.
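Continuing the sketch, converting for lookups:

    csr = m.tocsr()     # fast element access
    csr[w1, w2]         # a co-occurrence count, for any pair of word ids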

There have been SO questions about storing sparse matrices. The easiest is the MATLAB-compatible .mat format (which stores sparse matrices as csc). To use np.save/np.savez you need to save the underlying arrays (data, row, col for coo; data, indices, indptr for csr and csc). dok and lil have to be saved with Python pickle.
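A sketch of both options, using the csr matrix from above:

    from scipy import io, sparse
    import numpy as np

    # Option 1: MATLAB-compatible .mat (stores the matrix as csc)
    io.savemat('coo.mat', {'m': csr})
    m2 = io.loadmat('coo.mat')['m']

    # Option 2: np.savez on the underlying csr arrays
    np.savez('coo_csr.npz', data=csr.data, indices=csr.indices,
             indptr=csr.indptr, shape=csr.shape)
    npz = np.load('coo_csr.npz')
    m3 = sparse.csr_matrix((npz['data'], npz['indices'], npz['indptr']),
                           shape=tuple(npz['shape']))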

Do a search on [scipy] large sparse to see other SO questions about this kind of matrix. You aren't the first to use numpy/scipy for co-occurrence calculations on documents (it's one of the 3 main uses of scipy sparse, the others being linear algebra and machine learning).
