
We can save many arrays one after another, without having all of them in RAM at the same time:

import numpy as np

with open('test.npy', 'wb') as f:
    A = compute_my_np_array(1)
    np.save(f, A)
    # we could even do: del A  (not needed here: A is rebound by the next assignment anyway)

    A = compute_my_np_array(2)
    np.save(f, A)
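
The arrays can later be read back one at a time in the same way, by calling np.load repeatedly on the same file handle (a minimal sketch):

with open('test.npy', 'rb') as f:
    A = np.load(f)  # reads the first array, leaves the cursor after it
    B = np.load(f)  # reads the second array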

But this is uncompressed. For a compressed save, all arrays have to be available at the same time; see numpy.savez_compressed:

A = compute_my_np_array(1)
B = compute_my_np_array(2)
np.savez_compressed('test.npz', A=A, B=B)

TL;DR: how can I save compressed numpy arrays without having all of them in RAM at the same time?

  • First, your first method is a kludge: it works, but isn't documented. With compressed savez, the storage is a compressed zip archive; the arrays are each in their own npy file. I think, but am not sure, that compression is applied to the whole archive, not to the files individually. If you know the zip archive tools, you should be able to add npy files to an archive. I don't know if it can be done with a compressed one. Commented Oct 31, 2022 at 9:46
  • @hpaulj AFAIK the compression in a zip archive is done file by file, not on the whole archive, as opposed to formats like tar.gz that pack all the files into one big bundle which is then compressed. This makes it faster to append/fetch individual files, at the expense of a lower compression ratio. Not sure Numpy supports appending though (see the sketch after these comments). Commented Oct 31, 2022 at 10:14
  • h5py offers similar functionality if you want to use something out of the box: h5py.org. "It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want." It supports compression too (see the h5py sketch after these comments). Commented Oct 31, 2022 at 11:13
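
The zip-append idea from the comments can be made concrete. Below is a minimal sketch of it, assuming Python 3.6+ (for ZipFile.open with mode 'w'); the member names arr_1.npy / arr_2.npy are illustrative. Each member is deflate-compressed individually, and the resulting archive can be read back with np.load like a regular .npz file:

import zipfile
import numpy as np

with zipfile.ZipFile('test.npz', 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    for i in (1, 2):
        # Each array is computed, written and compressed one at a time,
        # so only one of them is in RAM at any moment.
        with zf.open(f'arr_{i}.npy', 'w') as f:
            np.save(f, compute_my_np_array(i))

with np.load('test.npz') as data:
    A = data['arr_1']  # members are decompressed lazily, on access

And here is a similarly hedged sketch of the h5py suggestion, with per-dataset compression (the dataset names and the gzip codec are illustrative choices):

import h5py

with h5py.File('test.h5', 'w') as f:
    for i in (1, 2):
        f.create_dataset(f'arr_{i}', data=compute_my_np_array(i),
                         compression='gzip')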

1 Answer


One solution is to use a package like gzip to open the file as a gzip stream instead of a raw binary file. Here is an example:

import gzip
import numpy as np

with gzip.open('test.npy.gz', 'wb') as f:
    A = compute_my_np_array(1)
    np.save(f, A)

The result is an npy file compressed with gzip. You also need to open it with gzip in order to read it back (with np.load, for example).
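
Reading it back could look like this (a minimal sketch, matching the file written above):

import gzip
import numpy as np

with gzip.open('test.npy.gz', 'rb') as f:
    A = np.load(f)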

Note that gzip compression is a bit slow for large data, although the compression ratio is relatively good. Other compression standards may better fit your needs. For example, Zstd (faster) and LZ4 (much faster) provide faster compression, certainly at the expense of a lower compression ratio (there is no free lunch). A sketch follows.
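
As an illustration, here is a sketch using the third-party lz4 package (assuming pip install lz4; its lz4.frame.open mirrors gzip.open, and the file name is illustrative):

import lz4.frame
import numpy as np

with lz4.frame.open('test.npy.lz4', 'wb') as f:
    np.save(f, compute_my_np_array(1))

with lz4.frame.open('test.npy.lz4', 'rb') as f:
    A = np.load(f)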


1 Comment

Yes, in my past experience numpy.save + LZ4 was quite good; I'll try this again!
