
We can save many arrays one after another, without having all of them in RAM at the same time:

import numpy as np

with open('test.npy', 'wb') as f:
    A = compute_my_np_array(1)
    np.save(f, A)
    # we could even do: del A  (not needed here: A is rebound by the next assignment anyway)

    A = compute_my_np_array(2)
    np.save(f, A)
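
The arrays can later be read back one at a time in the same way, by calling np.load repeatedly on the same file handle (a minimal sketch):

with open('test.npy', 'rb') as f:
    A = np.load(f)  # reads the first array, leaves the cursor after it
    B = np.load(f)  # reads the second array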

But this is uncompressed. For a compressed save, all arrays have to be available at the same time; see numpy.savez_compressed:

A = compute_my_np_array(1)
B = compute_my_np_array(2)
np.savez_compressed('test.npz', A=A, B=B)

TL;DR: how can I save compressed numpy arrays without having all of them in RAM at the same time?

  • First, your first method is a kludge: it works, but isn't documented. With compressed savez, the storage is a compressed zip archive; the arrays are each in their own npy file. I think, but am not sure, that compression is applied to the whole archive, not to the files individually. If you know the zip archive tools, you should be able to add npy files to an archive. I don't know if it can be done with a compressed one. Commented Oct 31, 2022 at 9:46
  • @hpaulj AFAIK the compression in a zip archive is done file by file, not on the whole archive, as opposed to formats like tar.gz that pack all the files into one big bundle which is then compressed. This makes it faster to append/fetch individual files, at the expense of a lower compression ratio. Not sure Numpy supports appending though (see the sketch after these comments). Commented Oct 31, 2022 at 10:14
  • h5py offers similar functionality if you want to use something out of the box: h5py.org. "It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want." It supports compression too (see the h5py sketch after these comments). Commented Oct 31, 2022 at 11:13
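
The zip-append idea from the comments can be made concrete. Below is a minimal sketch of it, assuming Python 3.6+ (for ZipFile.open with mode 'w'); the member names arr_1.npy / arr_2.npy are illustrative. Each member is deflate-compressed individually, and the resulting archive can be read back with np.load like a regular .npz file:

import zipfile
import numpy as np

with zipfile.ZipFile('test.npz', 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    for i in (1, 2):
        # Each array is computed, written and compressed one at a time,
        # so only one of them is in RAM at any moment.
        with zf.open(f'arr_{i}.npy', 'w') as f:
            np.save(f, compute_my_np_array(i))

with np.load('test.npz') as data:
    A = data['arr_1']  # members are decompressed lazily, on access

And here is a similarly hedged sketch of the h5py suggestion, with per-dataset compression (the dataset names and the gzip codec are illustrative choices):

import h5py

with h5py.File('test.h5', 'w') as f:
    for i in (1, 2):
        f.create_dataset(f'arr_{i}', data=compute_my_np_array(i),
                         compression='gzip')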

1 Answer


One solution is to use a package like gzip to open the file as a gzip stream instead of a raw binary file. Here is an example:

import gzip
import numpy as np

with gzip.open('test.npy.gz', 'wb') as f:
    A = compute_my_np_array(1)
    np.save(f, A)

The result is an npy file compressed with gzip. You also need to open it with gzip in order to read it back (with np.load, for example).
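
Reading it back could look like this (a minimal sketch, matching the file written above):

import gzip
import numpy as np

with gzip.open('test.npy.gz', 'rb') as f:
    A = np.load(f)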

Note that gzip compression is a bit slow for large data, although the compression ratio is relatively good. Other compression standards may better fit your needs. For example, Zstd (faster) and LZ4 (much faster) provide faster compression, certainly at the expense of a lower compression ratio (there is no free lunch). A sketch follows.
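
As an illustration, here is a sketch using the third-party lz4 package (assuming pip install lz4; its lz4.frame.open mirrors gzip.open, and the file name is illustrative):

import lz4.frame
import numpy as np

with lz4.frame.open('test.npy.lz4', 'wb') as f:
    np.save(f, compute_my_np_array(1))

with lz4.frame.open('test.npy.lz4', 'rb') as f:
    A = np.load(f)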


1 Comment

Yes, in my past experience numpy.save + LZ4 was quite good; I'll try this again!
