
My memory is too small for my data, so I tried packing it in memory.

The following code does work, but I have to remember the type of the data, which is kind of awkward (I have lots of different data types).

Any better suggestions? A smaller running time would also be appreciated.

import numpy as np
import zlib

A = np.arange(10000)
dtype = A.dtype

B = zlib.compress(A, 1)
C = np.frombuffer(zlib.decompress(B), dtype)  # np.fromstring is deprecated
np.testing.assert_allclose(A, C)
  • You may want to use the blosc package instead of Python's zlib and bz2 implementations for a significant speedup. Commented Jan 10, 2018 at 17:24
  • The speed increase of blosc is indeed impressive and the compression ratio is good as well. You helped me a lot. Commented Jan 11, 2018 at 16:06
  • Nice to know :). Some further pointers: blosc.set_nthreads(6). compr_arr = blosc.pack_array(numpy_arr); numpy_arr = blosc.unpack_array(compr_arr) preserves shape and dtype internally. Commented Jan 11, 2018 at 16:14

3 Answers


You could try using NumPy's built-in array compressor, np.savez_compressed(). This saves you the hassle of keeping track of the data types, but will probably give similar performance to your method. Here's an example:

import io
import numpy as np

A = np.arange(10000)
compressed_array = io.BytesIO()    # np.savez_compressed() requires a file-like object to write to
np.savez_compressed(compressed_array, A)

# load it back
compressed_array.seek(0)    # seek back to the beginning of the file-like object
decompressed_array = np.load(compressed_array)['arr_0']

>>> print(len(compressed_array.getvalue()))    # compressed array size
15364
>>> assert A.dtype == decompressed_array.dtype
>>> assert all(A == decompressed_array)

Note that any size reduction depends on the distribution of your data. Random data is inherently incompressible, so you might not see much benefit by attempting to compress it.
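A quick way to check this on your own data is to compare compressed sizes for a regular array and a (near-)random one; the sketch below uses plain zlib (the exact numbers depend on your data and compression level):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
structured = np.arange(10000)            # highly regular: compresses well
noise = rng.integers(0, 2**63, 10000)    # near-random: barely compresses

packed_structured = zlib.compress(structured.tobytes(), 1)
packed_noise = zlib.compress(noise.tobytes(), 1)

print("structured:", structured.nbytes, "->", len(packed_structured))
print("random:    ", noise.nbytes, "->", len(packed_noise))
```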


4 Comments

The "file-like" object is interesting; however, the packing is about a factor of 10 slower. The data is compressible, without much noise: I see an average ratio of about 8 to 10.
@Okapi575 yes, now that I have tested it with timeit I can confirm that np.savez_compressed() is about 10x slower. The only advantage is that the data type is automatically saved; however, it would be easy to write a class that wraps zlib compress and decompress and stores the data type.
@Okapi575: I also tried bz2, but it is also much slower than zlib, albeit a much more effective compressor.
bz2 is far more effective in my example, and probably still quicker than writing stuff to disk. Nice to know.
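The zlib-versus-bz2 trade-off discussed above can be checked with a small sketch like this (sizes and timings depend entirely on your data; here a regular array is used, where both compress well):

```python
import bz2
import zlib
import numpy as np

A = np.arange(100000)
raw = A.tobytes()

z = zlib.compress(raw, 1)   # fast, moderate ratio
b = bz2.compress(raw, 9)    # slower, usually a tighter ratio on regular data

print("raw:", len(raw), "zlib:", len(z), "bz2:", len(b))

# both round-trip once the dtype is reapplied
assert np.array_equal(np.frombuffer(zlib.decompress(z), A.dtype), A)
assert np.array_equal(np.frombuffer(bz2.decompress(b), A.dtype), A)
```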

I want to post my final code, in case it helps anyone. It can compress in RAM with different packing algorithms or, if there is not enough RAM, store the data in an HDF5 file. Any speedups or advice for better code would be appreciated.

import zlib, bz2
import numpy as np
import h5py
import os

class packdataclass():
    def __init__(self, packalg='nocompress', Filename=None):
        self.packalg = packalg
        if self.packalg == 'hdf5_on_drive':
            self.Filename = Filename
            self.Running_Number = 0
            if os.path.isfile(Filename):
                os.remove(Filename)
            with h5py.File(self.Filename, 'w') as hdf5_file:
                hdf5_file.create_dataset("TMP_File", data="0")

    def clean_up(self):
        if self.packalg == 'hdf5_on_drive':
            if os.path.isfile(self.Filename):
                os.remove(self.Filename)

    def compress(self, array):
        Returndict = {'compression': self.packalg, 'type': array.dtype}
        if array.dtype == np.bool_:
            # pack 8 bools into one byte; remember the original length
            Returndict['len_bool_array'] = len(array)
            array = np.packbits(array.astype(np.uint8))
            Returndict['type'] = 'bitfield'
        if self.packalg == 'nocompress':
            Returndict['data'] = array
        elif self.packalg == 'zlib':
            Returndict['data'] = zlib.compress(array, 1)
        elif self.packalg == 'bz2':
            Returndict['data'] = bz2.compress(array, 1)
        elif self.packalg == 'hdf5_on_drive':
            with h5py.File(self.Filename, 'r+') as hdf5_file:
                Returndict['data'] = str(self.Running_Number)
                hdf5_file.create_dataset(Returndict['data'], data=array,
                                         dtype=array.dtype, compression='gzip',
                                         compression_opts=4)
            self.Running_Number += 1
        else:
            raise ValueError("Algorithm for packing {} is unknown".format(self.packalg))
        return Returndict

    def decompress(self, data):
        is_bitfield = (not isinstance(data['type'], np.dtype)
                       and data['type'] == 'bitfield')
        if data['compression'] == 'nocompress':
            data_decompressed = data['data']
        elif data['compression'] == 'hdf5_on_drive':
            # the HDF5 dataset already restores dtype and shape
            with h5py.File(self.Filename, "r") as Readfile:
                data_decompressed = np.array(Readfile[data['data']])
        else:
            if data['compression'] == 'zlib':
                raw = zlib.decompress(data['data'])
            elif data['compression'] == 'bz2':
                raw = bz2.decompress(data['data'])
            else:
                raise ValueError("Algorithm for unpacking {} is unknown".format(data['compression']))
            # np.fromstring is deprecated; frombuffer avoids copying the bytes
            dtype = np.uint8 if is_bitfield else data['type']
            data_decompressed = np.frombuffer(raw, dtype)
        if is_bitfield:
            return np.unpackbits(data_decompressed).astype(bool)[:data['len_bool_array']]
        return data_decompressed
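The bool-to-bitfield trick used in compress()/decompress() above can be tried on its own; a minimal sketch of the round trip:

```python
import numpy as np

mask = np.arange(20) % 3 == 0   # an example boolean array
packed = np.packbits(mask)      # 8 bools per byte; the last byte is zero-padded
restored = np.unpackbits(packed).astype(bool)[:len(mask)]  # trim the padding

assert np.array_equal(mask, restored)
```

This is why the class stores len_bool_array: without the original length, the padding bits of the last byte could not be distinguished from real values.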



You could try bcolz, which I just found when googling for an answer to a similar problem: https://bcolz.readthedocs.io/en/latest/intro.html

It's an additional layer on top of numpy arrays which organises compression for you.

1 Comment

This project is rarely updated anymore which is a shame.
