
My memory is too small for my data, so I tried packing it in memory.

The following code does work, but I have to remember the type of the data, which is kind of awkward (I have lots of different data types).

Any better suggestions? A smaller running time would also be appreciated.

import numpy as np
import zlib

A = np.arange(10000)
dtype = A.dtype

B = zlib.compress(A, 1)
C = np.frombuffer(zlib.decompress(B), dtype)  # np.fromstring is deprecated
np.testing.assert_allclose(A, C)
  • You may want to use the blosc package instead of Python's zlib and bz2 implementations for a significant speedup. Commented Jan 10, 2018 at 17:24
  • The speed increase of blosc is indeed impressive and the compression ratio is good as well. You helped me a lot. Commented Jan 11, 2018 at 16:06
  • Nice to know :). Some further pointers: blosc.set_nthreads(6). compr_arr = blosc.pack_array(numpy_arr); numpy_arr = blosc.unpack_array(compr_arr) preserves shape and dtype internally. Commented Jan 11, 2018 at 16:14

3 Answers


You could try using NumPy's built-in array compressor, np.savez_compressed(). This saves you the hassle of keeping track of the data types, but will probably give similar performance to your method. Here's an example:

import io
import numpy as np

A = np.arange(10000)
compressed_array = io.BytesIO()    # np.savez_compressed() requires a file-like object to write to
np.savez_compressed(compressed_array, A)

# load it back
compressed_array.seek(0)    # seek back to the beginning of the file-like object
decompressed_array = np.load(compressed_array)['arr_0']

>>> print(len(compressed_array.getvalue()))    # compressed array size
15364
>>> assert A.dtype == decompressed_array.dtype
>>> assert all(A == decompressed_array)

Note that any size reduction depends on the distribution of your data. Random data is inherently incompressible, so you might not see much benefit by attempting to compress it.
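A quick way to check this on your own data is to compare compressed sizes for a regular array and a (near-)random one; the sketch below uses plain zlib (the exact numbers depend on your data and compression level):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
structured = np.arange(10000)            # highly regular: compresses well
noise = rng.integers(0, 2**63, 10000)    # near-random: barely compresses

packed_structured = zlib.compress(structured.tobytes(), 1)
packed_noise = zlib.compress(noise.tobytes(), 1)

print("structured:", structured.nbytes, "->", len(packed_structured))
print("random:    ", noise.nbytes, "->", len(packed_noise))
```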


4 Comments

The "file-like" object is interesting; however, the packing is about a factor of 10 slower. The data is compressible, without much noise: I see an average ratio of about 8 to 10.
@Okapi575 yes, now that I have tested it with timeit I can confirm that np.savez_compressed() is about 10x slower. The only advantage is that the data type is automatically saved; however, it would be easy to write a class that wraps zlib compress and decompress and stores the data type.
@Okapi575: I also tried bz2, but it is also much slower than zlib, albeit a much more effective compressor.
bz2 is far more effective in my example, and probably still quicker than writing stuff to disk. Nice to know.
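The zlib-versus-bz2 trade-off discussed above can be checked with a small sketch like this (sizes and timings depend entirely on your data; here a regular array is used, where both compress well):

```python
import bz2
import zlib
import numpy as np

A = np.arange(100000)
raw = A.tobytes()

z = zlib.compress(raw, 1)   # fast, moderate ratio
b = bz2.compress(raw, 9)    # slower, usually a tighter ratio on regular data

print("raw:", len(raw), "zlib:", len(z), "bz2:", len(b))

# both round-trip once the dtype is reapplied
assert np.array_equal(np.frombuffer(zlib.decompress(z), A.dtype), A)
assert np.array_equal(np.frombuffer(bz2.decompress(b), A.dtype), A)
```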

I want to post my final code, in case it helps anyone. It can compress in RAM with different packing algorithms or, if there is not enough RAM, store the data in an HDF5 file. Any speedups or advice for better code would be appreciated.

import zlib, bz2
import numpy as np
import h5py
import os

class packdataclass():
    def __init__(self, packalg='nocompress', Filename=None):
        self.packalg = packalg
        if self.packalg == 'hdf5_on_drive':
            self.Filename = Filename
            self.Running_Number = 0
            if os.path.isfile(Filename):
                os.remove(Filename)
            with h5py.File(self.Filename, 'w') as hdf5_file:
                hdf5_file.create_dataset("TMP_File", data="0")

    def clean_up(self):
        if self.packalg == 'hdf5_on_drive':
            if os.path.isfile(self.Filename):
                os.remove(self.Filename)

    def compress(self, array):
        Returndict = {'compression': self.packalg, 'type': array.dtype}
        if array.dtype == np.bool_:
            # pack 8 bools into one byte; remember the original length
            Returndict['len_bool_array'] = len(array)
            array = np.packbits(array.astype(np.uint8))
            Returndict['type'] = 'bitfield'
        if self.packalg == 'nocompress':
            Returndict['data'] = array
        elif self.packalg == 'zlib':
            Returndict['data'] = zlib.compress(array, 1)
        elif self.packalg == 'bz2':
            Returndict['data'] = bz2.compress(array, 1)
        elif self.packalg == 'hdf5_on_drive':
            with h5py.File(self.Filename, 'r+') as hdf5_file:
                Returndict['data'] = str(self.Running_Number)
                hdf5_file.create_dataset(Returndict['data'], data=array,
                                         dtype=array.dtype, compression='gzip',
                                         compression_opts=4)
            self.Running_Number += 1
        else:
            raise ValueError("Algorithm for packing {} is unknown".format(self.packalg))
        return Returndict

    def decompress(self, data):
        is_bitfield = (not isinstance(data['type'], np.dtype)
                       and data['type'] == 'bitfield')
        if data['compression'] == 'nocompress':
            data_decompressed = data['data']
        elif data['compression'] == 'hdf5_on_drive':
            # the HDF5 dataset already restores dtype and shape
            with h5py.File(self.Filename, "r") as Readfile:
                data_decompressed = np.array(Readfile[data['data']])
        else:
            if data['compression'] == 'zlib':
                raw = zlib.decompress(data['data'])
            elif data['compression'] == 'bz2':
                raw = bz2.decompress(data['data'])
            else:
                raise ValueError("Algorithm for unpacking {} is unknown".format(data['compression']))
            # np.fromstring is deprecated; frombuffer avoids copying the bytes
            dtype = np.uint8 if is_bitfield else data['type']
            data_decompressed = np.frombuffer(raw, dtype)
        if is_bitfield:
            return np.unpackbits(data_decompressed).astype(bool)[:data['len_bool_array']]
        return data_decompressed
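The bool-to-bitfield trick used in compress()/decompress() above can be tried on its own; a minimal sketch of the round trip:

```python
import numpy as np

mask = np.arange(20) % 3 == 0   # an example boolean array
packed = np.packbits(mask)      # 8 bools per byte; the last byte is zero-padded
restored = np.unpackbits(packed).astype(bool)[:len(mask)]  # trim the padding

assert np.array_equal(mask, restored)
```

This is why the class stores len_bool_array: without the original length, the padding bits of the last byte could not be distinguished from real values.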



You could try bcolz, which I just found when googling for an answer to a similar problem: https://bcolz.readthedocs.io/en/latest/intro.html

It's an additional layer on top of numpy arrays which organises compression for you.

1 Comment

This project is rarely updated anymore which is a shame.
