6

Right now I have a Python program building a fairly large 2D numpy array and saving it as a tab-delimited text file using numpy.savetxt. The numpy array contains only floats. I then read the file in one row at a time in a separate C++ program.

What I would like to do is accomplish this same task while changing my code as little as possible, so that I can decrease the size of the file I am passing between the two programs.

I found that I can use numpy.savetxt to save to a compressed .gz file instead of a plain text file. This lowers the file size from ~2 MB to ~100 kB.

Is there a better way to do this? Could I, perhaps, write the numpy array in binary to the file to save space? If so, how would I do this so that I can still read it into the C++ program?

Thank you for the help. I appreciate any guidance I can get.

EDIT:

There are a lot of zeros (probably 70% of the values in the numpy array are 0.0000). I am not sure how I can exploit this, though, to generate a tiny file that my C++ program can read in.

1
  • Just a thought - do you have to write it out at all? If the programs are running concurrently (or can be made to run concurrently), you can use any of various methods to stream the data from one to the other: a named pipe, a TCP socket, shared memory, etc.

5 Answers

3

Since you have a lot of zeroes, you could write out only the non-zero elements, in the form (row index, column index, value).

Suppose you have an array with a small number of nonzero elements:

In [5]: a = np.zeros((10, 10))

In [6]: a
Out[6]: 
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [7]: a[3,1] = 2.0

In [8]: a[7,4] = 17.0

In [9]: a[9,0] = 1.5

First, isolate the interesting numbers and their indices:

In [11]: x, y = a.nonzero()

In [12]: list(zip(x, y))
Out[12]: [(3, 1), (7, 4), (9, 0)]

In [13]: nonzero = list(zip(x, y))

Now you only have a small number of data elements left. The easiest thing is to write them to a text file:

In [17]: with open('numbers.txt', 'w+') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write('{:d} {:d} {:g}\n'.format(r, k, a[r,k]))
   ....:         

In [18]: cat numbers.txt
3 1 2
7 4 17
9 0 1.5

This also gives you an opportunity to eyeball the data. In your C++ program you can read this data with fscanf.
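A minimal sketch of that C++ reader, assuming the file is named numbers.txt as above (the printf is only a stand-in for however you store each element):

#include <cstdio>

int main() {
    // Read the "row col value" triples written by the Python loop above.
    std::FILE *f = std::fopen("numbers.txt", "r");
    if (!f) return 1;

    int row, col;
    double value;
    while (std::fscanf(f, "%d %d %lf", &row, &col, &value) == 3) {
        // Stand-in: fill your own dense or sparse structure here instead.
        std::printf("a[%d][%d] = %g\n", row, col, value);
    }

    std::fclose(f);
    return 0;
}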

But you can reduce the size even more by writing binary data using struct:

In [17]: import struct

In [19]: c = struct.Struct('=IId')

In [20]: with open('numbers.bin', 'wb') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write(c.pack(r, k, a[r,k]))

The format string passed to the Struct constructor means: use the native byte order with standard sizes and no padding ('='); the first and second data elements are unsigned 32-bit integers ('I'), and the third element is a double ('d').

In your C++ program this data is probably best read as binary data into a packed struct.
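A sketch of that reader, assuming the records were packed with '=IId' as above and that both programs run on the same architecture (the file name is again an assumption):

#include <cstdint>
#include <cstdio>
#include <fstream>

// Mirrors struct.Struct('=IId'): two 32-bit unsigned ints followed by a double,
// with no padding between the fields.
#pragma pack(push, 1)
struct Record {
    std::uint32_t row;
    std::uint32_t col;
    double value;
};
#pragma pack(pop)

int main() {
    std::ifstream in("numbers.bin", std::ios::binary);
    Record rec;
    while (in.read(reinterpret_cast<char *>(&rec), sizeof rec)) {
        // Stand-in: insert into your own data structure here instead.
        std::printf("a[%u][%u] = %g\n", rec.row, rec.col, rec.value);
    }
    return 0;
}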

EDIT: Answer updated for a 2D array.


5 Comments

Although the array is 2-dimensional, you can still use a single index to refer to the non-zero elements by considering the flat array.
I totally overlooked that it was a 2D array. Oops.
Is there a way to adjust the code to give me both indices (since it's a 2D array)? Or is adding another loop the only way? Thank you for the very thorough reply.
Fantastic. One last question. Is there a way to specify the number of decimal places to use when writing the values to the file(s)? For example, using np.savetxt, I could specify four decimal places. Is there an easy way to accomplish this?
Yes. Using the format string in the outf.write call, like "{:d} {:d} {:.4f}".format(r, k, a[r,k])
3

Unless you are sure you don't need to worry about endianness and such, it is best to use numpy.savez, as explained in @unutbu's answer and @jorgeca's comment here: numpy's tostring/fromstring --- what do I need to specify to restore the array.

If the resulting size is not small enough, there's always zlib (on Python's side: import zlib; on the C++ side, zlib is a C library you can call directly).
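For instance, the .gz file that numpy.savetxt already produces can be read line by line on the C++ side with zlib's gzFile interface (link with -lz); the file name and the 64 kB line buffer below are assumptions:

#include <cstdlib>
#include <vector>
#include <zlib.h>

int main() {
    gzFile f = gzopen("data.gz", "rb");
    if (!f) return 1;

    const int kLineLen = 65536;  // assumed long enough for one tab-delimited row
    char line[kLineLen];
    while (gzgets(f, line, kLineLen)) {
        // Parse the tab-separated floats of this row.
        std::vector<double> row;
        char *p = line, *end = nullptr;
        for (double v = std::strtod(p, &end); end != p; v = std::strtod(p, &end)) {
            row.push_back(v);
            p = end;
        }
        // ... use row ...
    }

    gzclose(f);
    return 0;
}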

An alternative would be to use the HDF5 format: while it does not necessarily reduce the on-disk file size, it does make saving/loading faster; handling large data arrays is what the format was designed for. There are both Python and C++ readers/writers for HDF5.

2 Comments

I seem to be misunderstanding something here. Using numpy.savez saves a zip of my array, but it is not compressed and thus not any smaller. Is there an advantage to doing that instead of specifying a .gz extension in numpy.savetxt (which compresses the file to ~100kB)? I greatly appreciate the help.
The advantage of save/savez is mainly portability. If you are sure you'll only load your files on the same architecture they were saved on, you probably don't need to bother with these. HDF5, though, is still a better option IMO, unless what you're doing is a throw-away one-timer.
1

numpy.ndarray.tofile and numpy.fromfile are useful for direct binary output/input from Python. std::ostream::write and std::istream::read are useful for binary output/input in C++.

You should be careful about endianness if the data are transferred from one machine to another.
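A sketch of reading such a dump in C++: tofile writes the doubles back to back with no header, so the shape is lost, and the file name and row/column counts below are assumptions that both programs have to agree on (or exchange some other way):

#include <cstddef>
#include <fstream>
#include <vector>

int main() {
    // Must match the shape of the numpy array written with a.tofile('array.bin').
    const std::size_t rows = 1000, cols = 500;
    std::vector<double> data(rows * cols);

    std::ifstream in("array.bin", std::ios::binary);
    in.read(reinterpret_cast<char *>(data.data()),
            static_cast<std::streamsize>(data.size() * sizeof(double)));

    // Element (i, j) of the original 2D array is data[i * cols + j].
    return 0;
}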

5 Comments

Can you explain how to read a file written using ndarray.tofile in C++?
When using ndarray.tofile() my resulting file is actually slightly larger than if I use numpy.savetxt. Is there an argument I might be missing to tell it to output pure binary?
@user1764386 Does the array have special properties? e.g. is it all zeros or something else that has a concise text representation?
There are a lot of zeros (probably 70% of the values in the numpy array are 0.0000). I am not sure how I can exploit this, though, to generate a tiny file that my C++ program can read in.
1

Use an HDF5 file; they are really simple to use through h5py, and you can set a compression flag. Note that HDF5 also has a C++ interface.
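A rough sketch of the C++ side using the HDF5 C++ API (link with -lhdf5_cpp -lhdf5); the file name "data.h5" and the dataset name "array" are assumptions and must match whatever the h5py side created, e.g. f.create_dataset('array', data=a, compression='gzip'):

#include <vector>
#include "H5Cpp.h"

int main() {
    H5::H5File file("data.h5", H5F_ACC_RDONLY);
    H5::DataSet dset = file.openDataSet("array");

    // Query the 2D shape, then read the whole dataset as native doubles.
    hsize_t dims[2];
    dset.getSpace().getSimpleExtentDims(dims, nullptr);

    std::vector<double> data(dims[0] * dims[1]);
    dset.read(data.data(), H5::PredType::NATIVE_DOUBLE);

    // Element (i, j) is data[i * dims[1] + j].
    return 0;
}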


0

If you don't mind installing additional packages (for both Python and C++), you can use BSON (Binary JSON).

1 Comment

BSON files are, even according to the authors, rarely smaller than equivalent JSON files. It has its merits, but saving space is not among them.
