6

Right now I have a Python program building a fairly large 2D numpy array and saving it as a tab-delimited text file using numpy.savetxt. The numpy array contains only floats. I then read the file in one row at a time in a separate C++ program.

What I would like to do is accomplish this same task while changing my code as little as possible, so that I can decrease the size of the file I am passing between the two programs.

I found that I can use numpy.savetxt to save to a compressed .gz file instead of a plain text file. This lowers the file size from ~2 MB to ~100 kB.

Is there a better way to do this? Could I, perhaps, write the numpy array in binary to the file to save space? If so, how would I do this so that I can still read it into the C++ program?

Thank you for the help. I appreciate any guidance I can get.

EDIT:

There are a lot of zeros (probably 70% of the values in the numpy array are 0.0000). I am not sure how I can exploit this, though, to generate a tiny file that my C++ program can read in.

1
  • Just a thought - do you have to write it out at all? If the programs are running concurrently (or can be made to run concurrently), you can use any of various methods to stream the data from one to the other: a named pipe, a TCP socket, shared memory, etc.

5 Answers

3

Since you have a lot of zeroes, you could write out only the non-zero elements, in the form (row index, column index, value).

Suppose you have an array with a small number of nonzero elements:

In [5]: a = np.zeros((10, 10))

In [6]: a
Out[6]: 
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [7]: a[3,1] = 2.0

In [8]: a[7,4] = 17.0

In [9]: a[9,0] = 1.5

First, isolate the interesting numbers and their indices:

In [11]: x, y = a.nonzero()

In [12]: list(zip(x, y))
Out[12]: [(3, 1), (7, 4), (9, 0)]

In [13]: nonzero = list(zip(x, y))

Now you only have a small number of data elements left. The easiest thing is to write them to a text file:

In [17]: with open('numbers.txt', 'w+') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write('{:d} {:d} {:g}\n'.format(r, k, a[r,k]))
   ....:         

In [18]: cat numbers.txt
3 1 2
7 4 17
9 0 1.5

This also gives you an opportunity to eyeball the data. In your C++ program you can read this data with fscanf.
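A minimal sketch of that C++ reader, assuming the file is named numbers.txt as above (the printf is only a stand-in for however you store each element):

#include <cstdio>

int main() {
    // Read the "row col value" triples written by the Python loop above.
    std::FILE *f = std::fopen("numbers.txt", "r");
    if (!f) return 1;

    int row, col;
    double value;
    while (std::fscanf(f, "%d %d %lf", &row, &col, &value) == 3) {
        // Stand-in: fill your own dense or sparse structure here instead.
        std::printf("a[%d][%d] = %g\n", row, col, value);
    }

    std::fclose(f);
    return 0;
}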

But you can reduce the size even more by writing binary data using struct:

In [17]: import struct

In [19]: c = struct.Struct('=IId')

In [20]: with open('numbers.bin', 'wb') as outf:
   ....:     for r, k in nonzero:
   ....:         outf.write(c.pack(r, k, a[r,k]))

The format string passed to the Struct constructor means: use the native byte order with standard sizes and no padding ('='); the first and second data elements are unsigned 32-bit integers ('I'), and the third element is a double ('d').

In your C++ program this data is probably best read as binary data into a packed struct.
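A sketch of that reader, assuming the records were packed with '=IId' as above and that both programs run on the same architecture (the file name is again an assumption):

#include <cstdint>
#include <cstdio>
#include <fstream>

// Mirrors struct.Struct('=IId'): two 32-bit unsigned ints followed by a double,
// with no padding between the fields.
#pragma pack(push, 1)
struct Record {
    std::uint32_t row;
    std::uint32_t col;
    double value;
};
#pragma pack(pop)

int main() {
    std::ifstream in("numbers.bin", std::ios::binary);
    Record rec;
    while (in.read(reinterpret_cast<char *>(&rec), sizeof rec)) {
        // Stand-in: insert into your own data structure here instead.
        std::printf("a[%u][%u] = %g\n", rec.row, rec.col, rec.value);
    }
    return 0;
}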

EDIT: Answer updated for a 2D array.


5 Comments

Although the array is 2-dimensional, you can still use a single index to refer to the non-zero elements by considering the flat array.
I totally overlooked that it was a 2D array. Oops.
Is there a way to adjust the code to give me both indices (since it's a 2D array)? Or is adding another loop the only way? Thank you for the very thorough reply.
Fantastic. One last question. Is there a way to specify the number of decimal places to use when writing the values to the file(s)? For example, using np.savetxt, I could specify four decimal places. Is there an easy way to accomplish this?
Yes. Using the format string in the outf.write call, like "{:d} {:d} {:.4f}".format(r, k, a[r,k])
3

Unless you are sure you don't need to worry about endianness and such, it is best to use numpy.savez, as explained in @unutbu's answer and @jorgeca's comment here: numpy's tostring/fromstring --- what do I need to specify to restore the array.

If the resulting size is not small enough, there's always zlib (on Python's side: import zlib; on the C++ side, zlib is a C library you can call directly).
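For instance, the .gz file that numpy.savetxt already produces can be read line by line on the C++ side with zlib's gzFile interface (link with -lz); the file name and the 64 kB line buffer below are assumptions:

#include <cstdlib>
#include <vector>
#include <zlib.h>

int main() {
    gzFile f = gzopen("data.gz", "rb");
    if (!f) return 1;

    const int kLineLen = 65536;  // assumed long enough for one tab-delimited row
    char line[kLineLen];
    while (gzgets(f, line, kLineLen)) {
        // Parse the tab-separated floats of this row.
        std::vector<double> row;
        char *p = line, *end = nullptr;
        for (double v = std::strtod(p, &end); end != p; v = std::strtod(p, &end)) {
            row.push_back(v);
            p = end;
        }
        // ... use row ...
    }

    gzclose(f);
    return 0;
}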

An alternative would be to use the HDF5 format: while it does not necessarily reduce the on-disk file size, it does make saving/loading faster; handling large data arrays is what the format was designed for. There are both Python and C++ readers/writers for HDF5.

2 Comments

I seem to be misunderstanding something here. Using numpy.savez saves a zip of my array, but it is not compressed and thus not any smaller. Is there an advantage to doing that instead of specifying a .gz extension in numpy.savetxt (which compresses the file to ~100kB)? I greatly appreciate the help.
The advantage of save/savez is mainly portability. If you are sure you'll only load your files on the same architecture they were saved on, you probably don't need to bother with these. HDF5, though, is still a better option IMO, unless what you're doing is a throw-away one-timer.
1

numpy.ndarray.tofile and numpy.fromfile are useful for direct binary output/input from Python. std::ostream::write and std::istream::read are useful for binary output/input in C++.

You should be careful about endianness if the data are transferred from one machine to another.
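A sketch of reading such a dump in C++: tofile writes the doubles back to back with no header, so the shape is lost, and the file name and row/column counts below are assumptions that both programs have to agree on (or exchange some other way):

#include <cstddef>
#include <fstream>
#include <vector>

int main() {
    // Must match the shape of the numpy array written with a.tofile('array.bin').
    const std::size_t rows = 1000, cols = 500;
    std::vector<double> data(rows * cols);

    std::ifstream in("array.bin", std::ios::binary);
    in.read(reinterpret_cast<char *>(data.data()),
            static_cast<std::streamsize>(data.size() * sizeof(double)));

    // Element (i, j) of the original 2D array is data[i * cols + j].
    return 0;
}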

5 Comments

Can you explain how to read a file written using ndarray.tofile in C++?
When using ndarray.tofile() my resulting file is actually slightly larger than if I use numpy.savetxt. Is there an argument I might be missing to tell it to output pure binary?
@user1764386 Does the array have special properties? e.g. is it all zeros or something else that has a concise text representation?
There are a lot of zeros (probably 70% of the values in the numpy array are 0.0000). I am not sure how I can exploit this, though, to generate a tiny file that my C++ program can read in.
1

Use an HDF5 file; they are really simple to use through h5py, and you can set a compression flag. Note that HDF5 also has a C++ interface.
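A rough sketch of the C++ side using the HDF5 C++ API (link with -lhdf5_cpp -lhdf5); the file name "data.h5" and the dataset name "array" are assumptions and must match whatever the h5py side created, e.g. f.create_dataset('array', data=a, compression='gzip'):

#include <vector>
#include "H5Cpp.h"

int main() {
    H5::H5File file("data.h5", H5F_ACC_RDONLY);
    H5::DataSet dset = file.openDataSet("array");

    // Query the 2D shape, then read the whole dataset as native doubles.
    hsize_t dims[2];
    dset.getSpace().getSimpleExtentDims(dims, nullptr);

    std::vector<double> data(dims[0] * dims[1]);
    dset.read(data.data(), H5::PredType::NATIVE_DOUBLE);

    // Element (i, j) is data[i * dims[1] + j].
    return 0;
}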


0

If you don't mind installing additional packages (for both Python and C++), you can use BSON (Binary JSON).

1 Comment

BSON files are, even according to the authors, rarely smaller than equivalent JSON files. It has its merits, but saving space is not among them.
