
I have a lot of binary files containing the output of a numerical model. They are flat binary files containing the output as floating-point numbers. The files correspond to a four-dimensional array stored in t-z-y-x order with x varying fastest. The thing is that for a given x, y and z I need the values for all t. The simple solution of reading everything into one large numpy array and taking data[:,z,y,x] works of course, but is not very efficient (I need to read many files).
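For comparison, the simple all-at-once version looks roughly like this (nt, nz, ny and nx are placeholders for the actual dimensions, and float32 is an assumption about the output format):

import numpy as np

# read the whole file and reshape to (t, z, y, x); x varies fastest,
# so the default C order matches the file layout
raw = np.fromfile(my_filename, dtype=np.float32)
data = raw.reshape((nt, nz, ny, nx))

# the time series I actually need for one grid point
series = data[:, z, y, x]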

What I have come up with now is the following (assuming start_index and volume_size hold the correct values):

import array

data = array.array('f')
with open(my_filename, 'rb') as infile:
    for hour in range(amount_of_steps):
        if hour == 0:
            # jump to the first value of interest
            infile.seek(start_index * data.itemsize, 0)
        else:
            # skip forward volume_size values, relative to the current position
            infile.seek(data.itemsize * volume_size, 1)
        data.fromfile(infile, 1)

I don't have to bother about endianness and portability (although the latter of course has some merit). The whole thing runs on Linux and it is highly unlikely it will ever run on anything else. So the question is: is there a way to do this with higher performance? This is done on many files. I tried parallelization but it does not really help. Getting new hardware is not an option, and SSDs even less so because of the amount of data involved. Neither is changing the file format.

2 Answers


Possible options include:

  1. using mmap.

    With this, you map the file into your address space, making its contents accessible as if they were in RAM. The contents are only read in as they are accessed, probably in the normal page size of the OS (4 kiB). A minimal sketch follows below this list.

  2. reading the complete file into memory. This does essentially the same as mmap, but without help from the OS. OTOH, it can be done in one read instead of in 4 kiB steps.
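For the mmap route, a sketch could look like this (start_index, volume_size and amount_of_steps are taken from your question, assuming volume_size is the number of values in one time step; 4-byte little-endian floats are an assumption):

import mmap
import struct

with open(my_filename, 'rb') as infile:
    # map the whole file read-only; pages are only loaded when touched
    mm = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    itemsize = 4  # float32
    values = []
    for hour in range(amount_of_steps):
        offset = (start_index + hour * volume_size) * itemsize
        # unpack a single little-endian float at the computed offset
        values.append(struct.unpack_from('<f', mm, offset)[0])
    mm.close()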

If you have the data in RAM (read from the file in one go), you can use StringIO to emulate a file again and feed array.fromfile() with it.

Having had a second glance at it, you can omit the StringIO step and use array.fromstring() instead.
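A sketch of that variant (fromstring() has been renamed frombytes() in Python 3; the indices reuse start_index and volume_size from your question, again assuming volume_size is the number of values per time step):

import array

with open(my_filename, 'rb') as infile:
    buf = infile.read()   # one read for the whole file

data = array.array('f')
data.frombytes(buf)       # reinterpret the raw bytes as floats

# pick out the time series for the fixed (z, y, x) position
series = [data[start_index + hour * volume_size]
          for hour in range(amount_of_steps)]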

Using only one read (or a few) should normally be faster than repeated infile.seek() and data.fromfile(infile,1) calls, especially if you read only one value per call. (Except maybe if your step size (volume_size) is sufficiently big - skipping several hundred to thousands of bytes - then it COULD be faster to do it your way...)


6 Comments

I haven't got the hang of mmap yet and do not exactly know what StringIO is for. Maybe you could elaborate a bit? Reading the whole file is slower in my experience. The files are not huge (something around 10MB) but I may have to work through thousands of them.
Edited my answer in order to be more precise.
Thank you. I will look into that. The point is, I am skipping about half a megabyte with the infile.seek(), so yes, my step size is very large.
So you have about 20 read() calls, each reading a few bytes. In this case, forget what I wrote - your approach should be faster, then.
Having tried it, it does not make much of a difference, with my approach having a slight edge. Which leads me to the assumption that I am doing more or less the best I can. Thanks for the explanation though, this was interesting.

If I were you, I'd take a look at numpy.memmap. I've used it in the past for a problem similar to yours, with good results.
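Something along these lines (the shape and the float32 dtype are guesses based on the description in the question):

import numpy as np

# memory-map the file as a 4-D array; nothing is read until it is indexed
data = np.memmap(my_filename, dtype=np.float32, mode='r',
                 shape=(nt, nz, ny, nx))

# only the pages containing these values actually get read from disk
series = np.array(data[:, z, y, x])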

1 Comment

Thanks. This does not appear to be faster but produces nicer code.
