
I have a lot of binary files containing the output of a numerical model. They are flat binary files containing the output as floating-point numbers. The files correspond to a four-dimensional array stored in t-z-y-x order with x varying fastest. The thing is that for a given x, y and z I need the values for all t. The simple solution of reading everything into one large numpy array and taking data[:,z,y,x] works of course, but is not very efficient (I need to read many files).
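For comparison, the simple all-at-once version looks roughly like this (nt, nz, ny and nx are placeholders for the actual dimensions, and float32 is an assumption about the output format):

import numpy as np

# read the whole file and reshape to (t, z, y, x); x varies fastest,
# so the default C order matches the file layout
raw = np.fromfile(my_filename, dtype=np.float32)
data = raw.reshape((nt, nz, ny, nx))

# the time series I actually need for one grid point
series = data[:, z, y, x]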

What I have come up with now is the following (assuming start_index and volume_size hold the correct values):

import array

data = array.array('f')
with open(my_filename, 'rb') as infile:
    for hour in range(amount_of_steps):
        if hour == 0:
            # jump to the first value of interest
            infile.seek(start_index * data.itemsize, 0)
        else:
            # skip forward volume_size values, relative to the current position
            infile.seek(data.itemsize * volume_size, 1)
        data.fromfile(infile, 1)

I don't have to bother about endianness and portability (although the latter of course has some merit). The whole thing runs on Linux and it is highly unlikely it will ever run on anything else. So the question is: is there a way to do this with higher performance? This is done on many files. I tried parallelization but it does not really help. Getting new hardware is not an option, and SSDs even less so because of the amount of data involved. Neither is changing the file format.

2 Answers


Possible options include:

  1. using mmap.

    With this, you map the file into your address space, making its contents accessible as if they were in RAM. The contents are only read in as they are accessed, probably in the normal page size of the OS (4 kiB). A minimal sketch follows below this list.

  2. reading the complete file into memory. This does essentially the same as mmap, but without help from the OS. OTOH, it can be done in one read instead of in 4 kiB steps.
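For the mmap route, a sketch could look like this (start_index, volume_size and amount_of_steps are taken from your question, assuming volume_size is the number of values in one time step; 4-byte little-endian floats are an assumption):

import mmap
import struct

with open(my_filename, 'rb') as infile:
    # map the whole file read-only; pages are only loaded when touched
    mm = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    itemsize = 4  # float32
    values = []
    for hour in range(amount_of_steps):
        offset = (start_index + hour * volume_size) * itemsize
        # unpack a single little-endian float at the computed offset
        values.append(struct.unpack_from('<f', mm, offset)[0])
    mm.close()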

If you have the data in RAM (read from the file in one go), you can use StringIO to emulate a file again and feed array.fromfile() with it.

Having had a second glance at it, you can omit the StringIO step and use array.fromstring() instead.
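A sketch of that variant (fromstring() has been renamed frombytes() in Python 3; the indices reuse start_index and volume_size from your question, again assuming volume_size is the number of values per time step):

import array

with open(my_filename, 'rb') as infile:
    buf = infile.read()   # one read for the whole file

data = array.array('f')
data.frombytes(buf)       # reinterpret the raw bytes as floats

# pick out the time series for the fixed (z, y, x) position
series = [data[start_index + hour * volume_size]
          for hour in range(amount_of_steps)]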

Using only one read (or a few) should normally be faster than repeated infile.seek() and data.fromfile(infile,1) calls, especially if you read only one value per call. (Except maybe if your step size (volume_size) is sufficiently big - skipping several hundred to thousands of bytes - then it COULD be faster to do it your way...)


6 Comments

I haven't got the hang of mmap yet and do not exactly know what StringIO is for. Maybe you could elaborate a bit? Reading the whole file is slower in my experience. The files are not huge (something around 10MB) but I may have to work through thousands of them.
Edited my answer in order to be more precise.
Thank you. I will look into that. The point is, I am skipping about half a megabyte with the infile.seek(), so yes, my step size is very large.
So you have about 20 read() calls, each reading a few bytes. In this case, forget what I wrote - your approach should be faster, then.
Having tried it, it does not make much of a difference, with my approach having a slight edge. Which leads me to the assumption that I am doing more or less the best I can. Thanks for the explanation though, this was interesting.

If I were you, I'd take a look at numpy.memmap. I've used it in the past for a problem similar to yours, with good results.
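Something along these lines (the shape and the float32 dtype are guesses based on the description in the question):

import numpy as np

# memory-map the file as a 4-D array; nothing is read until it is indexed
data = np.memmap(my_filename, dtype=np.float32, mode='r',
                 shape=(nt, nz, ny, nx))

# only the pages containing these values actually get read from disk
series = np.array(data[:, z, y, x])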

1 Comment

Thanks. This does not appear to be faster but produces nicer code.
