Efficiently processing large binary files in Python

Question

I'm currently reading binary files that are 150,000 kb each. They contain roughly 3,000 structured binary messages and I'm trying to figure out the quickest way to process them. Out of each message, I only need to actually read about 30 lines of data. These messages have headers that allow me to jump to specific portions of the message and find the data I need.

I'm trying to figure out whether it's more efficient to unpack the entire message (50 kb each) and pull my data from the resulting tuple, which includes a lot of data I don't need, or to use seek to go to each line of data I need in every message and unpack each of those 30 lines. Alternatively, is this something better suited to mmap?

What do you mean by 30 "lines"? The data is binary, so lines don't make much sense. Can you put that in terms of a percentage of each message? Also, unless the percentage is near 100% or 0%, you'll probably have to profile to get a useful answer.
– bnaecker, Commented Mar 5, 2018 at 21:57
Sorry, you're right, that wasn't clear at all. Thirty 8-byte segments of binary.
– AEvers, Commented Mar 15, 2018 at 12:11
And how are they distributed throughout the message? Are they randomly placed, or all in one region, or something in between?
– bnaecker, Commented Mar 15, 2018 at 14:50
They follow a set structure. Message sizes are consistent within a file but may vary from file to file. My plan had been to read the message headers to determine the size, build a format string to unpack the entire message, and then pull the data from the tuple. Alternatively, I can use the message headers to find out how many bytes I need to skip to reach the part of the message I want to read, and then unpack that single piece of binary data to retrieve the variables.
– AEvers, Commented Mar 16, 2018 at 12:51
I'm just not sure whether skipping through the message to unpack 30 integers will be slower than a single unpack operation that unpacks several hundred integers.
– AEvers, Commented Mar 16, 2018 at 12:53

Davis Herring · Accepted Answer · 2018-10-11 01:12:05Z

Seeking, possibly several times, within just 50 kB is probably not worthwhile: system calls are expensive. Instead, read each message into a single bytes object and use slicing to “seek” to the offsets you need and extract the right amount of data.
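
A minimal sketch of the read-then-slice approach, under assumptions not stated in the question: every field is an 8-byte little-endian signed integer (`"<q"`), and the message size and field offsets are already known from the headers. The name `read_fields` and the demo data are hypothetical.

```python
import io
import struct

# Assumption: each field of interest is an 8-byte little-endian signed integer.
FIELD = struct.Struct("<q")

def read_fields(f, message_size, offsets):
    """Yield one tuple of field values per message read from binary file object f."""
    while True:
        message = f.read(message_size)  # one read call per message, no seeks
        if len(message) < message_size:
            break  # end of file (or a truncated trailing message)
        # Slicing the bytes object plays the role of seeking within the message.
        yield tuple(
            FIELD.unpack(message[off:off + FIELD.size])[0] for off in offsets
        )

# Demo on an in-memory "file" holding two 32-byte messages of four integers each.
demo = io.BytesIO(
    struct.pack("<4q", 10, 20, 30, 40) + struct.pack("<4q", 50, 60, 70, 80)
)
rows = list(read_fields(demo, message_size=32, offsets=[0, 24]))
print(rows)  # [(10, 40), (50, 80)]
```

Each slice copies only 8 bytes here, so the copying cost is small compared with the single read per message.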

It may be beneficial to wrap the bytes object in a memoryview to avoid copying, but for small individual reads it probably doesn’t matter much. If you can use a memoryview, definitely try using mmap, which exposes a similar interface over the whole file. If you’re using struct, its unpack_from can already seek within a bytes object or an mmap without wrapping or copying.
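
As a rough illustration of the mmap route, the sketch below maps a file read-only and calls `unpack_from` at computed offsets, so no message is copied out first. It assumes fixed-size messages packed back to back and 8-byte little-endian signed integer fields; `extract_fields`, the offsets, and the temporary demo file are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Assumption: each field of interest is an 8-byte little-endian signed integer.
FIELD = struct.Struct("<q")

def extract_fields(path, message_size, offsets):
    """Return one tuple of field values per fixed-size message in the file."""
    results = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Step through the mapping one whole message at a time.
        for base in range(0, len(mm) - message_size + 1, message_size):
            # unpack_from reads directly from the mapping at the given offset.
            results.append(
                tuple(FIELD.unpack_from(mm, base + off)[0] for off in offsets)
            )
    return results

# Demo: a temporary file holding two 24-byte messages of three integers each.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack("<6q", 1, 2, 3, 4, 5, 6))
path = tmp.name
try:
    fields = extract_fields(path, message_size=24, offsets=[8, 16])
finally:
    os.remove(path)
print(fields)  # [(2, 3), (5, 6)]
```

Whether this beats reading each message into a bytes object depends on the file layout and the platform, so timing both on real data is the only reliable comparison.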

edited Oct 11, 2018 at 1:12

answered Oct 11, 2018 at 0:32

