
I'm parsing a binary file format (OpenType font file). The format is a complex tree of many different struct types, but one recurring pattern is to have an array of records of a particular format. I've written code using struct.unpack to get one record at a time. But I'm wondering if there's a way I'm missing to parse the entire array of records?

The following is an example of unpacked results for one particular kind of record array:

[{'glyphID': 288, 'paletteIndex': 0}, {'glyphID': 289, 'paletteIndex': 1}, {'glyphID': 518, 'paletteIndex': 0}, ...]

This is what I'm doing at present: I've created a generic function to unpack an arbitrary records array (consistent record format in any given call).

def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
    recordLength = struct.calcsize(format)
    array = []
    index = 0
    for i in range(numRecords):
        record = {}
        vals = struct.unpack(format, buffer[index : index + recordLength])
        for k, v in zip(fieldNames, vals):
            record[k] = v
        array.append(record)
        index += recordLength

    return array

The buffer parameter is a byte sequence at least the size of the array, with the first record to be unpacked at the start of the sequence.

The format parameter is a struct format string, according to the type of record array being read. In one case, the format string might be ">3H"; in another case, it might be ">4s2H"; etc. For the above example of results, it was ">2H".

The fieldNames parameter is a sequence of strings for the field names in the given record type. For the above example of results, this was ("glyphID", "paletteIndex").

So, I'm stepping through the buffer (byte sequence data), getting sequential slices and unpacking the records one at a time, creating a dict for each record and appending them to the array list.
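
For instance, with the function above and some made-up record data (the bytes here are packed purely for illustration), a call might look like this:

```python
import struct

def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
    recordLength = struct.calcsize(format)
    array = []
    index = 0
    for i in range(numRecords):
        record = {}
        vals = struct.unpack(format, buffer[index : index + recordLength])
        for k, v in zip(fieldNames, vals):
            record[k] = v
        array.append(record)
        index += recordLength
    return array

# Two records of ">2H" (glyphID, paletteIndex), packed here just for the demo
buffer = struct.pack(">2H2H", 288, 0, 289, 1)
records = tryReadRecordsArrayFromBuffer(buffer, 2, ">2H", ("glyphID", "paletteIndex"))
# records == [{'glyphID': 288, 'paletteIndex': 0}, {'glyphID': 289, 'paletteIndex': 1}]
```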

Is there a better way to do this, a method like unpack in some module that allows defining a format as an array of structs and unpacking the whole shebang at once?

2 Comments
  • There isn't, but unpack() seems like a pretty good solution. Just a small suggestion: array.append(dict(zip(fieldNames, vals))). Initializing the empty dict, looping, and appending the new value to the list can be done in one line. Commented Jun 10, 2020 at 4:09
  • I don't know of a way to do it in one call, and it would need to loop anyway. There are tools like Boost or even Cython that understand C structures, but they require heavy lifting themselves. A routine that does it all in one shot is the one you're showing here. Commented Jun 10, 2020 at 4:10

1 Answer


Take a look at Kaitai Struct - https://kaitai.io/, a library for parsing binary files across multiple languages, with a schema language for defining the file format in a language-independent way.

It can express conditions inside the file format and adapt the parsing accordingly. The learning curve isn't trivial, but it's not too steep either.


Assuming you want to do it yourself and not use an external library, there are a few things that can improve the performance and the code:

  1. Use struct.unpack_from(format, buffer, offset) instead of slicing: buffer[index : index + recordLength] creates a new bytes object on every iteration, copying memory unnecessarily.
  2. Since you are unpacking an array of records that all share one format, you can go further with struct.iter_unpack(format, buffer) and iterate over the results:

    import itertools
    import struct
    
    def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
        unpack_iter = struct.iter_unpack(format, buffer)
        return [
            # I like this better than dict(zip(...)) but you can also do that
            {k: v for k, v in zip(fieldNames, vals)}
            # We use `islice` to only take the first numRecords values
            for vals in itertools.islice(unpack_iter, numRecords)
        ]
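
The first suggestion can be sketched in a similar way. Here is the question's function rewritten with unpack_from, keeping the same signature (a minimal sketch, with demo data packed just for illustration):

```python
import struct

def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
    recordLength = struct.calcsize(format)
    array = []
    for i in range(numRecords):
        # Read at an offset into the buffer; no intermediate slice is created
        vals = struct.unpack_from(format, buffer, i * recordLength)
        array.append(dict(zip(fieldNames, vals)))
    return array

records = tryReadRecordsArrayFromBuffer(
    struct.pack(">2H2H", 288, 0, 289, 1), 2, ">2H", ("glyphID", "paletteIndex"))
```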
    

4 Comments

Thanks for the suggestion. Kaitai looks like it could be very useful, and I might look into it at some point. But my immediate purpose is learning Python (the parser is a learning project). I don't want to shift to learning Kaitai just yet. Also, it's an answer to a higher level question: What modules can I use to parse a binary file? It might not be the best choice in some situations (e.g., it might be overkill). My question was narrower: given that I'm writing my own binary parser, what's the best way to parse an array of records?
@PeterConstable - Added a solution which utilizes the struct module better. Not tested but it should work :)
Thanks! I had considered that repeated slicing is repeated copying. I had wondered if a memoryview would avoid that. I wasn't aware of struct.unpack_from, which seems like it would also avoid that issue. And I also wasn't aware of struct.iter_unpack, which seems like the very thing I was looking for. Two things I've learned about :-) Could you clarify two things for me: Do I understand correctly that the use for itertools.islice would be to allow selecting a slice from the array, but that it isn't needed if buffer is exactly the length of the array? Also, why do you prefer the list comprehension?
@PeterConstable - itertools.islice allows you to get the elements at a specific range of an iterable, and I used it to limit iteration on iter_unpack to no more than numRecords values. Indeed, if you only ever operate on a buffer of size numRecords * recordLength then you won't need it :) Regarding the list comprehension, I only used it because the produced code is shorter, not for any technical reason.
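
As an aside on the memoryview point raised above: slicing a memoryview does avoid the copy that slicing a bytes object makes, and struct.unpack accepts a view directly. A small sketch (the data here is made up for illustration):

```python
import struct

data = struct.pack(">2H2H", 288, 0, 289, 1)
view = memoryview(data)
# Slicing a memoryview yields another view over the same bytes; no copy is made
first = struct.unpack(">2H", view[0:4])
second = struct.unpack(">2H", view[4:8])
# first == (288, 0), second == (289, 1)
```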
