
I'm parsing a binary file format (OpenType font file). The format is a complex tree of many different struct types, but one recurring pattern is to have an array of records of a particular format. I've written code using struct.unpack to get one record at a time. But I'm wondering if there's a way I'm missing to parse the entire array of records?

The following is an example of unpacked results for one particular kind of record array:

[{'glyphID': 288, 'paletteIndex': 0}, {'glyphID': 289, 'paletteIndex': 1}, {'glyphID': 518, 'paletteIndex': 0}, ...]

This is what I'm doing at present: I've created a generic function to unpack an arbitrary records array (consistent record format in any given call).

def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
    recordLength = struct.calcsize(format)
    array = []
    index = 0
    for i in range(numRecords):
        record = {}
        vals = struct.unpack(format, buffer[index : index + recordLength])
        for k, v in zip(fieldNames, vals):
            record[k] = v
        array.append(record)
        index += recordLength

    return array

The buffer parameter is a byte sequence at least the size of the array, with the first record to be unpacked at the start of the sequence.

The format parameter is a struct format string, according to the type of record array being read. In one case, the format string might be ">3H"; in another case, it might be ">4s2H"; etc. For the above example of results, it was ">2H".

The fieldNames parameter is a sequence of strings for the field names in the given record type. For the above example of results, this was ("glyphID", "paletteIndex").

So, I'm stepping through the buffer (byte sequence data), getting sequential slices and unpacking the records one at a time, creating a dict for each record and appending them to the array list.
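
For instance, with the function above and some made-up record data (the bytes here are packed purely for illustration), a call might look like this:

```python
import struct

def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
    recordLength = struct.calcsize(format)
    array = []
    index = 0
    for i in range(numRecords):
        record = {}
        vals = struct.unpack(format, buffer[index : index + recordLength])
        for k, v in zip(fieldNames, vals):
            record[k] = v
        array.append(record)
        index += recordLength
    return array

# Two records of ">2H" (glyphID, paletteIndex), packed here just for the demo
buffer = struct.pack(">2H2H", 288, 0, 289, 1)
records = tryReadRecordsArrayFromBuffer(buffer, 2, ">2H", ("glyphID", "paletteIndex"))
# records == [{'glyphID': 288, 'paletteIndex': 0}, {'glyphID': 289, 'paletteIndex': 1}]
```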

Is there a better way to do this, a method like unpack in some module that allows defining a format as an array of structs and unpacking the whole shebang at once?

2 Comments
  • There isn't, but unpack() seems like a pretty good solution. Just a small suggestion: array.append(dict(zip(fieldNames, vals))). Initializing the empty dict, looping, and appending the new value to the list can be done in one line. Commented Jun 10, 2020 at 4:09
  • I don't know of a way to do it in one call, and it would need to loop anyway. There are tools like Boost or even Cython that understand C structures, but they require heavy lifting themselves. A routine that does it all in one shot is the one you're showing here. Commented Jun 10, 2020 at 4:10

1 Answer


Take a look at Kaitai Struct - https://kaitai.io/, a library for parsing binary files across multiple languages, with a schema language for defining the file format in a language-independent way.

It can express conditions inside the file format and adapt the parsing accordingly. The learning curve isn't trivial, but it's not too steep either.


Assuming you want to do it yourself and not use an external library, there are a few things that can improve the performance and the code:

  1. Use struct.unpack_from(format, buffer, offset) instead of slicing: buffer[index : index + recordLength] creates a new bytes object on every iteration, copying memory unnecessarily.
  2. Since you are unpacking an array of records that all share one format, you can go further with struct.iter_unpack(format, buffer) and iterate over the results:

    import itertools
    import struct
    
    def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
        unpack_iter = struct.iter_unpack(format, buffer)
        return [
            # I like this better than dict(zip(...)) but you can also do that
            {k: v for k, v in zip(fieldNames, vals)}
            # We use `islice` to only take the first numRecords values
            for vals in itertools.islice(unpack_iter, numRecords)
        ]
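
The first suggestion can be sketched in a similar way. Here is the question's function rewritten with unpack_from, keeping the same signature (a minimal sketch, with demo data packed just for illustration):

```python
import struct

def tryReadRecordsArrayFromBuffer(buffer, numRecords, format, fieldNames):
    recordLength = struct.calcsize(format)
    array = []
    for i in range(numRecords):
        # Read at an offset into the buffer; no intermediate slice is created
        vals = struct.unpack_from(format, buffer, i * recordLength)
        array.append(dict(zip(fieldNames, vals)))
    return array

records = tryReadRecordsArrayFromBuffer(
    struct.pack(">2H2H", 288, 0, 289, 1), 2, ">2H", ("glyphID", "paletteIndex"))
```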
    

4 Comments

Thanks for the suggestion. Kaitai looks like it could be very useful, and I might look into it at some point. But my immediate purpose is learning Python (the parser is a learning project). I don't want to shift to learning Kaitai just yet. Also, it's an answer to a higher level question: What modules can I use to parse a binary file? It might not be the best choice in some situations (e.g., it might be overkill). My question was narrower: given that I'm writing my own binary parser, what's the best way to parse an array of records?
@PeterConstable - Added a solution which utilizes the struct module better. Not tested but it should work :)
Thanks! I had considered that repeated slicing is repeated copying. I had wondered if a memoryview would avoid that. I wasn't aware of struct.unpack_from, which seems like it would also avoid that issue. And I also wasn't aware of struct.iter_unpack, which seems like the very thing I was looking for. Two things I've learned about :-) Could you clarify two things for me: Do I understand correctly that the use for itertools.islice would be to allow selecting a slice from the array, but that it isn't needed if buffer is exactly the length of the array? Also, why do you prefer the list comprehension?
@PeterConstable - itertools.islice allows you to get the elements at a specific range of an iterable, and I used it to limit iteration on iter_unpack to no more than numRecords values. Indeed, if you only ever operate on a buffer of size numRecords * recordLength then you won't need it :) Regarding the list comprehension, I only used it because the produced code is shorter, not for any technical reason.
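
As an aside on the memoryview point raised above: slicing a memoryview does avoid the copy that slicing a bytes object makes, and struct.unpack accepts a view directly. A small sketch (the data here is made up for illustration):

```python
import struct

data = struct.pack(">2H2H", 288, 0, 289, 1)
view = memoryview(data)
# Slicing a memoryview yields another view over the same bytes; no copy is made
first = struct.unpack(">2H", view[0:4])
second = struct.unpack(">2H", view[4:8])
# first == (288, 0), second == (289, 1)
```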
