
I am unpacking large binary files (~1 GB) with many different datatypes. I am in the early stages of writing the loop to convert each byte. I have been using struct.unpack, but recently thought it would run faster if I utilized numpy. However, switching to numpy has slowed down my program. I have tried:

struct.unpack
np.fromfile
np.frombuffer
np.ndarray

Note: in the np.fromfile method I leave the file open rather than loading it into memory, and I seek through it.

1)

import struct

data = {}
with open(file="file_loc", mode='rb') as file:
    RAW = file.read()
byte = 0
length = len(RAW)
while byte < length:
    header = struct.unpack(">HHIH", RAW[byte:byte + 10])   # 2+2+4+2 = 10-byte header
    size = header[1]
    loc = str(header[3])
    data[loc] = struct.unpack(">B", RAW[byte + 10:byte + size - 10])
    byte += size

2)

import numpy as np

dt = np.dtype('>u2,>u2,>u4,>u2')
with open(file="file_loc", mode='rb') as RAW:
    # same loop as above, but seeking through the open file:
    header = np.fromfile(RAW, dtype=dt, count=1)[0]
    data   = np.fromfile(RAW, dtype=">u1", count=size - 10)
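A fuller sketch of what I mean by seeking through the open file in method 2 (not my exact code, so treat the field indexing as approximate); each np.fromfile call advances the file position by count * itemsize:

dt = np.dtype('>u2,>u2,>u4,>u2')        # 10-byte header
data = {}
with open(file="file_loc", mode='rb') as RAW:
    while True:
        header = np.fromfile(RAW, dtype=dt, count=1)
        if header.size == 0:            # end of file
            break
        size = int(header[0][1])        # total record size, including the header
        loc = str(header[0][3])
        data[loc] = np.fromfile(RAW, dtype=">u1", count=size - 10)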

3)

dt = np.dtype('>u2,>u2,>u4,>u2')
with open(file="file_loc", mode='rb') as file:
    RAW = file.read()
# same loop as above:
header = np.ndarray(buffer=RAW[byte:byte + 10], dtype=dt, shape=(1,))[0]
data   = np.ndarray(buffer=RAW[byte + 10:byte + size - 10], dtype=">u1", shape=(size - 10,))

4) pretty much the same as 3 except using np.frombuffer()
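For completeness, np.frombuffer also accepts count and offset arguments, so the same loop body could be written without building the intermediate slices at all (a sketch, untested):

header = np.frombuffer(RAW, dtype=dt, count=1, offset=byte)[0]
data   = np.frombuffer(RAW, dtype=">u1", count=size - 10, offset=byte + 10)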

All of the numpy implementations process at about half the speed of the struct.unpack method, which is not what I expected.

Let me know if there is anything I can do to improve performance.

Also, I just typed this from memory, so it might have some errors.

  • Why do you expect the fromfile to be any better? Each fromfile call is processing the same size block as the corresponding struct.unpack. I assume the data blocks are larger than the header ones. The unpacking is as simple as it gets, one byte per element. Commented Feb 13, 2019 at 22:01
  • In general, once you have created an ndarray, the processing is quite fast, at least for whole-array operations that use compiled code and the various forms of indexing. But creating an array, whether from lists or from a file, isn't necessarily fast. Here, file reading could be as big a time consumer as the processing. Commented Feb 13, 2019 at 22:06
  • Yes, the data blocks are much bigger, and their size is defined in the headers. I would expect data = np.frombuffer() to be at least as fast as data = struct.unpack. Commented Feb 13, 2019 at 22:26
  • I think the unpack statement should be: struct.unpack(f'{size-10}B', RAW[byte+10:byte+size]) Commented Feb 14, 2019 at 1:36

1 Answer


I haven't used struct much, but between your code and the docs I got it working on a buffer that stores an array of integers.

Make a byte array/string from a numpy array.

In [81]: arr = np.arange(1000)
In [82]: barr = arr.tobytes()
In [83]: type(barr)
Out[83]: bytes
In [84]: len(barr)
Out[84]: 8000

The reverse is frombuffer:

In [85]: x = np.frombuffer(barr, dtype=int)
In [86]: x[:10]
Out[86]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [87]: np.allclose(x,arr)
Out[87]: True
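frombuffer also takes count and offset, which could be used to pull a block out of the middle of the buffer without slicing first (a quick sketch; the values assume the 8-byte int items behind the 8000-byte buffer above):

x = np.frombuffer(barr, dtype=int, count=5, offset=10 * arr.itemsize)
# -> array([10, 11, 12, 13, 14])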

ndarray also works, though the direct use of this constructor is usually discouraged:

In [88]: x = np.ndarray(buffer=barr, dtype=int, shape=(1000,))
In [89]: np.allclose(x,arr)
Out[89]: True

To use struct I need to create a format string that includes the length, '1000l' (1000 longs):

In [90]: tup = struct.unpack('1000l', barr)
In [91]: len(tup)
Out[91]: 1000
In [92]: tup[:10]
Out[92]: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
In [93]: np.allclose(np.array(tup),arr)
Out[93]: True
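struct also has unpack_from, which reads at an offset into an existing buffer instead of requiring a slice (a sketch, with the same caveat that I haven't used struct much; '5l' assumes the same 8-byte native long as '1000l' above):

tup = struct.unpack_from('5l', barr, offset=10 * arr.itemsize)
# tup == (10, 11, 12, 13, 14)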

So now that we've established equivalent methods of reading the buffer, do some timings:

In [94]: timeit x = np.frombuffer(barr, dtype=int)
617 ns ± 0.806 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [95]: timeit x = np.ndarray(buffer=barr, dtype=int, shape=(1000,))
1.11 µs ± 1.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [96]: timeit tup = struct.unpack('1000l', barr)
19 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [97]: timeit tup = np.array(struct.unpack('1000l', barr))
87.5 µs ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

frombuffer looks pretty good.

Your struct.unpack loop confuses me. I don't think it's doing the same thing as the frombuffer. But as I said at the start, I haven't used struct much.
