
I have to read a binary file containing 1300 images of 320*256 uint16 pixels and convert them to a numpy array. The bytes decoded with struct.unpack look like this: b'\xbb\x17\xb4\x17\xe2\x17\xc3\x17\xd3\x17'. The file is laid out as follows:

Main header / Frame header1 / Frame1 / Frame header2  / Frame2 / etc.

Sorry I can't give you the file.

EDIT: new version of the code (3 GB during manipulation, 1.5 GB in RAM at the end) -- Thanks to Paul

import struct
import numpy as np
import matplotlib.pyplot as plt

filename = 'blabla'
with open(filename, mode="rb") as f:
    # Initialize variables
    width = 320
    height = 256
    frame_nb_octet = width * height * 2  # bytes per frame (2 bytes per pixel)
    count_frame = 1300
    fmt = "<" + "H" * width * height  # little-endian unsigned shorts
    main_header_size = 4000
    frame_header_size = 100
    tab = []

    # Read the whole file at once
    data = f.read()

    # -------------- BEFORE --------------
    # # Convert bytes into int (be careful to skip main/frame headers)
    # for indice in range(count_frame):
    #     ind_start = main_header_size + indice * (frame_header_size + frame_nb_octet) + frame_header_size
    #     ind_end = ind_start + frame_nb_octet
    #     tab.append(struct.unpack(fmt, data[ind_start:ind_end]))
    # images = np.resize(np.array(tab), (count_frame, height, width))
    # ------------------------------------

    # Convert bytes into float (for later mean, etc.), skipping main/frame headers
    # NB: frombuffer's count is in elements, not bytes -> width * height uint16 per frame
    dt = np.dtype(np.uint16).newbyteorder('<')
    array = np.empty((width * height, count_frame), dtype=float)
    for indice in range(count_frame):
        offset = main_header_size + indice * (frame_header_size + frame_nb_octet) + frame_header_size
        array[:, indice] = np.frombuffer(data, dtype=dt, count=width * height, offset=offset)
    array = array.reshape(height, width, count_frame)

    # Plotting first image to verify data
    fig = plt.figure()
    # plt.imshow(np.squeeze(images[0, :, :]))
    plt.imshow(np.squeeze(array[:, :, 0]))
    plt.show()

Performance:

  • Before: 4 GB of RAM and 10 seconds
  • After first edit: 3 GB of RAM during manipulation, 1.5 GB at the end, and 4 seconds

Is there another way to convert my data faster, or with less RAM?

Thank you in advance for your help/advice.

  • You most probably needn't use struct.unpack; try np.frombuffer(buf, dtype) directly on the bytes object. Commented Jan 19, 2018 at 15:50
  • Your solution is faster, yes. I edited my post with the new version. There is still 3 GB in RAM during the reading, so I have to check available memory before reading. Any other ideas? :D Commented Jan 22, 2018 at 9:39

1 Answer


Try a memory map:

dtype = [('headers', np.void, frame_header_size), ('frames', '<u2', (height, width))]
mmap = np.memmap(filename, dtype, offset=main_header_size)
array = mmap['frames']

You can convert it to floating point with .astype if needed.
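For instance, here is a minimal self-contained sketch of that layout. It uses toy sizes instead of the real ones (320x256, 1300 frames, 4000/100-byte headers) so it can be run as-is; `demo.bin` is a throwaway file name, not the questioner's file:

```python
import numpy as np

# Toy sizes for a self-contained demo; swap in the real ones for the actual file.
width, height, count_frame = 4, 3, 2
main_header_size, frame_header_size = 16, 8

# Build a small file with the same layout: main header,
# then (frame header + frame) repeated.
frames_in = np.arange(count_frame * height * width,
                      dtype='<u2').reshape(count_frame, height, width)
with open('demo.bin', 'wb') as f:
    f.write(b'\x00' * main_header_size)
    for frame in frames_in:
        f.write(b'\x00' * frame_header_size)
        f.write(frame.tobytes())

# One record per frame: an opaque header blob plus the pixel block.
dtype = np.dtype([('headers', np.void, frame_header_size),
                  ('frames', '<u2', (height, width))])
mmap = np.memmap('demo.bin', dtype, mode='r', offset=main_header_size)
frames = mmap['frames']            # shape (count_frame, height, width)
first = frames[0].astype(float)    # only now is this frame read from disk
```

The record count is inferred automatically from the file size, so no shape argument is needed.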


Actually, to be less cryptic: the clever thing here is using a "structured array", not so much the memory map. You can read about structured arrays in these numpy docs. The trick then becomes choosing a dtype that exactly matches the format of the data.

We can skip the main header by choosing an offset for the memory map. As an alternative we could have done it like this:

with open(filename, 'rb') as fh:
    fh.seek(main_header_size)  # skip the main header
    data = np.fromfile(fh, our_structured_dtype)

That leaves the frame data and frame headers. Luckily every frame and frame header has the same size, so we can describe them with a structured dtype. We're not really interested in the frame headers so we give them a void dtype of the specified size. For the data itself we have height * width values, for which we use a convenient sub-array format. We use typestring <u2 to specify "little-endian unsigned short", see numpy docs on data types. Now numpy has all info it needs to read the data in exactly the right format.
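As a quick sanity check (a sketch using the sizes from the question), the itemsize of such a record should come out to exactly one frame header plus one frame of pixels:

```python
import numpy as np

width, height = 320, 256
frame_header_size = 100

dtype = np.dtype([('headers', np.void, frame_header_size),
                  ('frames', '<u2', (height, width))])

print(dtype.itemsize)         # 100 + 256 * 320 * 2 = 163940 bytes per record
print(dtype['frames'].shape)  # (256, 320)
```

If the itemsize doesn't match the per-frame stride of the file, the frames will come out shifted, so this check is worth doing before trusting the data.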

Basically, with a structured dtype you can describe the data layout of a numpy array in fine detail. And then, with np.memmap or np.fromfile, you can load data in this format from disk.


7 Comments

Sorry for the delay. The official documentation of numpy memmap doesn't mention this way of doing it. Can you explain a little more what you are doing? In my binary file I have to skip a frame header between each frame, and I don't know how to handle that with your solution. Thanks in advance
I tried your code: 1 ms, 250 MB in RAM. Just: OMG, thank you. Can you explain how it works? Just so I understand and can replicate it on another example later if needed.
I added an explanation. Note that memmap appears to be very fast because it doesn't actually load any data. It just creates a mapping from a location on disk to addresses in RAM. Only when you try to use the data, e.g. for .astype(float), is it pulled from disk transparently.
Thank you very much. Just to be sure, when you say "pulled", does it mean Python makes a copy of the original data?
No, the memory map is not a copy; it's still the same file as the one on disk, just accessed in a special way as if it were RAM. Any change made to the memmap will eventually be written to disk (unless the operating system crashes).
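To make the view-versus-copy distinction concrete, here is a small sketch (`tiny.bin` is a throwaway demo file, not the one from the question):

```python
import numpy as np

# Tiny throwaway file standing in for the real frame file.
np.arange(6, dtype='<u2').tofile('tiny.bin')

mmap = np.memmap('tiny.bin', dtype='<u2', mode='r')
view = mmap[:3]                        # a slice of the map: nothing copied yet
copy = view.astype(float)              # the cast materialises an in-RAM copy

print(np.shares_memory(view, mmap))    # True  - same mapped bytes
print(np.shares_memory(copy, mmap))    # False - independent array in RAM
```

So slicing and indexing stay on the mapped file; only operations that produce a new buffer, like .astype, actually allocate RAM.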
