I am unpacking large binary files (~1 GB) containing many different datatypes. I am in the early stages of creating the loop to convert each byte. I have been using struct.unpack, but recently thought it would run faster if I utilized numpy. However, switching to numpy has slowed my program down. I have tried:
struct.unpack
np.fromfile
np.frombuffer
np.ndarray
Note: in the np.fromfile method I leave the file open rather than reading it into memory, and seek through it.
1)
with open(file="file_loc", mode='rb') as file:
    RAW = file.read()

data = {}
byte = 0
total = len(RAW)   # don't shadow the built-in len
while byte < total:
    header = struct.unpack(">HHIH", RAW[byte:byte + 10])
    size = header[1]
    loc = str(header[3])
    data[loc] = struct.unpack(">B", RAW[byte + 10:byte + size - 10])
    byte += size
2)
dt = ('>u2,>u2,>u4,>u2')
with open(file="file_loc", mode='rb') as RAW:
    same loop as above, but seeking instead of slicing:
        RAW.seek(byte)
        header = np.fromfile(RAW, dtype=dt, count=1)[0]
        data = np.fromfile(RAW, dtype=">u1", count=size - 10)
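To make sure I'm describing the seek-based method clearly, here is a self-contained sketch of it run against a small synthetic record; the file name, header values, and payload bytes are invented for illustration:

```python
import struct
import numpy as np

# Sketch of the fromfile-plus-seek approach from method 2, run against
# a small synthetic file. The record layout (10-byte >HHIH header
# followed by size - 10 payload bytes) follows the question, but the
# values here are made up.
dt = np.dtype(">u2,>u2,>u4,>u2")  # 2 + 2 + 4 + 2 = 10 bytes

with open("demo.bin", "wb") as f:
    f.write(struct.pack(">HHIH", 1, 16, 0, 7))  # size = 16, loc = 7
    f.write(bytes(range(6)))                    # 6 payload bytes

with open("demo.bin", "rb") as RAW:
    RAW.seek(0)                                  # jump to the record
    header = np.fromfile(RAW, dtype=dt, count=1)[0]
    size = int(header[1])
    data = np.fromfile(RAW, dtype=">u1", count=size - 10)
```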
3)
dt = ('>u2,>u2,>u4,>u2')
with open(file="file_loc", mode='rb') as file:
    RAW = file.read()

same loop:
    header = np.ndarray(buffer=RAW[byte:byte + 10], dtype=dt, shape=1)[0]
    data = np.ndarray(buffer=RAW[byte + 10:byte + size - 10], dtype=">u1", shape=size - 10)
4) Pretty much the same as 3, except using np.frombuffer().
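For completeness, here is what I mean by the frombuffer variant, on a single synthetic record (the header values and payload are invented):

```python
import struct
import numpy as np

# Sketch of variant 4: np.frombuffer over a bytes object already in
# memory. The single record here is synthetic; the payload occupies
# bytes byte + 10 through byte + size.
RAW = struct.pack(">HHIH", 1, 16, 0, 7) + bytes(range(6))
byte = 0

dt = np.dtype(">u2,>u2,>u4,>u2")
header = np.frombuffer(RAW[byte:byte + 10], dtype=dt)[0]
size = int(header[1])
data = np.frombuffer(RAW, dtype=">u1", count=size - 10, offset=byte + 10)
```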
All of the numpy implementations process at about half the speed of the struct.unpack method, which is not what I expected.
Let me know if there is anything I can do to improve performance.
Also, I just typed this from memory, so it might have some errors.
Why do you expect fromfile to be any better? Each fromfile call is processing the same size block as the corresponding struct.unpack. I assume the data blocks are larger than the header ones. The unpacking is as simple as it gets, one byte per element.

With ndarray, the processing is quite fast, at least for whole-array operations that use compiled code, and for various forms of indexing. But creating an array, whether from lists or from a file, isn't necessarily fast. Here, file reading could be as big a time consumer as the processing.

The data blocks are much bigger, with their size defined in the headers. I would expect data = np.frombuffer() to be at least as fast as data = struct.unpack(). The unpack statement should be:

struct.unpack(f'{size-10}B', RAW[byte+10:byte+size])
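A quick way to check that the corrected unpack and the frombuffer version read the same payload bytes, on one synthetic record (header values and payload invented for the test):

```python
import struct
import numpy as np

# One synthetic record: 10-byte >HHIH header plus a 6-byte payload.
size = 16
RAW = struct.pack(">HHIH", 1, size, 0, 7) + bytes(range(size - 10))
byte = 0

# corrected unpack: one 'B' per payload byte, slice ending at byte + size
tup = struct.unpack(f"{size - 10}B", RAW[byte + 10:byte + size])

# frombuffer view over the same bytes, without copying
arr = np.frombuffer(RAW, dtype=">u1", count=size - 10, offset=byte + 10)
```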