
I need my code to work on both Linux and Windows. I have a binary file whose text header contains Date and Time information which I'd like to extract. An example of the extracted part (i.e. how the information is saved in the text header) is in the commented part of the code. The entire code is written in Python, so I'd like to have this extraction also done in Python. On Linux, I'd simply use subprocess and grep (ref):

import subprocess
hosts = subprocess.check_output("grep -E -a 'Date' /path/Bckgrnd.bip", shell=True)
sentence = hosts.decode('utf-8')
# '----------------------------  Date:09/09/2020   Time:11:26:19  ----------------------------\n  Capture Time/Date:\t11:26:17 on 09/09/2020\n----------------------------  Date:09/09/2020   Time:11:26:19  ----------------------------\n'

date = sentence[sentence.index('Date:')+5:sentence.index('Date:')+13]
time = sentence[sentence.index('Time:')+5:sentence.index('Time:')+13]
print(date, time)
# 09/09/20 11:26:19

The problem is that this is going to fail on Windows. An alternative is to load the file in Python:

file_input = '/path/Bckgrnd.bip'
with open(file_input, 'rb') as f:
    s = f.read()
print(s.find(b'Date'))
# 498
date = s[s.find(b'Date')+5:s.find(b'Date')+13].decode('utf-8')
time = s[s.find(b'Time')+5:s.find(b'Time')+13].decode('utf-8')
print(date, time)

That has one main issue: it has to read the entire file into memory, and if the file is large, that is a problem. Is there a way to get around the OS issue with grep? Is there an alternative to it in pure Python that doesn't load the entire binary?

Update: Regarding speed, I believe grep is faster than pure Python, so using it would make the solution better not only memory-wise but also speed-wise.

Notice that even grep is treating the file as binary (hence the -a flag, as mentioned e.g. here).

1 Answer


You're going to have to search the entire file regardless; even grep does that. However, you don't have to load the entire file into memory: you can search it one line at a time.

file_input = '/path/Bckgrnd.bip'
with open(file_input, 'rb') as f:
    for line in f:  # iterate lazily instead of f.readlines(), which would load the whole file
        if b'Date' in line:
            s = line
            date = s[s.find(b'Date')+5:s.find(b'Date')+13].decode('utf-8')
            time = s[s.find(b'Time')+5:s.find(b'Time')+13].decode('utf-8')
            print(date, time)
            break  # Only break here if you expect exactly one match

You might also be able to improve your date and time extraction with datetime.strptime, but I'm not sure exactly which format you're working with, so I didn't spend any time trying to do that.
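As a rough sketch (not from the original answer), parsing the strings from the header shown in the question could look like this; the day/month order is an assumption, so swap the format codes if the header is month-first:

from datetime import datetime

# Example header line taken from the question
line = b'----------------------------  Date:09/09/2020   Time:11:26:19  ----------'
raw_date = line[line.find(b'Date:')+5:line.find(b'Date:')+15].decode('utf-8')  # '09/09/2020'
raw_time = line[line.find(b'Time:')+5:line.find(b'Time:')+13].decode('utf-8')  # '11:26:19'
# '%d/%m/%Y' is an assumption about day/month order; use '%m/%d/%Y' if it is month-first
parsed = datetime.strptime(raw_date + ' ' + raw_time, '%d/%m/%Y %H:%M:%S')
print(parsed)  # 2020-09-09 11:26:19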

You say the file is binary, but you're decoding it as UTF-8, which makes me think it's text. Using grep also suggests text.

If it really is binary and there aren't many line breaks, then you can read the file one byte at a time.

file_input = '/path/Bckgrnd.bip'
buffer = bytes()
with open(file_input, 'rb') as f:
    while True:
        byte = f.read(1)
        if not byte:  # end of file
            break
        buffer = buffer[-3:] + byte  # keep a sliding 4-byte window
        if buffer == b'Date':
            # Read the next set of however many bytes you need to interpret the date and time
            break

Final point, this isn't going to make your program faster, but it will reduce your memory usage.
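Not part of the original answer, but one middle ground is to read the file in fixed-size chunks and keep a small overlap between chunks so the marker can't be split across a boundary. This is only a sketch under the question's assumptions (the path and the 'Date:' layout); it keeps memory bounded without the per-byte overhead of single-byte reads:

file_input = '/path/Bckgrnd.bip'  # path taken from the question
marker = b'Date'
chunk_size = 64 * 1024            # 64 KiB per read; tune as needed
tail = b''                        # carry-over so a marker split across chunks is still found
with open(file_input, 'rb') as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        i = window.find(marker)
        if i != -1:
            # Slice the date out the same way as above; if the header happens to end
            # exactly at the chunk boundary, read a little more before slicing.
            date = window[i+5:i+13].decode('utf-8')
            print(date)
            break
        tail = window[-(len(marker) - 1):]  # keep the last 3 bytes as overlap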


Comments

Good points, I will clarify the question, thanks. And yes, I meant memory, not speed :). Although I believe grep is faster at this than pure Python...
Grep will be faster for long files, but for short files it might not be, due to how long it takes to start the process. If you are working with files large enough for it to matter, you can find a grep clone for Windows. You could include that binary in your distribution if you really needed to. My advice would be to do some benchmarking. You might find that the pure Python solution is good enough. If not, then you are at least assured that your extra work is justified.
I have written them into functions and run a quick %timeit test; the result surprises me. My original pure Python solution is faster: 2.43 ms ± 67.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) versus yours: 17 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each). I have removed the print call; with it, it is impossibly slow. Any insights?
Which solution? The lines or bytes? And how big is your input file? It's hard for me to have a lot of certainty without seeing the data you're parsing.
It's hard for me to independently investigate without being able to run it, but I can guess. My line solution performs a search (in) for each line, and then replicates your find calls. Loading the file in all at once prevents the extra searches, and you probably pay for it in start up costs. Better to search 1M once than 1K 1000 times. For only a few MB files, I wouldn't stress about loading it all into memory, that's not a lot for modern systems. Reducing memory usually comes at a speed cost, but it seems like you don't need to pay it.
