I want to count the occurrences of a particular header section in a binary file with Python 2.7.3. I have found plenty of examples that count occurrences line by line in .txt files, but little information on counting byte sequences in binary files.

My thought is to take the ASCII characters that the header bytes represent and use them as the string to search for.

The header section in hex is "28 00 28 00 28 00", or "( ( ( " in ASCII.

I thought the code would be something like this:

total = 0
for line in f:
    if "( ( ( " in line:
        total += 1
f.close()
print "%s" % total 

But it doesn't seem to count even once. If I print line instead, it is 120 characters long.

1 Answer

You have NULL bytes, not spaces. By using '( ( ( ' you are looking for 28 20 28 20 28 20, not 28 00 28 00 28 00.

Use \x00 to create such bytes:

if "(\x00(\x00(\x00" in line:
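To see the difference between the two search strings, you can dump both as hex. This is only a small sketch; the variable names are made up for illustration, and it uses the standard-library `binascii` module so it runs on Python 2.7 as well as Python 3.

```python
import binascii

# The string from the question: '(' followed by a space (0x20)
with_spaces = b"( ( ( "
# The bytes actually in the file: '(' followed by a NULL byte (0x00)
with_nulls = b"(\x00(\x00(\x00"

# hexlify shows the raw byte values of each string
print(binascii.hexlify(with_spaces))
print(binascii.hexlify(with_nulls))

# Going the other way: build the search bytes from the hex shown in a hex editor
from_hex = binascii.unhexlify(b"280028002800")
print(from_hex == with_nulls)
```

A hex editor often renders 0x00 as a blank or a dot, which is why the header looks like it contains spaces.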

Looping over a binary file in lines may not make sense; this would only work if there were actually \n bytes in that file.

You could search through the file in chunks rather than lines:

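# f must have been opened in binary mode: open(..., 'rb')
previous = ''
total = 0
for chunk in iter(lambda: f.read(2048), ''):
    # count every match in this window, not just whether one is present
    total += (previous + chunk).count("(\x00(\x00(\x00")
    previous = chunk[-5:]  # ensure we don't miss matches at boundaries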
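If you want the chunked total to agree exactly with what `data.count(...)` would return on the whole file, the carry-over needs a little more care, because this particular header overlaps with itself (a longer run of `(\x00` pairs contains it at several offsets). The following is one possible sketch, not the only way to do it; `count_pattern` and its parameters are names I made up here, and it is written to run on Python 2.7 and Python 3.

```python
import io

PATTERN = b"(\x00(\x00(\x00"

def count_pattern(f, pattern, chunk_size=2048):
    """Count non-overlapping matches of pattern in binary file object f."""
    n = len(pattern)
    if n == 0:
        raise ValueError("pattern must not be empty")
    total = 0
    buf = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pos = 0
        while True:
            idx = buf.find(pattern, pos)
            if idx == -1:
                break
            total += 1
            pos = idx + n  # resume after this match, like str.count does
        # Keep only bytes that could still start a match completed by the
        # next chunk: at most the last n - 1 bytes, and nothing that was
        # already consumed by a counted match.
        buf = buf[max(pos, len(buf) - (n - 1)):]
    return total

# Small in-memory demo with a deliberately tiny chunk size, so that
# both headers are split across chunk boundaries.
sample = PATTERN + b"\x01" + PATTERN
print(count_pattern(io.BytesIO(sample), PATTERN, chunk_size=4))  # prints 2
```

For a real file you would pass an object opened with `open(path, 'rb')` instead of the `io.BytesIO` used in the demo.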

11 Comments

Thanks for that, rookie mistake. With the updated if statement the total count is still 0. Would it be better to use bytes rather than "lines" in the for statement?
@Python_newbie: so are you 100% certain those byte sequences are there? For binary files, I'd read in chunks (and take the last 5 bytes from the preceding chunk along for the next test, to ensure you didn't miss a partial match).
Yes, they sure are. I can find every header instance with the hex editor's "Find selection" search. At a guess there are 1000 x 3 different types of headers, which is why I want a script to count them and print a confirmed total. Reading in chunks won't work as the metadata can vary in length; that's why searching for the header byte sequence is the best option afaik.
@Python_newbie: and the header doesn't contain any length information then?
I got it to work in the end. I opened the file in 'rb' mode, read the entire file into a variable "data", and then called data.count("(\x00(\x00(\x00"), which returned 1363.
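For anyone landing here later, the approach described in that last comment can be sketched roughly as follows. The function name and the temporary demo file are my own placeholders, not the asker's actual code, and reading the whole file at once assumes it fits comfortably in memory. It runs on Python 2.7 and Python 3.

```python
import os
import tempfile

HEADER = b"(\x00(\x00(\x00"

def count_header(path, pattern=HEADER):
    """Read the whole file in binary mode and count pattern occurrences."""
    with open(path, "rb") as f:  # 'rb' avoids any newline translation
        data = f.read()
    # count() reports non-overlapping occurrences
    return data.count(pattern)

# Demo on a throwaway file containing two headers
fd, demo_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(b"junk" + HEADER + b"\x01\x02" + HEADER + b"tail")
    print(count_header(demo_path))  # prints 2
finally:
    os.remove(demo_path)
```

Note that `count` only reports non-overlapping matches, so a run of more than three `(\x00` pairs in a row still counts as a single header per full six bytes.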