
I'm trying to do a comparison of some byte values - source A comes from a file that is being 'read':

f = open(fname, "rb")
f_data = f.read()
f.close()

These files can be anything from a few KB to a few MB in size.

Source B is a dictionary of known patterns:

eof_markers = {
    'jpg':b'\xff\xd9',
    'pdf':b'\x25\x25\x45\x4f\x46',
    }

(This list will be extended once the basic process works)

Essentially I'm trying to 'read' the file (source A) and then inspect its tail incrementally, starting with the last byte, for matches against the pattern list: testString = f_data[-counter:]. If no match is found, it should increase counter by 1 and try to match against the list again.

I've tried a number of different ways to get this working. I can get testString to grow correctly, but I keep running into encoding issues where various approaches want to ASCIIify the bytes in order to do the comparison.

I'm a bit lost, and not for the first time. I've been wandering around the code changing int to u to b without getting past issues like d9 being a reserved value, which stops me from using the ASCII-type comparison tools, e.g. if format_type in testString: (this results in a UnicodeDecodeError: 'ascii' codec can't decode byte a9).

I tried to convert everything to an integer, but that threw one of these errors: ValueError: invalid literal for int() with base 2: '.' or ValueError: invalid literal for int() with base 10: '.'. I also tried to convert testString to hex bytes, but kept getting TypeError: hex() argument can't be converted to hex (this is more my lack of understanding than anything else, I'm sure!).

I've found a number of resources that talk about encoding / hex comparisons (e.g. stackoverflow.com/questions/10561923/unicodedecodeerror-ascii-codec-cant-decode-byte-0xef-in-position-1), but I've not found anything that I can either fully understand or that points me down the right path.

It's been a while that I've been stuck on this, so any pointers are gratefully received.

6
  • What version of python are you using? That will help people answer because I think the final solution is going to be a bit different in python 3.x Commented Sep 25, 2012 at 0:09
  • First, are you sure format_type, etc., are all byte strings? As soon as you try to mix bytes and Unicode, you'll get an immediate error if you're lucky, or an undiagnosable problem one step later if you're not. Commented Sep 25, 2012 at 0:12
  • Second, can you give us a complete minimal example that almost works, except that it throws that UnicodeDecodeError when you don't think it should be doing any decoding? Commented Sep 25, 2012 at 0:13
  • Third, there's no hex-encoded data involved here in any of what you've shown us, just raw binary bytes, so I don't see why you expect hex encoding to be relevant, or the hex function to help. What makes you think it's relevant here? Commented Sep 25, 2012 at 0:15
  • All very valid comments. I'm on python 2.7. format_type can be encoded as needed; at the moment they are str. I'll look at putting together a complete example. And finally, on the third point, indeed: I'm at the "try anything" stage, but thank you for the explanation as to why it will fail. Commented Sep 25, 2012 at 0:35

2 Answers 2

1

I'm not sure exactly what you're trying to do, but I ran this code in Python 3.2.3.

#f = open(fname, "rb")
#f_data = f.read()
#f.close()
f_data = b'\x12\x43\xff\xd9\x00\x23'
eof_markers = {
    'jpg':b'\xff\xd9',
    'pdf':b'\x25\x25\x45\x4f\x46',
    }

for counter in range(-4, 0):
  for name, marker in eof_markers.items():
    print(counter, ('' if marker in f_data[counter:] else '!') + name)

I'm using a hardcoded f_data, but you can undo that by uncommenting lines 1-3 and commenting out line 4.

Here's the output:

-4 !pdf
-4 jpg
-3 !pdf
-3 !jpg
-2 !pdf
-2 !jpg
-1 !pdf
-1 !jpg

Is there something this isn't doing that you need to do?
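
If it helps, here is a minimal sketch of the same comparison wrapped in a function that grows the tail one byte at a time, as the question describes. The name find_eof_marker and the max_tail limit are hypothetical, introduced only for illustration; the sample data is the same hardcoded f_data as above.

```python
def find_eof_marker(data, markers, max_tail=1024):
    """Grow the tail of `data` one byte at a time until a marker appears in it.

    Returns (name, counter) for the first match, or None if no marker is
    found within the last `max_tail` bytes. Only bytes are compared with bytes.
    """
    limit = min(len(data), max_tail)
    for counter in range(1, limit + 1):
        tail = data[-counter:]              # last `counter` bytes of the file
        for name, marker in markers.items():
            if marker in tail:
                return name, counter
    return None


eof_markers = {
    'jpg': b'\xff\xd9',
    'pdf': b'\x25\x25\x45\x4f\x46',
}

result = find_eof_marker(b'\x12\x43\xff\xd9\x00\x23', eof_markers)
print(result)  # ('jpg', 4)
```

The keys of the dictionary stay as ordinary strings because they are only used as labels; the values and the file data are never decoded.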


2 Comments

Sigh. Nope. That's right on the money, thank you! Time to sit down and figure out the voodoo that you've got! Thank you for your time and suggestion. (double bonus points for showing me very concisely how to engage with a dictionary properly too!)
The key here is that I'm never converting anything from bytes to str, or mixing up the str values (eof_markers.keys(), '', '!') with the bytes values (f_data and eof_markers.values()). The '' if … else '!' is probably unnecessarily tricky (Guido wouldn't like it); if you find that or anything else hard to figure out, let me know.
0

I can't figure out how to comment on your main post instead of making a subpost. Anyway, I have answers to some of your questions.

  • int(v) converts a formatted number (e.g. '599') to an integer, not a character (e.g. "!") to its integer value. You would want ord() for that. However, I see no reason you would need to use either in this situation.

  • Hex != binary. Hex is just a numeric base. Binary is raw byte values that may not be printable, depending on their value. This is why they show up as escape codes like "\xfd". That's how Python represents unprintable characters to you: as hex codes. However, they are still single characters with no special status, so they don't need conversion. It's perfectly valid to compare 'A' with '\xfd'. Hence, you should be able to do the comparison without any conversion at all.

  • changing 'u' to 'b' will only have any real effect if you're running Python 3.x
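
The points in the list above can be checked with a short sketch. The sample tail value is made up for illustration, and the variable names are not from the question.

```python
# int() parses the text of a number; it does not return a character's code.
parsed = int('599')

# ord() returns the integer value of one character, or of a length-1 byte string.
bang = ord('!')
d9 = ord(b'\xd9')

# Raw bytes compare directly against other raw bytes, with no conversion.
tail = b'\x00\x23\xff\xd9'          # made-up sample data
ends_with_jpg = tail.endswith(b'\xff\xd9')
same_slice = tail[-2:] == b'\xff\xd9'

# An escape such as \x25 is only a spelling; the byte values are identical.
same_bytes = b'\x25\x25\x45\x4f\x46' == b'%%EOF'

print(parsed, bang, d9, ends_with_jpg, same_slice, same_bytes)
# 599 33 217 True True True
```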

As for directly solving the problem, I feel that while it's clear what you want to do, it's not clear why you have chosen to do things in this way. To get a better answer, you will need to ask a clearer question.

Here's an example of an alternative approach:

# convert eof markers to a list of characters
eof_markers = {k: list(v) for k,v in eof_markers.items()}

# assuming that the bytes you have read in are being appended to a list
# (bytes_buffer) by an outer loop that reads from the open file f,
# we can then check for an entire EOF marker like this:

# outer loop reading the next byte, etc., omitted.
for mname, marker in eof_markers.items():
    nmarkerbytes = len(marker)
    enoughbytes = len(bytes_buffer) >= nmarkerbytes
    # compare the last nmarkerbytes entries of the buffer with the marker
    if enoughbytes and bytes_buffer[-nmarkerbytes:] == marker:
        location = f.tell()
        print('%s marker found at %d' % (mname, location))

There are other, faster approaches using bytes or bytearray (for example, using the 'rfind' method), but this is the simplest approach to explain.
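
As a rough sketch of the rfind idea, assuming the whole file has already been read into a single bytes object as in the question: the function name last_marker is hypothetical, and the sample data is made up.

```python
def last_marker(data, markers):
    """Return (name, offset) of the marker that occurs latest in `data`.

    Returns None when no marker occurs at all.
    """
    best = None
    for name, marker in markers.items():
        pos = data.rfind(marker)            # -1 when the marker is absent
        if pos != -1 and (best is None or pos > best[1]):
            best = (name, pos)
    return best


eof_markers = {
    'jpg': b'\xff\xd9',
    'pdf': b'\x25\x25\x45\x4f\x46',
}

f_data = b'\x12\x43\xff\xd9\x00\x23'        # made-up sample data
found = last_marker(f_data, eof_markers)
print(found)  # ('jpg', 2)

# Number of bytes that follow the marker at the end of the data.
name, pos = found
trailing = len(f_data) - (pos + len(eof_markers[name]))
print(trailing)  # 2
```

This searches each marker once from the end of the data instead of re-slicing the tail on every step, which is why it tends to be faster on larger files.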
