Python re search on binary strings

Question

I am trying to use Python's re module to find binary substrings, but I get a somewhat puzzling error.

Here is a small example to demonstrate the issue (Python 3):

import re

memory = b"\x07\x00\x42\x13"

query1 = (7).to_bytes(1, byteorder="little", signed=False)
query2 = (42).to_bytes(1, byteorder="little", signed=False)

# Works
for match in re.finditer(query1, memory):
    print(match.group(0))

# Causes error
for match in re.finditer(query2, memory):
    print(match.group(0))

The first loop correctly prints b'\x07' while the second gives the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.7/re.py", line 230, in finditer
    return _compile(pattern, flags).finditer(string)
  File "/usr/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 651, in _parse
    source.tell() - here + len(this))
re.error: nothing to repeat at position

For context, I am trying to find specific integers within the memory space of a program, in a similar fashion to tools like Cheat Engine. This is being done using Python scripts within GDB.

-- Note 1 --

I have a suspicion that this may be related to the fact that 42 has a printable ASCII representation (*) while 7 does not. For example, if you print the query strings you get:

>>> print(query1)
b'\x07'
>>> print(query2)
b'*'

-- Note 2 --

Actually, it looks like this is unrelated to whether the byte is printable in ASCII. If you run:

import re

memory = b"\x07\x00\x42\x13"

for i in range(256):  # cover every possible byte value, 0..255
    query = i.to_bytes(1, byteorder="little", signed=False)

    try:
        for match in re.finditer(query, memory):
            pass
    except re.error:  # only catch pattern-compilation errors
        print(str(i) + " failed -- as ascii: " + chr(i))

It gives:

40 failed -- as ascii: (
41 failed -- as ascii: )
42 failed -- as ascii: *
43 failed -- as ascii: +
63 failed -- as ascii: ?
91 failed -- as ascii: [
92 failed -- as ascii: \

All of the failed bytes represent characters that are special in re syntax. This makes me think that Python's re is first printing the query string and then parsing it to do the search. I guess that is not entirely unreasonable, but it is still odd.
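A minimal sketch of what seems to be going on: re does not print the query, it treats any bytes pattern as regular-expression source, so a byte that equals a metacharacter is read as syntax. The helper name classify below is made up for illustration. It also flags bytes that compile without error but do not match themselves literally, which a loop that only catches exceptions cannot detect.

```python
import re

def classify(byte_value):
    """Return 'error', 'not literal', or 'literal' for a one-byte pattern."""
    query = bytes([byte_value])
    try:
        pattern = re.compile(query)
    except re.error:
        return "error"
    # A literal one-byte pattern matches exactly itself and not a different byte.
    other = bytes([(byte_value + 1) % 256])
    if pattern.fullmatch(query) and not pattern.fullmatch(other):
        return "literal"
    return "not literal"

errors = [i for i in range(256) if classify(i) == "error"]
not_literal = [i for i in range(256) if classify(i) == "not literal"]

print("raise re.error:", errors)
print("compile but are not literal:", not_literal)
```

On CPython this should report the same seven failing bytes, plus the bytes for $ . ^ and |, which compile but act as anchors, a wildcard, or an alternation. Those are arguably the more dangerous case for a memory search, since they return wrong matches silently.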

Actually, in writing this question I found a solution: wrap the query in re.escape(query), which inserts a \ before each special character. I will still post this question in case it is helpful to others or anyone has more to add.

Oskari Mantere · Accepted Answer · 2019-12-29 17:15:50Z


The byte for 42 (b'\x2a') corresponds to *, which is a special regex character. (Note that \x42 in the sample memory is 66, i.e. B, not 42.) You can instead use

re.finditer(re.escape(query2), memory)

which will escape the query (convert * to \*) so that it matches a literal * byte in memory.
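As a rough sketch of how this could be applied to the integer search described in the question: the helper name find_int, its parameters, and the sample buffer are assumptions for illustration, not part of the original code. The buffer uses \x2a because 42 decimal is 0x2a, so the search has something to find.

```python
import re

def find_int(memory, value, size=1, byteorder="little"):
    """Yield offsets where `value`, encoded as `size` unsigned bytes, occurs in `memory`."""
    needle = value.to_bytes(size, byteorder=byteorder, signed=False)
    # re.escape makes every byte of the needle match literally.
    for match in re.finditer(re.escape(needle), memory):
        yield match.start()

# Assumed sample buffer: 42 decimal is 0x2a, not 0x42.
memory = b"\x07\x00\x2a\x13\x2a\x00"

print(list(find_int(memory, 7)))           # offsets of the single byte 7
print(list(find_int(memory, 42)))          # offsets of the single byte 42 (b"*")
print(list(find_int(memory, 42, size=2)))  # 42 as a 16-bit little-endian value
```

One caveat: re.finditer returns non-overlapping matches, so a multi-byte needle that can overlap itself (for example b"\x00\x00") may be reported at fewer offsets than it actually occupies.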

edited Dec 29, 2019 at 17:15

answered Dec 28, 2019 at 10:32

