
I am trying to use python re to find binary substrings but I get a somewhat puzzling error.

Here is a small example to demonstrate the issue (python3):

import re

memory = b"\x07\x00\x42\x13"

query1 = (7).to_bytes(1, byteorder="little", signed=False)
query2 = (42).to_bytes(1, byteorder="little", signed=False)

# Works
for match in re.finditer(query1, memory):
    print(match.group(0))

# Causes error
for match in re.finditer(query2, memory):
    print(match.group(0))

The first loop correctly prints b'\x07' while the second gives the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.7/re.py", line 230, in finditer
    return _compile(pattern, flags).finditer(string)
  File "/usr/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 651, in _parse
    source.tell() - here + len(this))
re.error: nothing to repeat at position 

For context, I am trying to find specific integers within the memory space of a program, in a similar fashion to tools like Cheat Engine. This is done with Python scripts running inside gdb.

-- Note 1 --

I have a suspicion that this may be related to the fact that 42 is a printable ASCII character (*) while 7 is not. For example, if you print the query strings you get:

>>> print(query1)
b'\x07'
>>> print(query2)
b'*'

-- Note 2 --

Actually, it looks like this is unrelated to whether the byte is printable ASCII. If you run:

import re

memory = b"\x07\x00\x42\x13"

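for i in range(256):  # cover every possible byte value, 0 through 255
    query = i.to_bytes(1, byteorder="little", signed=False)

    try:
        for match in re.finditer(query, memory):
            pass
    except re.error:  # only report bytes that fail to compile as a pattern
        print(str(i) + " failed -- as ascii: " + chr(i))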

It gives:

40 failed -- as ascii: (
41 failed -- as ascii: )
42 failed -- as ascii: *
43 failed -- as ascii: +
63 failed -- as ascii: ?
91 failed -- as ascii: [
92 failed -- as ascii: \

All of the failing bytes are characters that are special in re syntax. This makes me think that re parses the query bytes as a regex pattern rather than treating them as a literal string to search for. I guess that is not entirely unreasonable, but it was still surprising.
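A minimal sketch of that behaviour, assuming Python 3.7 or later: the byte for 42 fails to compile as a pattern on its own, while its escaped form compiles and matches a literal `*`. The names `star`, `compiled`, and `escaped` are only illustrative.

```python
import re

# Byte 42 is b"*", a regex quantifier; on its own it has nothing to repeat.
star = (42).to_bytes(1, byteorder="little", signed=False)

try:
    re.compile(star)
    compiled = True
except re.error:
    compiled = False

# re.escape puts a backslash in front of the metacharacter.
escaped = re.escape(star)

print(compiled)  # False
print(escaped)   # b'\\*'
```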

In writing this question I found a solution: wrap the query in re.escape(query), which inserts a \ before each special character. I am still posting the question in case it is helpful to others, or in case anyone has more to add.
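As a rough sketch of how that could look for the integer-search use case, the helper below encodes an integer and escapes it before searching. The function name `find_int_offsets`, the `width` parameter, the little-endian layout, and the sample buffer are all assumptions made for illustration; they are not taken from the original script.

```python
import re

def find_int_offsets(memory: bytes, value: int, width: int = 1) -> list:
    """Return the start offset of every non-overlapping occurrence of value.

    Hypothetical helper: the name, the width parameter, and the
    little-endian encoding are assumptions for this example.
    """
    needle = value.to_bytes(width, byteorder="little", signed=False)
    # Escape so bytes such as b"*" or b"(" are matched literally.
    pattern = re.compile(re.escape(needle))
    return [m.start() for m in pattern.finditer(memory)]

# Made-up sample buffer containing 7 once and 42 (0x2a) twice.
memory = b"\x07\x00\x2a\x13\x2a"
print(find_int_offsets(memory, 7))   # [0]
print(find_int_offsets(memory, 42))  # [2, 4]
```

Note that finditer reports non-overlapping matches only, which makes no difference for single-byte needles but can skip hits for wider ones.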

1 Answer

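Decimal 42 is the byte \x2a, which is the character *, a special regex character. (The \x42 in your sample memory is decimal 66, the letter B, so it is a different byte.) You can instead use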

re.finditer(re.escape(query2), memory)

which escapes the query (converting * to \*) so that the pattern matches a literal * byte in the string.
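If no regex features are needed, a different technique is to skip re altogether and loop over bytes.find, which always treats the needle as a literal. This is only a sketch: `find_all` is a hypothetical helper name and the buffer is made up for the example.

```python
def find_all(memory: bytes, needle: bytes):
    """Yield every offset at which needle occurs, including overlaps."""
    start = memory.find(needle)
    while start != -1:
        yield start
        # Resume one byte later so overlapping occurrences are also found.
        start = memory.find(needle, start + 1)

# Made-up sample buffer containing 42 (0x2a) at offsets 2 and 4.
memory = b"\x07\x00\x2a\x13\x2a"
needle = (42).to_bytes(1, byteorder="little", signed=False)
print(list(find_all(memory, needle)))  # [2, 4]
```

Unlike the finditer approach, this also reports overlapping occurrences, which may matter when searching for multi-byte values.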
