I am trying to use Python's re module to find binary substrings, but I get a somewhat puzzling error.
Here is a small example to demonstrate the issue (Python 3):
import re
memory = b"\x07\x00\x42\x13"
query1 = (7).to_bytes(1, byteorder="little", signed=False)
query2 = (42).to_bytes(1, byteorder="little", signed=False)
# Works
for match in re.finditer(query1, memory):
    print(match.group(0))
# Causes error
for match in re.finditer(query2, memory):
    print(match.group(0))
The first loop correctly prints b'\x07' while the second gives the following error:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.7/re.py", line 230, in finditer
    return _compile(pattern, flags).finditer(string)
  File "/usr/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 651, in _parse
    source.tell() - here + len(this))
re.error: nothing to repeat at position
For context, I am trying to find specific integers within the memory space of a program, in a similar fashion to tools like Cheat Engine. This is being done using Python scripts within GDB.
-- Note 1 --
I suspect this may be related to the fact that 42 corresponds to the printable ASCII character * while 7 does not. For example, if you print the query strings you get:
>>> print(query1)
b'\x07'
>>> print(query2)
b'*'
-- Note 2 --
Actually, it looks like this is unrelated to whether the byte is printable in ASCII. If you run:
import re
memory = b"\x07\x00\x42\x13"
# Try every possible single-byte value as a pattern
for i in range(256):
    query = i.to_bytes(1, byteorder="little", signed=False)
    try:
        for match in re.finditer(query, memory):
            pass
    except re.error:
        # The byte could not be compiled as a regex pattern
        print(str(i) + " failed -- as ascii: " + chr(i))
It gives:
40 failed -- as ascii: (
41 failed -- as ascii: )
42 failed -- as ascii: *
43 failed -- as ascii: +
63 failed -- as ascii: ?
91 failed -- as ascii: [
92 failed -- as ascii: \
All of the failing bytes correspond to characters that are special in re syntax. This makes me think that re parses the query bytes as a pattern, so that a byte such as 42 is read as the * metacharacter rather than as a literal value. I guess that is not entirely unreasonable, but it is still surprising.
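A small sketch that seems consistent with this, using a byte the loop above does not flag: 46 is the byte value of ".", which re treats as "any byte except newline". No exception is raised, so the try/except test misses it, yet the search returns matches that are not literal occurrences. The memory value is the one from the example; the variable name matches is just for illustration.

```python
import re

memory = b"\x07\x00\x42\x13"

# 46 is the byte value of ".", which re reads as "any byte except newline".
query = (46).to_bytes(1, byteorder="little", signed=False)

# No error is raised, but every byte in memory matches,
# even though the byte 0x2E never occurs in memory.
matches = [m.group(0) for m in re.finditer(query, memory)]
print(query in memory)  # False
print(len(matches))     # 4
```

So the set of problematic bytes is probably larger than the seven that raise an error; others (such as . ^ $ |) would presumably compile fine but change what is matched.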
While writing this question I found a solution: wrap the query in re.escape(query), which inserts a \ before each special character. I am still posting the question in case it is helpful to others, or in case anyone has more to add.
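For anyone landing here, a minimal sketch of the re.escape approach. The helper name find_literal and the memory contents (which include two 0x2A bytes so that there is something to find) are made up for illustration and are not part of the original example.

```python
import re

# Hypothetical memory contents: the value 42 (0x2A) appears at offsets 2 and 4.
memory = b"\x07\x00\x2a\x13\x2a"


def find_literal(query: bytes, data: bytes) -> list:
    """Return the start offset of every literal occurrence of query in data."""
    # re.escape accepts bytes and returns bytes with metacharacters escaped,
    # so the query is matched byte-for-byte instead of being parsed as regex syntax.
    pattern = re.escape(query)
    return [m.start() for m in re.finditer(pattern, data)]


query = (42).to_bytes(1, byteorder="little", signed=False)
offsets = find_literal(query, memory)
print(offsets)  # [2, 4]
```

Note that finditer reports non-overlapping matches, which only matters for multi-byte queries whose occurrences can overlap.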