
I want to read all strings from a Python file. Example file (/tmp/s.py):

s = '{\x7f5  x'

Now I try to read the string from my script:

import re
find_str = re.compile(r"'(.+?)'")

with open('/tmp/s.py', 'r') as source:
    for line in source:
        all_strings = find_str.findall(line)
        print(all_strings)  # outputs ['{\\x7f5  x']

But I don't want the string (in this case, the byte written in escaped hex notation) to stay escaped. I want to treat the data as it is in my /tmp/s.py file and get a string with an interpreted \x7f byte, instead of the literal \x7f, which is currently represented as \\x7f.

How can I do this?

1 Answer


You'd use the unicode_escape codec to decode the string the same way Python does when reading a string literal:

print(*[s.encode('latin1').decode('unicode_escape') for s in all_strings])

Note that unicode_escape can only decode from bytes, not from text, which is why the string is encoded to bytes first. The codec also assumes Latin-1 source code, not the default UTF-8.

From the Text Encodings section of the Python codecs module:

unicode_escape

Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.
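The Latin-1 limitation shows up as soon as an extracted string contains a character outside Latin-1, because s.encode('latin1') then raises UnicodeEncodeError. One possible workaround, sketched here on the assumption that the input is already a decoded str, is to encode with the backslashreplace error handler, so that such characters become escapes which unicode_escape turns back into the same characters. The helper name decode_escapes is hypothetical and not part of the answer above.

```python
def decode_escapes(text):
    """Interpret backslash escapes in text, keeping non-Latin-1 characters intact."""
    # Characters Latin-1 cannot encode are written as \uXXXX or \UXXXXXXXX
    # escapes, which the unicode_escape codec decodes back to the originals.
    return text.encode('latin1', 'backslashreplace').decode('unicode_escape')


raw = r'{\x7f5  x'           # literal backslash, as read from the file
euro = '\u20ac' + r'\x41'    # a euro sign followed by the literal escape \x41

print(repr(decode_escapes(raw)))    # '{\x7f5  x'
print(repr(decode_escapes(euro)))   # '€A'
```

Plain .encode('latin1') remains the simpler choice when the file is known to contain only Latin-1 characters.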

Demo:

>>> s = r'{\x7f5  x'
>>> s
'{\\x7f5  x'
>>> s.encode('latin1').decode('unicode_escape')
'{\x7f5  x'
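Putting the pieces together, a minimal end-to-end sketch might look like the following. It reuses the regular expression from the question, writes a throwaway copy of the example file so that it runs anywhere, and assumes the file contains only single-quoted, Latin-1 strings. The function name read_strings is hypothetical.

```python
import os
import re
import tempfile

find_str = re.compile(r"'(.+?)'")


def read_strings(path):
    """Return every single-quoted string in the file with escapes interpreted."""
    results = []
    with open(path, 'r') as source:
        for line in source:
            for s in find_str.findall(line):
                # Encode to bytes first: unicode_escape only decodes from bytes.
                results.append(s.encode('latin1').decode('unicode_escape'))
    return results


# Temporary stand-in for /tmp/s.py with the same single line of content.
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as tmp:
    tmp.write("s = '{\\x7f5  x'\n")

strings = read_strings(tmp.name)
os.remove(tmp.name)
print(strings)  # ['{\x7f5  x']
```

The printed list shows \x7f only because repr() escapes the non-printable DEL character; the string itself holds a single character at that position.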

2 Comments

This is a very good answer and answered exactly what I tried to formulate. Many thanks. Any idea why the Python devs would prefer latin1 over utf8, when Python source code is by default in utf8?
@NikolaiTschacher: I suspect the limitation is a historic one; Python 2 source has traditionally been interpreted as Latin-1 as well. Also, Latin-1 means your bytes are decoded one-to-one to Unicode codepoints, which may be a better choice when dealing with arbitrary strings (you can always decode all bytes to a Unicode codepoint, even if it is the wrong one). You cannot specify the source encoding here as you are already picking the unicode_escape codec.
