In Python 3, suppose I have

>>> thai_string = 'สีเ'

Using encode gives

>>> thai_string.encode('utf-8')
b'\xe0\xb8\xaa\xe0\xb8\xb5\xe0\xb9\x80'

My question: how can I get encode() to return a bytes sequence using \u instead of \x? And how can I decode them back to a Python 3 str type?

I tried using the ascii builtin, which gives

>>> ascii(thai_string)
"'\\u0e2a\\u0e35\\u0e40'"

But this doesn't seem quite right, as I can't decode it back to obtain thai_string.

Python documentation tells me that

  • \xhh escapes the character with the hex value hh while
  • \uxxxx escapes the character with the 16-bit hex value xxxx

The documentation says that \u is only used in string literals, but I'm not sure what that means. Is this a hint that my question has a flawed premise?

1 Answer

You can use unicode_escape:

>>> thai_string.encode('unicode_escape')
b'\\u0e2a\\u0e35\\u0e40'

Note that encode() will always return a byte string (bytes) and the unicode_escape encoding is intended to:

Produce a string that is suitable as Unicode literal in Python source code
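
The question also asks how to get back to a str. As a minimal sketch, assuming the same thai_string as in the question, the same codec can be used in the decode direction (the names escaped and restored are only illustrative):

```python
thai_string = 'สีเ'

# str -> bytes: each of these Thai characters becomes a \uXXXX escape sequence
escaped = thai_string.encode('unicode_escape')
print(escaped)  # b'\\u0e2a\\u0e35\\u0e40'

# bytes -> str: decoding with the same codec turns the escapes back into characters
restored = escaped.decode('unicode_escape')
print(restored == thai_string)  # True
```

Note that characters below U+0100 are escaped as \xhh rather than \uXXXX by this codec, so the output only consists purely of \u escapes for text like this Thai example.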

3 Comments

Perfect. But why does this string have two slashes before the "u" while the "x" only has one?
This is simply how Python displays a literal backslash inside a quoted string. Compare '\\n' (literal backslash, literal n) to '\n' (newline character).
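
A quick way to convince yourself that the doubled backslash is only a display artifact is to count the bytes. This is a small sketch using the escaped value from the answer; the variable name is just for illustration:

```python
escaped = b'\\u0e2a\\u0e35\\u0e40'

# repr() shows each backslash doubled, but the data holds a single backslash per escape
print(repr(escaped))            # b'\\u0e2a\\u0e35\\u0e40'
print(len(escaped))             # 18: three escapes of six bytes each
print(escaped.count(b'\\'))     # 3
print(escaped.decode('ascii'))  # \u0e2a\u0e35\u0e40

# The same distinction for the newline example
print(len('\\n'))  # 2: a backslash and the letter n
print(len('\n'))   # 1: a single newline character
```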
If you want the result as a str rather than bytes, you can tack on .decode('ascii').
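
As a hedged sketch of how these pieces relate: for this particular string, the str form of the escaped text matches what ascii() produces, minus the surrounding quotes, and the ascii() output can be read back with ast.literal_eval. The names escaped_text and quoted are assumptions made for the example.

```python
import ast

thai_string = 'สีเ'

# Escaped text as a str instead of bytes
escaped_text = thai_string.encode('unicode_escape').decode('ascii')
print(escaped_text)  # \u0e2a\u0e35\u0e40

# For this string, ascii() yields the same escapes wrapped in quotes
quoted = ascii(thai_string)
print(quoted == "'" + escaped_text + "'")  # True

# The quoted form is a valid Python string literal, so it can be parsed back
print(ast.literal_eval(quoted) == thai_string)  # True

# The unquoted str form goes back through bytes and the unicode_escape codec
print(escaped_text.encode('ascii').decode('unicode_escape') == thai_string)  # True
```

This may explain why the ascii() result in the question did not seem to decode: it is a string literal including its quotes, not encoded bytes.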