I looked through a few similar threads but am still confused:

I have a byte string containing some special characters (a right double quotation mark, in my case), like the one below. What's the easiest way to convert it to a string so that the special characters are mapped correctly?

b = b'My groovy str\xe2\x80\x9d is now fixed'

Update: regarding decode('utf-8')

>>> b = b'My groovy str\xe2\x80\x9d is now fixed'
>>> b_converted = b.decode("utf-8") 
>>> b_converted
'My groovy str\u201d is now fixed'
>>> print(b_converted)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u201d' in position 13: ordinal not in range(128)

2 Answers

The following should work:

b_converted = b.decode("utf-8") 

Converted from:

b'My groovy str\xe2\x80\x9d is now fixed'

To:

My groovy str” is now fixed

10 Comments

That works for the particular example included in the question, but it's important to note that there are many different encodings that can be passed to decode. You really need to know the source of the byte string to know the proper parameter.
Here's what I get: >>> b.decode('utf-8') 'My groovy str\u201d is now fixed'
Python 3.6, but it doesn't work for me in 2.7 either. Could it be because I typed those special characters in?
I think you're right: I get some funny behavior when I export LC_ALL=C.UTF-8.
'\u201d' == '”'. The difference is only in how the string is displayed. On Python 3, repr() shows a character as-is if it is printable and the terminal's encoding supports it; otherwise it shows an escape code. On Python 2, repr() leaves only printable ASCII characters unescaped. When you actually print to the terminal, str() is used instead, and a character that can't be encoded in the terminal's encoding raises UnicodeEncodeError.
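
A minimal sketch of the point in that comment, assuming Python 3. The variable names are illustrative, and ascii() is used so the output stays printable on any terminal:

```python
# The decoded character and its escape sequence are the same one-character string.
s = b'\xe2\x80\x9d'.decode('utf-8')

same = (s == '\u201d')   # the escape is only a way of writing the character
escaped = ascii(s)       # like repr(), but every non-ASCII character is escaped
print(same, len(s), escaped)

# This is roughly what print() must do when the terminal encoding is ASCII.
try:
    s.encode('ascii')
except UnicodeEncodeError as exc:
    print('cannot encode:', exc.reason)
```

The string itself is correct in both cases; only the step that turns it back into bytes for the terminal can fail.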

Use .decode(encoding) on a byte string to convert it to Unicode.

The encoding cannot always be determined from the bytes alone; it depends on the source. In this case it is clearly UTF-8.
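
As a rough illustration of why the source encoding matters, here is a sketch that decodes the bytes from the question with three codecs. The variable names are made up for the example, and latin-1 is chosen only because it accepts every byte value:

```python
# Same bytes, different codecs.
raw = b'My groovy str\xe2\x80\x9d is now fixed'

as_utf8 = raw.decode('utf-8')      # bytes e2 80 9d become one character, U+201D
as_latin1 = raw.decode('latin-1')  # each byte becomes its own character (mojibake)

print(len(as_utf8), ascii(as_utf8))
print(len(as_latin1), ascii(as_latin1))

# A stricter codec refuses the bytes outright instead of guessing.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('ascii failed:', exc.reason)
```

Decoding with the wrong codec does not always raise an error; it can silently produce a different string, which is why knowing the source matters.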

Ideally, when reading text, the API used to read the data lets you specify the encoding or, in the case of website requests, detects it from the response headers, so you don't need to call .decode explicitly. For example:

with open('input.txt',encoding='utf8') as file:
    text = file.read()

or

import requests
response = requests.get('http://example.com')
print(response.encoding)
print(response.text)  # body decoded using response.encoding
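
The file case can be checked offline with a small round trip. This is only a sketch: the temporary directory and the variable names are assumptions for the example, and input.txt is created by the snippet itself:

```python
import os
import tempfile

original = 'My groovy str\u201d is now fixed'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'input.txt')

    # Write the raw UTF-8 bytes, as an external program might have.
    with open(path, 'wb') as f:
        f.write(original.encode('utf-8'))

    # Let open() do the decoding.
    with open(path, encoding='utf8') as f:
        text = f.read()

    # Equivalent manual route: read bytes, then decode explicitly.
    with open(path, 'rb') as f:
        manual = f.read().decode('utf-8')

print(text == original, manual == original)
```

Both routes give the same string; passing encoding to open() simply moves the decode step into the file object.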

5 Comments

Thank you, please see my comment above. Maybe it doesn't work because decode doesn't escape special chars, but 'unicode_escape' didn't work for me either.
@LazyCat That's due to your execution environment. Your output terminal is using a code page that doesn't support all Unicode characters, so print is using the ascii codec to encode the Unicode string for the terminal. What are you running Python under (for example, "Windows 10 64-bit, cmd.exe terminal")?
Thanks, just under plain Linux, so I wasn't expecting such issues.
@LazyCat To be clear, the .decode() is working; it is print that does the equivalent of .encode('ascii') for the terminal and fails. This indicates the terminal isn't configured for UTF-8. I'm not using Linux, but IIRC Python uses the LC_ALL environment variable to detect the terminal encoding. That's a separate problem from your original question, and there are plenty of existing answers on SO dealing with print issues.
Try export LC_ALL='en_US.utf8'
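
The print failure discussed in these comments can be reproduced without touching the real terminal. The sketch below assumes Python 3 and stands in for stdout with an in-memory stream; write_as is a hypothetical helper, not part of any library:

```python
import io

text = b'My groovy str\xe2\x80\x9d is now fixed'.decode('utf-8')

def write_as(encoding, errors='strict'):
    """Print text through a stream with the given encoding, as print() does for stdout."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors, newline='\n')
    print(text, file=stream)
    stream.flush()
    return buffer.getvalue()

# An ASCII stream fails just like the terminal in the question.
try:
    write_as('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

utf8_bytes = write_as('utf-8')                                # succeeds
escaped_bytes = write_as('ascii', errors='backslashreplace')  # succeeds, with an escape
print(ascii_failed, utf8_bytes, escaped_bytes)
```

To see which encoding the real terminal stream uses, inspect sys.stdout.encoding. Besides fixing the locale, setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is another way to override it, and on Python 3.7 and later sys.stdout.reconfigure(encoding='utf-8') does the same from inside the program.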
