Properly converting special chars in a Python byte string

Question

Tried to look through a few similar threads, but still confused:

I have a byte string with some special characters (for a double quote in my case) like below. What's the easiest way to properly convert it to a string, so that the special characters are mapped correctly?

b = b'My groovy str\xe2\x80\x9d is now fixed'

Update: regarding decode('utf-8')

>>> b = b'My groovy str\xe2\x80\x9d is now fixed'
>>> b_converted = b.decode("utf-8") 
>>> b_converted
'My groovy str\u201d is now fixed'
>>> print(b_converted)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u201d' in position 13: ordinal not in range(128)

David Duran · Accepted Answer · 2020-07-29 16:00:49Z

2

The following should work:

b_converted = b.decode("utf-8")

Converted from:

b'My groovy str\xe2\x80\x9d is now fixed'

To:

My groovy str” is now fixed


10 Comments

Mark Ransom Over a year ago

That works for the particular example included in the question, but it's important to note that there are lots of different encodings that can be passed to decode. You really need to know the source of the byte string to know the proper parameter.
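A minimal sketch of why the codec choice matters, using the byte string from the question. The variable names are illustrative only. The same bytes decode cleanly as UTF-8, turn into mojibake under latin-1 (which accepts any byte), and fail outright under ASCII:

```python
raw = b'My groovy str\xe2\x80\x9d is now fixed'

# Correct codec: the three bytes e2 80 9d become one character, U+201D.
as_utf8 = raw.decode('utf-8')

# Wrong codec that never fails: each byte becomes its own character,
# so the quote turns into three unrelated characters.
as_latin1 = raw.decode('latin-1')

# Wrong codec that does fail: ASCII has no mapping for bytes >= 0x80.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ascii() keeps the output printable on any terminal.
print(ascii(as_utf8))
print(ascii(as_latin1))
print(len(as_utf8), len(as_latin1), ascii_failed)
```

A codec that raises is the easier case to debug. One that silently succeeds with the wrong mapping, like latin-1 here, produces text that looks plausible until the special character shows up.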

LazyCat Over a year ago

Here's what I get: >>> b.decode('utf-8') 'My groovy str\u201d is now fixed'

LazyCat Over a year ago

Python 3.6, but it doesn't work for me in 2.7 either. Could it be because I typed those special characters in?

LazyCat Over a year ago

I think you're right - I get some funny behavior, when I export LC_ALL=C.UTF-8

Mark Tolonen Over a year ago

'\u201d' == '”'. The display difference arises because on Python 3, repr() shows a character as-is if it is printable in the encoding supported by the terminal, and as an escape code otherwise. On Python 2, only printable ASCII characters are left un-escaped. When you actually print to the terminal, str() is used, and you get a UnicodeEncodeError if a character can't be encoded in the terminal's encoding.
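A short sketch of the point above, assuming Python 3: the escape and the literal character are the same one-character string, and only the display form differs. ascii() is used here because it always escapes non-ASCII characters, so the output does not depend on the terminal.

```python
import unicodedata

text = b'My groovy str\xe2\x80\x9d is now fixed'.decode('utf-8')

# The decoded string holds a single character at index 13.
quote = text[13]
same = (quote == '\u201d')          # escape spelling vs. stored character
name = unicodedata.name(quote)      # official Unicode name

# ascii() escapes every non-ASCII character, whatever the terminal supports.
escaped_form = ascii(text)

print(same, name)
print(escaped_form)
```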

Mark Tolonen · Accepted Answer · 2020-07-29 16:18:00Z

2

Use .decode(encoding) on a byte string to convert it to Unicode.

The encoding cannot always be determined and depends on the source. In this case it is clearly UTF-8.

Ideally, the API used to read the text lets you specify the encoding or, in the case of website requests, detects it from the response headers, so you don't need to call .decode explicitly. For example:

with open('input.txt',encoding='utf8') as file:
    text = file.read()

or

import requests
response = requests.get('http://example.com')
print(response.encoding)
print(response.text) # translated from encoding
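As a self-contained sketch of the file case, the bytes from the question can be written to a temporary file and read back in text mode. The temporary directory and the file name input.txt are assumptions made for illustration. Passing encoding to open() gives the same result as decoding by hand:

```python
import os
import tempfile

raw = b'My groovy str\xe2\x80\x9d is now fixed'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'input.txt')  # hypothetical file name

    # Write the raw bytes exactly as they would arrive from an outside source.
    with open(path, 'wb') as f:
        f.write(raw)

    # Text mode with an explicit encoding decodes while reading.
    with open(path, encoding='utf-8') as f:
        text = f.read()

# Same result as calling .decode('utf-8') on the bytes.
print(text == raw.decode('utf-8'))
print(ascii(text))
```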


5 Comments

LazyCat Over a year ago

Thank you, please see my comment above. Maybe it doesn't work b/c decode doesn't escape special chars, but 'unicode escape' didn't work for me either

Mark Tolonen Over a year ago

@LazyCat That's due to your execution environment. Your output terminal is using a code page that doesn't support all Unicode characters, so print is using the ascii codec to encode the Unicode string for the terminal. What are you running Python under (for example, "Windows 10 64-bit, cmd.exe terminal")?

LazyCat Over a year ago

Thanks, just under plain linux, so wasn't expecting such issues

Mark Tolonen Over a year ago

@LazyCat To be clear, the .decode() is working; the print is doing .encode('ascii') for the terminal, and that is what fails. This indicates the terminal isn't configured correctly for UTF-8. I'm not using Linux, but IIRC Python uses the LC_ALL environment variable to detect the terminal encoding. That's a separate problem from your original question, and there are plenty of existing answers on SO dealing with print issues.

Mark Tolonen Over a year ago

Try export LC_ALL='en_US.utf8'
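To check whether the terminal encoding is the culprit, a sketch along these lines may help, assuming Python 3. It reports the codec that print() will use, reproduces the failure from the question by encoding to ASCII by hand, and shows a lossy fallback that never raises:

```python
import sys

text = 'My groovy str\u201d is now fixed'

# The codec print() encodes to; under LC_ALL=C on older Python 3 versions
# this is typically an ASCII variant rather than UTF-8.
print(sys.stdout.encoding)

# Reproduce by hand what print() does on an ASCII-only terminal.
failure = None
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    failure = exc

# Fallback: replace unencodable characters with backslash escapes.
safe = text.encode('ascii', errors='backslashreplace').decode('ascii')
print(safe)
```

If the first line printed is not a UTF-8 codec, fixing the locale as suggested above is the real cure. Setting the PYTHONIOENCODING environment variable to utf-8 is another way to override the codec Python uses for standard output.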
