Decoding response with mixed UTF-8 encoding in Python

Question

I'm downloading data from a website using aiohttp and getting a bytes object as the response, but I'm having a hard time decoding it. This is an example of the response I get:

b'\\r\\nLocalit\xc3\xa0' # Località
b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n' # <div>12/09/2019</div>

From what I understand, it has ordinary UTF-8-encoded text mixed with escaped Unicode (literal \uXXXX sequences) for the HTML tags and escaped \r\n for the line breaks. If I try to decode it using str(content, "utf-8"), I still have the HTML tags in this format:

\u003cdiv\u003e12/09/2019\u003c/div\u003e\r\n

Should I just do a manual .replace("\\u003c", "<") for every tag, or is there a more elegant solution?

Thierry Lathuille · Accepted Answer · 2020-04-26 10:32:04Z

2

You could use the 'unicode-escape' codec to convert the escaped Unicode part, then re-encode transparently to bytes (latin-1 is convenient for this, as it provides a 1-to-1 correspondence between bytes and characters), then decode as 'utf-8':

b = b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n\\r\\nLocalit\xc3\xa0'
b.decode('unicode-escape').encode('latin1').decode('utf8')
# '<div>12/09/2019</div>\r\n\r\nLocalità'
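To see why the three-step chain works, it may help to look at each intermediate value. The following is a minimal sketch using the same sample bytes; the variable names (raw, step1, step2, text) are only illustrative and not part of the original answer.

```python
raw = b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n\\r\\nLocalit\xc3\xa0'

# Step 1: 'unicode-escape' expands \uXXXX, \r and \n, but it reads every
# other byte as Latin-1, so the two UTF-8 bytes of 'à' (0xC3 0xA0)
# temporarily become two separate characters (U+00C3, U+00A0).
step1 = raw.decode('unicode-escape')

# Step 2: Latin-1 maps code points 0-255 back to the identical byte
# values, which restores the original UTF-8 bytes untouched.
step2 = step1.encode('latin-1')

# Step 3: the bytes no longer contain escapes, so a plain UTF-8 decode
# yields the final text.
text = step2.decode('utf-8')

print(repr(step1))
print(repr(step2))
print(repr(text))
```

One caveat, as far as I can tell: step 2 only succeeds while every expanded escape stays within U+0000 to U+00FF. An escape such as \u20ac (the euro sign) would make encode('latin-1') raise UnicodeEncodeError.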

edited Apr 26, 2020 at 10:32

answered Apr 26, 2020 at 9:53


2 Comments

Nicola Over a year ago

If I use unicode-escape on the text, I get LocalitÃ instead of Località

Thierry Lathuille Over a year ago

Sorry, I had missed that part, I edited the answer!
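The mojibake reported in the comment above appears because 'unicode-escape' reads raw non-ASCII bytes as Latin-1. One possible way to sidestep that, sketched here rather than taken from the answer, is to decode as UTF-8 first and then expand only the escape sequences with a regular expression. The helper name decode_mixed and the set of handled escapes (\uXXXX, \r, \n, \t) are assumptions for illustration.

```python
import re

# Matches a literal backslash-u followed by four hex digits, or a
# literal backslash followed by r, n or t. (Assumed escape set.)
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})|\\([rnt])')
_SIMPLE = {'r': '\r', 'n': '\n', 't': '\t'}


def decode_mixed(data: bytes) -> str:
    """Hypothetical helper: UTF-8 decode, then expand textual escapes."""
    text = data.decode('utf-8')

    def expand(match):
        if match.group(1) is not None:
            # \uXXXX -> the character with that code point
            return chr(int(match.group(1), 16))
        # \r, \n, \t -> the corresponding control character
        return _SIMPLE[match.group(2)]

    return _ESCAPE.sub(expand, text)


sample = b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n\\r\\nLocalit\xc3\xa0'
print(repr(decode_mixed(sample)))
```

This sketch also copes with escapes above U+00FF, but it does not combine surrogate pairs or treat an escaped backslash specially, so it is a starting point under those assumptions rather than a complete decoder.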
