I'm downloading data from a website using aiohttp and I'm getting a bytes object as a response, but I'm having a hard time decoding it. This is an example of the response I get:

b'\\r\\nLocalit\xc3\xa0' # Località
b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n' # <div>12/09/2019</div>

From what I understand, it has normal UTF-8 for the text and escaped Unicode (\uXXXX sequences) for the HTML tags and line breaks. If I try to decode it using str(content, "utf-8"), I still have the HTML tags in this format:

\u003cdiv \u003e12/09/2019\u003c/div\u003e\r\n

Should I just do a manual .replace("\\u003c", "<") for every escaped character, or is there a more elegant solution?

1 Answer

You could use the 'unicode-escape' codec to decode the escaped sequences, then re-encode the result transparently to bytes (latin-1 is convenient for this, as it provides a 1-to-1 correspondence between byte values and the first 256 code points), then decode as 'utf-8':

b = b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n\\r\\nLocalit\xc3\xa0'
b.decode('unicode-escape').encode('latin1').decode('utf8')
# '<div>12/09/2019</div>\r\n\r\nLocalità'
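
One caveat with the round trip above: it only works while every escaped code point is below U+0100. An escape such as \u20ac decodes to '€', which latin-1 cannot encode, so the .encode('latin1') step would raise UnicodeEncodeError. Below is a minimal sketch of an alternative, assuming the payload is valid UTF-8 and only uses \uXXXX plus a few single-character escapes. The name decode_mixed and the escape table are hypothetical, chosen for illustration, and surrogate pairs are not handled.

```python
import re

# Matches either a \uXXXX escape or one of a few single-character escapes.
_ESCAPE_RE = re.compile(r'\\u([0-9a-fA-F]{4})|\\([rnt\\"/])')

# Assumed set of simple escapes; extend it if the server emits others.
_SIMPLE = {'r': '\r', 'n': '\n', 't': '\t', '\\': '\\', '"': '"', '/': '/'}


def decode_mixed(raw: bytes) -> str:
    """Decode UTF-8 bytes that also contain literal backslash escapes."""
    text = raw.decode('utf-8')  # real UTF-8 text is decoded first

    def _sub(match: re.Match) -> str:
        if match.group(1) is not None:
            # \uXXXX -> the character with that code point
            return chr(int(match.group(1), 16))
        return _SIMPLE[match.group(2)]

    return _ESCAPE_RE.sub(_sub, text)


b = b'\\u003cdiv\\u003e12/09/2019\\u003c/div\\u003e\\r\\n\\r\\nLocalit\xc3\xa0'
print(repr(decode_mixed(b)))
```

Because the UTF-8 decode happens before the escapes are expanded, no latin-1 step is needed and escapes above U+00FF are handled as well.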

2 Comments

If I use unicode-escape on the text I get LocalitÃ instead of Località
Sorry, I had missed that part, I edited the answer!
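
For reference, the garbling mentioned in the first comment can be reproduced in isolation. This is a small sketch, assuming Python 3: the 'unicode-escape' codec treats each non-escape byte as the code point with the same value (as latin-1 does), so the two UTF-8 bytes of 'à' become two separate characters until the latin-1 round trip restores them.

```python
raw = b'Localit\xc3\xa0'  # UTF-8 bytes for 'Località'

# Each byte is mapped to the code point of the same value,
# so 0xC3 0xA0 becomes 'Ã' followed by a non-breaking space.
garbled = raw.decode('unicode-escape')
print(repr(garbled))

# Encoding as latin-1 recovers the original bytes, which then decode as UTF-8.
repaired = garbled.encode('latin1').decode('utf8')
print(repaired)
```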