
I'm trying to write the HTML of a web page to a file, but I have a problem decoding characters:

import urllib.request

response = urllib.request.urlopen("https://www.google.com")

charset = response.info().get_content_charset()
print(response.read().decode(charset))

The last line raises an error:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in position 6079: ordinal not in range(128)

response.info().get_content_charset() returns iso-8859-2, but if I check the content of the response without decoding it (print(response.read())), the HTML meta tag declares "utf-8" as the encoding. If I pass "utf-8" to decode(), I get a similar problem:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 6111: invalid start byte

What's going on?

  • Simply print(response.read()) should work. Commented Jan 29, 2018 at 17:14
  • @theausome Not if I use file.write(), which expects a string. Commented Jan 29, 2018 at 17:19
  • Why don't you open the file in binary mode? Commented Jan 30, 2018 at 19:29
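
The binary-mode suggestion in the last comment could look roughly like the sketch below. The bytes in raw are a stand-in for response.read(), and the file name page.html and the temporary directory are assumptions made for illustration. Because the bytes are written unchanged, no decoding step (and therefore no charset guess) is involved.

```python
import os
import tempfile

# Stand-in for response.read(); a real script would take the bytes from
# urllib.request.urlopen(...). 0xb6 is the byte from the question's traceback.
raw = b"<html>\xb6</html>"

# Hypothetical output location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "page.html")

# "wb" writes the bytes exactly as received, so nothing has to be decoded.
with open(path, "wb") as f:
    f.write(raw)

# Read the file back to confirm the content is byte-for-byte identical.
with open(path, "rb") as f:
    saved = f.read()
```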

1 Answer


You can ignore invalid characters using

response.read().decode("utf-8", 'ignore')

Instead of 'ignore' there are other error handlers, e.g. 'replace'.
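
As a small sketch of how the two handlers differ, the example below decodes a short byte string that is valid ISO-8859-2 but not valid UTF-8. The sample word is an assumption chosen so that it contains the 0xb6 byte from the question. Note that both handlers lose information; decoding with the charset the bytes were actually written in keeps the character.

```python
# b"\xb6nieg" is the Polish word "śnieg" encoded as ISO-8859-2.
# 0xb6 is not a valid start byte in UTF-8, as in the question's traceback.
raw = b"\xb6nieg"

ignored = raw.decode("utf-8", "ignore")    # the undecodable byte is dropped
replaced = raw.decode("utf-8", "replace")  # it becomes U+FFFD instead
correct = raw.decode("iso-8859-2")         # the matching charset keeps it

print(repr(ignored), repr(replaced), repr(correct))
```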

https://www.tutorialspoint.com/python/string_encode.htm

https://docs.python.org/3/howto/unicode.html#the-string-type

(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)
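
To illustrate the encoding direction, here is a minimal sketch using the character '\u015b' from the question's first traceback. With the default errors='strict', encoding it to ASCII raises UnicodeEncodeError, which is the same exception type the question shows; the sample word itself is an assumption.

```python
text = "\u015bnieg"  # contains the character named in the first traceback

# errors='strict' is the default, so ASCII cannot represent '\u015b'.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# UTF-8 can represent the character; it takes two bytes.
as_utf8 = text.encode("utf-8")

# With errors='replace' the unencodable character becomes '?'.
with_replace = text.encode("ascii", errors="replace")
```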


2 Comments

Is this fine to do when print(resp.info().get_content_charset()) returns None? Wasn't sure if this was what OP was also seeing as they stored it in a variable.
I admit it's not totally clean. This means that the system was not able to detect the encoding, probably because it was not explicitly stated in the headers. See stackoverflow.com/a/24372670/7919597 and stackoverflow.com/questions/4981977/… and stackoverflow.com/questions/14592762/… for other approaches to get the charset. "In general the server may lie about the encoding or do not report it at all".
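
One possible fallback when get_content_charset() returns None is sketched below. The helper guess_charset is hypothetical (it is not part of urllib), and the regular expression is only a rough heuristic for a charset declaration near the top of the page, not a full HTML parser. As the quote above says, the declared value can still be wrong.

```python
import re

def guess_charset(header_charset, raw, default="utf-8"):
    """Hypothetical helper: pick a charset name for decoding raw bytes."""
    # Prefer the charset from the Content-Type header when there is one.
    if header_charset:
        return header_charset
    # Otherwise look for a charset declaration in the first 2048 bytes.
    match = re.search(rb"""charset=["']?([\w-]+)""", raw[:2048], re.IGNORECASE)
    if match:
        return match.group(1).decode("ascii")
    # Fall back to an assumed default if nothing was declared.
    return default

page = b'<html><head><meta charset="iso-8859-2"></head>\xb6</html>'
charset = guess_charset(None, page)
text = page.decode(charset, "replace")
```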
