Python - cannot decode html (urllib)

Question

I'm trying to write the HTML of a web page to a file, but I'm having trouble decoding the characters:

import urllib.request

response = urllib.request.urlopen("https://www.google.com")

charset = response.info().get_content_charset()
print(response.read().decode(charset))

The last line raises an error:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in position 6079: ordinal not in range(128)

response.info().get_content_charset() returns iso-8859-2, but if I inspect the response without decoding it (print(response.read())), the HTML meta tag declares "utf-8" as the encoding. If I pass "utf-8" to decode(), I get a similar problem:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 6111: invalid start byte

What's going on?

@theausome Not if I use the file.write() function, which expects a string.
– Robin71, Commented Jan 29, 2018 at 17:19

Joe · Accepted Answer · 2018-01-29 17:52:03Z

3

You can ignore invalid characters using

response.read().decode("utf-8", 'ignore')

Instead of 'ignore' there are other error handlers, e.g. 'replace'.

https://www.tutorialspoint.com/python/string_encode.htm

https://docs.python.org/3/howto/unicode.html#the-string-type
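To illustrate how the error handlers differ, here is a minimal sketch that needs no network access. It assumes a three-byte sample ending in 0xb6, the byte from the second traceback in the question; the variable names are made up for this example.

```python
# Hypothetical sample: "co" followed by byte 0xb6, which is the letter
# U+015B in ISO-8859-2 but is not a valid start byte in UTF-8.
raw = b"co\xb6"

# The default handler ('strict') raises on the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the byte, so data is lost.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD, so the damage stays visible.
replaced = raw.decode("utf-8", "replace")

# Decoding with the charset the bytes were actually written in loses nothing.
correct = raw.decode("iso-8859-2")

print(strict_failed, repr(ignored), repr(replaced), repr(correct))
```

Note that 'ignore' and 'replace' only hide the mismatch. If the bytes really are ISO-8859-2, as the response header in the question suggests, decoding with that charset keeps every character.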

(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)
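The first traceback in the question is a UnicodeEncodeError, which is raised on the encoding side, not the decoding side. A likely cause, though this is an assumption about the asker's environment, is print() encoding the text for an ASCII-only stdout. The sketch below shows str.encode() with different error handlers and a file write with an explicit encoding; the file name and variable names are hypothetical.

```python
import os
import tempfile

text = "co\u015b"  # contains U+015B, the character named in the first traceback

# Encoding to ASCII fails under the default errors='strict'.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ascii_ignored = text.encode("ascii", errors="ignore")    # drops the character
ascii_replaced = text.encode("ascii", errors="replace")  # substitutes '?'

# Opening the file with an explicit encoding avoids depending on the
# locale default (assumption: UTF-8 is acceptable for the output file).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with open(path, "rb") as handle:
        written = handle.read()

print(strict_failed, ascii_ignored, ascii_replaced, written)
```

Passing encoding= to open() means file.write() accepts the decoded string as-is, regardless of what the terminal can display.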

edited Jan 29, 2018 at 17:52

answered Jan 29, 2018 at 17:44


2 Comments

Maxim Over a year ago

Is this fine to do when print(resp.info().get_content_charset()) returns None? Wasn't sure if this was what OP was also seeing as they stored it in a variable.

Joe Over a year ago

I admit it's not totally clean. A result of None means that the encoding could not be detected, probably because it was not explicitly stated in the headers. See stackoverflow.com/a/24372670/7919597 and stackoverflow.com/questions/4981977/… and stackoverflow.com/questions/14592762/… for other approaches to get the charset. "In general the server may lie about the encoding or do not report it at all".
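One possible fallback for the None case is sketched below. It is not taken from the linked answers. The helper pick_charset, its regular expression, and the "utf-8" default are all assumptions made for illustration: it prefers the header charset, then looks for a charset declaration in a meta tag near the top of the body, then falls back to the default.

```python
import codecs
import re

# Hypothetical pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?\s*([\w.:-]+)", re.IGNORECASE)


def is_known(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def pick_charset(header_charset, body, default="utf-8"):
    """Choose a charset: header first, then meta tag, then the default."""
    if header_charset and is_known(header_charset):
        return header_charset.lower()
    # Only scan the beginning of the document, where meta tags normally live.
    match = META_CHARSET.search(body[:2048])
    if match:
        name = match.group(1).decode("ascii").lower()
        if is_known(name):
            return name
    return default


print(pick_charset(None, b'<html><head><meta charset="UTF-8"></head>'))
```

With urllib this could be called as pick_charset(response.info().get_content_charset(), data), where data is the result of a single response.read() call. Since the server may still misreport the encoding, combining it with errors='replace' when decoding is a reasonable safety net.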
