I'm trying to dump HTML from websites into JSON, and I need a way to handle the different character encodings.
I've read that if it isn't utf-8, it's probably ISO-8859-1, so what I'm doing now is:
import simplejson

json = None
for possible_encoding in ["utf-8", "ISO-8859-1"]:
    try:
        # post_dict contains, among other things, website html retrieved
        # with urllib2
        json = simplejson.dumps(post_dict, encoding=possible_encoding)
        break
    except UnicodeDecodeError:
        pass
if json is None:
    raise ValueError("could not serialize post_dict with any known encoding")
This will of course fail if I come across any other encodings, so I'm wondering if there is a way to solve this problem in the general case.
The reason I'm trying to serialize the HTML in the first place is because I need to send it in a POST request to our NodeJS server. So, if someone has a different solution that allows me to do that (maybe without serializing to JSON at all), I'd be happy to hear that as well.
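To be concrete, the request I'm trying to make looks roughly like this (the server URL and the dictionary keys are just placeholders, and page_html stands in for the raw bytes I fetch elsewhere):

import urllib2
import simplejson

# page_html stands in for the raw bytes fetched with urllib2 elsewhere.
page_html = "<html><body>hello</body></html>"

post_dict = {"url": "http://example.com/", "html": page_html}

request = urllib2.Request(
    "http://our-node-server.example.com/posts",   # placeholder URL
    data=simplejson.dumps(post_dict, encoding="utf-8"),
    headers={"Content-Type": "application/json"},
)
response = urllib2.urlopen(request)
print response.getcode()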
Does the HTML you retrieve contain <meta> tags? If so, you could check them to see if any of them are <meta http-equiv="Content-Type" /> and see if the content attribute tells you the character encoding. Alternatively, when retrieving the HTML in the first place, you may want to see if the Content-Type header includes an encoding.

Also, don't assume that "ISO-8859-1" actually means ISO-8859-1. In web pages, ISO-8859-1 means Windows-1252 (cp1252 in Python). Browsers actually use Windows-1252 to decode content that claims to be ISO-8859-1, and this is specified in the HTML5 draft. So you want to specify ["utf-8", "cp1252"].
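A rough sketch of both ideas together, assuming the page was fetched with urllib2 (detect_encoding and the regex-based <meta> fallback are just illustrative, not a robust HTML parser):

import re
import urllib2
import simplejson

def detect_encoding(response, html):
    # 1. Charset from the HTTP Content-Type header, if the server sent one.
    charset = response.info().getparam("charset")
    if charset:
        return charset
    # 2. Fall back to a charset declared in a <meta> tag in the page itself.
    match = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', html, re.I)
    if match:
        return match.group(1)
    return None

response = urllib2.urlopen("http://example.com/")
html = response.read()

declared = detect_encoding(response, html)
if declared and declared.lower() in ("iso-8859-1", "latin-1"):
    declared = "cp1252"   # pages claiming ISO-8859-1 are really Windows-1252

# Try the declared encoding first, then utf-8, then cp1252 as a last resort.
candidates = [enc for enc in (declared, "utf-8", "cp1252") if enc]

for encoding in candidates:
    try:
        unicode_html = html.decode(encoding)
        break
    except (UnicodeDecodeError, LookupError):
        continue
else:
    raise ValueError("could not decode page with any of %r" % (candidates,))

# Once the HTML is unicode, simplejson can serialize it without guessing.
json_body = simplejson.dumps({"html": unicode_html})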