I'm trying to dump HTML from websites into JSON, and I need a way to handle the different character encodings.

I've read that if it isn't utf-8, it's probably ISO-8859-1, so what I'm doing now is:

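import simplejson

json_text = None  # initialise so the check after the loop cannot hit a NameError
for possible_encoding in ["utf-8", "ISO-8859-1"]:
    try:
        # post_dict contains, among other things, website html retrieved
        # with urllib2
        json_text = simplejson.dumps(post_dict, encoding=possible_encoding)
        break
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte, so only the utf-8 attempt can end up here
        pass
if json_text is None:
    # UnicodeDecodeError needs five arguments, so raise a plain ValueError instead
    raise ValueError("could not decode post_dict with any candidate encoding")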

This will of course fail if I come across any other encodings, so I'm wondering if there is a way to solve this problem in the general case.

The reason I'm trying to serialize the HTML in the first place is because I need to send it in a POST request to our NodeJS server. So, if someone has a different solution that allows me to do that (maybe without serializing to JSON at all), I'd be happy to hear that as well.

  • Does the HTML contain any <meta> tags? If so, you could check whether one of them is <meta http-equiv="Content-Type" /> and whether its content attribute tells you the character encoding. Alternatively, when retrieving the HTML in the first place, you may want to check whether the Content-Type header includes an encoding. Commented Jan 29, 2013 at 20:23
  • In Python, "ISO-8859-1" actually means ISO-8859-1. In web pages, ISO-8859-1 means Windows-1252 (cp1252 in Python). Browsers actually use Windows-1252 to decode content that claims to be ISO-8859-1, and this is specified in the HTML5 draft. So you want to specify ["utf-8", "cp1252"]. Commented Jan 30, 2013 at 13:27
    See the replacement encodings here w3.org/TR/2009/WD-html5-20090423/… Commented Jan 30, 2013 at 13:34
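
The fallback order suggested in the comments above can be sketched as follows. This is a minimal illustration rather than a general solution: it is written for Python 3 (where raw bytes and text are separate types; in Python 2 the equivalents are str and unicode), and the helper name decode_html_bytes and the sample byte strings are made up for the example.

```python
def decode_html_bytes(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used) for raw HTML bytes.

    Hypothetical helper: tries each candidate encoding in order.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Python's cp1252 codec leaves a few bytes (0x81, for example) undefined,
    # so even the fallback can fail; substitute U+FFFD instead of raising.
    return raw.decode(candidates[-1], errors="replace"), candidates[-1]


utf8_page = "<p>naïve</p>".encode("utf-8")
legacy_page = b"<p>\x93quoted\x94</p>"  # curly quotes as Windows-1252 bytes

text, used = decode_html_bytes(legacy_page)
```

Decoding legacy_page as ISO-8859-1 would also succeed, but it would yield the C1 control characters U+0093 and U+0094 instead of curly quotes, which is why cp1252 is the more useful fallback for web content.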

1 Answer

You need to know the character encoding regardless of the media type you use to send the POST request (unless you want to send binary blobs). To get the character encoding of your HTML content, see "A good way to get the charset/encoding of an HTTP response in Python".
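
As a rough sketch of the header-based part of that lookup: the function below pulls the charset parameter out of a Content-Type header value using the standard library's email.message module. It assumes Python 3, the function name charset_from_content_type is hypothetical, and it only handles the HTTP header, not <meta> tags inside the document.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset named in a Content-Type value, or None.

    Hypothetical helper; header_value is the raw header string,
    e.g. "text/html; charset=ISO-8859-1".
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


declared = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/html")
```

When the header carries no charset, the caller still has to fall back to something else, such as the <meta> declaration or a guess.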

To send post_dict as JSON, make sure all strings in it are Unicode (convert the HTML to Unicode as soon as you receive it) and don't pass the encoding parameter to json.dumps(). That parameter applies a single encoding to every byte string in the dict, so it won't help you anyway if the websites you get your HTML from use different encodings.
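
A minimal sketch of that approach, assuming Python 3 and that the charset has already been determined for each page. The function name build_post_body, the "url" key, and the sample bytes are illustrative only; post_dict stands in for the dict from the question.

```python
import json


def build_post_body(url, html_bytes, charset):
    """Decode the HTML once, then serialise without any encoding argument.

    Hypothetical helper for illustration.
    """
    # Decode at the boundary so everything stored in the dict is text.
    html_text = html_bytes.decode(charset)
    post_dict = {"url": url, "html": html_text}
    # With the default ensure_ascii=True, json.dumps() escapes non-ASCII
    # characters as \uXXXX, so the result is plain ASCII and encoding it
    # as UTF-8 for the request body cannot fail.
    return json.dumps(post_dict).encode("utf-8")


body = build_post_body("http://example.com/", b"<p>caf\xe9</p>", "iso-8859-1")
```

Because each page is decoded with its own charset before it enters the dict, pages in different encodings can share one JSON payload, and the receiving server only ever sees UTF-8 compatible JSON.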
