Dump JSON from string in unknown character encoding

Question

I'm trying to dump HTML from websites into JSON, and I need a way to handle the different character encodings.

I've read that if it isn't utf-8, it's probably ISO-8859-1, so what I'm doing now is:

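import simplejson

json = None  # stays None if every candidate encoding fails
for possible_encoding in ["utf-8", "ISO-8859-1"]:
    try:
        # post_dict contains, among other things, website HTML retrieved
        # with urllib2
        json = simplejson.dumps(post_dict, encoding=possible_encoding)
        break
    except UnicodeDecodeError:
        pass
if json is None:
    # UnicodeDecodeError cannot be raised without its five arguments
    raise ValueError("post_dict is neither utf-8 nor ISO-8859-1")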

This will of course fail if I come across any other encodings, so I'm wondering if there is a way to solve this problem in the general case.

The reason I'm trying to serialize the HTML in the first place is because I need to send it in a POST request to our NodeJS server. So, if someone has a different solution that allows me to do that (maybe without serializing to JSON at all), I'd be happy to hear that as well.

Does the HTML contain any <meta> tags? If so, you could check whether any of them is <meta http-equiv="Content-Type" /> and whether its content attribute tells you the character encoding. Alternatively, when retrieving the HTML in the first place, you may want to see if the Content-Type header includes an encoding.
– Niet the Dark Absol, Commented Jan 29, 2013 at 20:23
In Python, "ISO-8859-1" actually means ISO-8859-1. In web pages, ISO-8859-1 means Windows-1252 (cp1252 in Python). Browsers actually use Windows-1252 to decode content labelled ISO-8859-1, and this is specified in the HTML5 draft. So you want to specify ["utf-8", "cp1252"].
– Esailija, Commented Jan 30, 2013 at 13:27
See the replacement encodings here: w3.org/TR/2009/WD-html5-20090423/…
– Esailija, Commented Jan 30, 2013 at 13:34

Community · Accepted Answer · 2017-05-23 12:04:47Z

1

You need to know the character encoding regardless of the media type you use for the POST request (unless you want to send binary blobs). To get the character encoding of your HTML content, see "A good way to get the charset/encoding of an HTTP response in Python".
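As a rough illustration of where the encoding can come from, here is a minimal sketch that looks at the Content-Type header value first and then at a <meta> tag near the top of the document. It is written for Python 3 rather than the Python 2 of the question, and the helper names (charset_from_header, charset_from_meta, guess_charset) and the utf-8 default are assumptions made for this example, not part of any library.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def charset_from_meta(raw_html):
    """Return the charset declared in a <meta> tag, or None."""
    # Only scan the beginning of the document, where <head> normally is.
    match = META_CHARSET_RE.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def guess_charset(content_type, raw_html, default="utf-8"):
    """Prefer the HTTP header, then the <meta> tag, then a default."""
    return (
        charset_from_header(content_type)
        or charset_from_meta(raw_html)
        or default
    )
```

The header is checked first because it is the cheapest signal; a declared charset can still be wrong, so the decoding step should be prepared to fall back.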

To send post_dict as JSON, make sure all strings in it are Unicode (convert the HTML to Unicode as soon as you receive it) and don't pass the encoding parameter to json.dumps(). That parameter won't help anyway if the websites you fetch HTML from use different encodings.
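A minimal sketch of that approach, assuming Python 3 and that the charset has already been determined somehow. The helper html_to_text, the sample bytes, and the example URL are hypothetical; the ISO-8859-1 to cp1252 substitution follows the comment under the question about how browsers treat that label.

```python
import json


def html_to_text(raw_html, charset):
    """Decode fetched bytes to a Unicode string as early as possible."""
    # Browsers decode pages labelled ISO-8859-1 as Windows-1252.
    if charset.lower() in ("iso-8859-1", "latin-1", "latin1"):
        charset = "cp1252"
    try:
        return raw_html.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: keep going, marking undecodable bytes.
        return raw_html.decode("utf-8", errors="replace")


# Hypothetical input: bytes as they might arrive from a Windows-1252 page.
raw = "<p>café</p>".encode("cp1252")

post_dict = {
    "url": "http://example.com/",
    "html": html_to_text(raw, "ISO-8859-1"),
}

# Every string is already Unicode, so no encoding argument is needed.
payload = json.dumps(post_dict)
```

With the default ensure_ascii=True, payload contains only ASCII characters (non-ASCII text is written as \uXXXX escapes), so it can be sent in the POST body without further encoding concerns.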

edited May 23, 2017 at 12:04

answered Jan 30, 2013 at 4:22

jfs
