Decoding HTML Entities With Python

Question

The following Python code uses BeautifulStoneSoup to fetch the LibraryThing API information for Tolkien's "The Children of Húrin".

import urllib2

from BeautifulSoup import BeautifulStoneSoup

URL = ("http://www.librarything.com/services/rest/1.0/"
            "?method=librarything.ck.getwork&id=1907912"
            "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
print title_field.find('fact').string

Unfortunately, instead of 'Húrin', it prints out 'HÃºrin'. This is obviously an encoding issue, but I can't work out what I need to do to get the expected output. Help would be greatly appreciated.

sth · Accepted Answer · 2012-05-08 12:17:32Z

4

In the source of the web page it looks like this: The Children of HÃºrin. So the encoding is already broken somewhere on their side before it even gets converted to XML...

If it's a general issue with all the books and you need to work around it, this seems to work:

unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
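The one-liner above is Python 2. As a rough sketch of the same round trip on Python 3 (where str is already Unicode), something like the following might work. The helper name fix_mojibake is hypothetical, and the sketch assumes the damage is exactly "UTF-8 bytes that were decoded as Latin-1"; text that does not fit that pattern is returned unchanged.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Hypothetical helper: assumes the only damage is a single
    UTF-8 -> Latin-1 misdecoding. Anything else is returned as-is.
    """
    try:
        # Recover the original bytes, then decode them the right way.
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # the text was probably not mojibake of this kind.
        return text


# 'H', U+00C3, U+00BA, 'rin' is how the broken title comes back.
broken = "The Children of H\u00c3\u00barin"
fixed = fix_mojibake(broken)
print(fixed)
```

The try/except is a design choice rather than part of the original answer: it keeps already-correct titles from raising or being mangled if only some records are affected.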

edited May 8, 2012 at 12:17

answered Mar 9, 2009 at 23:05

sth


1 Comment

Daniel Watkins Over a year ago

Yup, I guess that's it. I've contacted LibraryThing about sorting it out. Thanks. :)

Ignacio Vazquez-Abrams · Answer · 2009-03-09 22:53:49Z

1

The web page may be lying about its encoding. The output looks like UTF-8 that has been read as Latin-1. If you end up with a str, decode it as UTF-8. If you have a unicode object instead, encode it as Latin-1 first to get the original bytes back, then decode those as UTF-8.
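To make the two cases concrete, here is a small sketch in Python 3 terms, where Python 2's str corresponds to bytes and unicode corresponds to str. It first reproduces the garbling and then repairs it both ways. The function name repair is hypothetical, and the sketch assumes the bytes really are UTF-8.

```python
original = "H\u00farin"                 # the intended title fragment

raw = original.encode("utf-8")          # bytes as sent: b'H\xc3\xbarin'
misread = raw.decode("latin1")          # what a wrong Latin-1 decode yields


def repair(value):
    """Return correctly decoded text for either case described above."""
    if isinstance(value, bytes):
        # Still raw bytes: decoding as UTF-8 is all that is needed.
        return value.decode("utf-8")
    # Already (wrongly) decoded text: go back to bytes via Latin-1 first.
    return value.encode("latin1").decode("utf-8")


print(repr(misread))
print(repair(raw))
print(repair(misread))
```

Latin-1 maps every byte value to the code point with the same number, which is why encoding the misread text as Latin-1 recovers the original bytes without loss.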

answered Mar 9, 2009 at 22:53

Ignacio Vazquez-Abrams

