The following Python code uses urllib2 to fetch the LibraryThing API information for Tolkien's "The Children of Húrin" and BeautifulStoneSoup to parse it.

import urllib2

from BeautifulSoup import BeautifulStoneSoup

URL = ("http://www.librarything.com/services/rest/1.0/"
            "?method=librarything.ck.getwork&id=1907912"
            "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
print title_field.find('fact').string

Unfortunately, instead of 'Húrin', it prints out 'HÃºrin'. This is obviously an encoding issue, but I can't work out what I need to do to get the expected output. Help would be greatly appreciated.

2 Answers

In the source of the web page it looks like this: The Children of HÃºrin. So the encoding is already broken somewhere on their side, before the text is even converted to XML.

If it's a general issue with all the books and you need to work around it, this seems to work:

unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
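
To see why that round trip works, here is a minimal sketch written in Python 3 syntax (the code above is Python 2, where `unicode` plays the role of Python 3's `str`). It assumes the damage is exactly one step of UTF-8 bytes being read as Latin-1. `repair_mojibake` is a hypothetical helper name, not part of BeautifulSoup.

```python
def repair_mojibake(text):
    """Undo one round of UTF-8 bytes misread as Latin-1.

    Returns the text unchanged if it does not fit that pattern.
    """
    try:
        # Latin-1 maps each code point 0-255 back to the same byte value,
        # which recovers the original UTF-8 byte sequence.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the damage: correct UTF-8 bytes decoded with the wrong codec.
broken = "The Children of Húrin".encode("utf-8").decode("latin-1")
repaired = repair_mojibake(broken)
print(repr(broken))
print(repr(repaired))
```

The fallback in the `except` branch means a title that was never damaged, or one containing characters outside Latin-1, is returned as-is rather than raising.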

1 Comment

Yup, I guess that's it. I've contacted LibraryThing about sorting it out. Thanks. :)

The web page may be lying about its encoding. The output looks like UTF-8. If you end up with a str, decode it as UTF-8. If you have a unicode object instead, encode it as Latin-1 first to recover the original bytes, then decode those as UTF-8.
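
The two cases can be sketched as a single function. This uses Python 3 names, where `bytes` corresponds to the Python 2 str and `str` to the Python 2 unicode. `to_text` is a hypothetical helper, and it assumes the underlying bytes really are UTF-8.

```python
def to_text(value):
    """Return correctly decoded text for a UTF-8 payload.

    bytes: decode as UTF-8 directly.
    str:   assume it was wrongly decoded as Latin-1, so recover the
           original bytes first, then decode them as UTF-8.
    """
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value.encode("latin-1").decode("utf-8")


raw = b"H\xc3\xbarin"                # UTF-8 bytes as received
misdecoded = raw.decode("latin-1")   # what a wrong declared encoding yields
print(to_text(raw) == to_text(misdecoded))
```

Unlike a guarded version, this raises if the input does not match the assumption, which is useful while you are still working out where the encoding goes wrong.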
