Python HTMLParser: UnicodeDecodeError

Question

I'm using HTMLParser to parse pages I pull down with urllib, and am coming across UnicodeDecodeError exceptions when passing some to HTMLParser.

I tried using chardet to detect the encoding and then convert to ASCII or UTF-8 (the docs don't seem to say which it should be). Lossiness is acceptable, but while the decode/encode lines work just fine, I always get the error after self.feed().

The information is there if I just print it out.

from HTMLParser import HTMLParser
import urllib
import chardet

class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
            page = urllib.urlopen(query + search).read()
            try:
                self.feed(page)
            except UnicodeDecodeError:
                encoding = chardet.detect(page)['encoding']
                if encoding != 'unicode':
                    page = page.decode(encoding)
                    page = page.encode('ascii', 'ignore')
                self.feed(page)
                print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids

Here's the output:

Traceback (most recent call last):
  File "test.py", line 27, in <module>
    results = search_youtube(searches)
  File "test.py", line 23, in __init__
    self.feed(page)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Lennart Regebro · Accepted Answer · 2011-01-25 06:37:14Z

18

It is UTF-8, indeed. This works:

from HTMLParser import HTMLParser
import urllib

class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
            connection = urllib.urlopen(query + search)
            encoding = connection.headers.getparam('charset')
            page = connection.read().decode(encoding)
            self.feed(page)
            print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids

You don't need chardet. YouTube are not morons; they actually send the correct encoding in the Content-Type header.
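The same idea, decoding with the charset from the header before calling feed(), can be sketched without a network connection. This is only an illustration under assumptions: decode_page and TitleCollector are hypothetical names, the sample bytes are made up, and email.message.Message stands in for the header object that urllib returns in the answer's code. The try/except import is there so the sketch runs on Python 2 or Python 3.

```python
from email.message import Message

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


def decode_page(raw, content_type, fallback='utf-8'):
    """Decode response bytes using the charset named in a Content-Type value.

    Hypothetical helper: falls back to `fallback` when no charset is given.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset() or fallback
    # 'replace' substitutes U+FFFD for malformed bytes instead of raising.
    return raw.decode(charset, 'replace')


class TitleCollector(HTMLParser):
    """Hypothetical parser that collects the title attribute of <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.titles = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == 'a' and name == 'title':
                self.titles.append(value)


# Made-up sample: UTF-8 bytes containing non-ASCII text and an entity.
raw = u'<a title="T\u00e9l\u00e9popmusik &amp; Breathe">x</a>'.encode('utf-8')
text = decode_page(raw, 'text/html; charset=utf-8')

parser = TitleCollector()
parser.feed(text)  # feed() receives text, never raw bytes
```

The parser only ever sees decoded text, so the entity unescaping on attribute values (the step that fails in the traceback) is never asked to combine bytes with text.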


John Machin · Answer · 2011-01-25 05:19:57Z

1

What encoding does chardet say it is?

Please explain "The information is there if I just print it out": what is "it"? If you can read it and it makes sense when you print it to your console, then it must be in the usual/default encoding for your system; what is that? What operating system? What locale?

Can you give us a typical URL to make a query so that we can inspect for ourselves what you are seeing?

At one place in your code, you decode your output, then immediately smash it by using .encode('ascii', 'ignore'); why?
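To illustrate that last point, here is a small sketch of what the decode-then-encode('ascii', 'ignore') pair does to non-ASCII text. The sample string is an assumption chosen for the example, not data taken from the page in the question.

```python
# Made-up sample text containing two non-ASCII characters.
title = u'T\u00e9l\u00e9popmusik'
utf8_bytes = title.encode('utf-8')

# Decoding recovers the original text without loss.
round_trip = utf8_bytes.decode('utf-8')

# Re-encoding to ASCII with 'ignore' silently drops every non-ASCII
# character and turns the text back into a byte string.
ignored = round_trip.encode('ascii', 'ignore')

# 'replace' at least leaves a visible marker where characters were lost.
replaced = round_trip.encode('ascii', 'replace')
```

Here ignored holds the bytes for "Tlpopmusik" and replaced holds "T?l?popmusik"; in both cases the accented characters are gone, and the result is bytes again, not text.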


4 Comments

Nona Urbiz Over a year ago

The code I posted includes a sample URL. chardet says the sample URL is utf-8, but when using the program, other encodings are encountered (they all give the same Unicode error). I can read it and it makes sense when it prints to my console. Ubuntu 10.10 is my OS. I have no reasoning for the decode/encode. I'm struggling to understand this and have found numerous conflicting suggestions through Google, that being one of them verbatim (I don't remember from where). Thank you for your help. P.S. page.decode('utf-8'); self.feed(page) gives the same error.

William Over a year ago

Just to clarify, you have tried page = page.decode('utf-8'); self.feed(page)?

Nona Urbiz Over a year ago

Yes, I have. It gave me the same error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 16688: ordinal not in range(128)

Nona Urbiz Over a year ago

For the record, chardet.detect(page)['confidence'] == 0.98999999999999999
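A note on the two spellings that appear in the comments, page.decode('utf-8') and page = page.decode('utf-8'): decode() returns a new object and does not convert the byte string in place, so the first form leaves page unchanged. Whether that is what happened in the asker's script is an assumption; the sketch below only shows the difference, using a made-up sample string.

```python
# Made-up sample standing in for the downloaded page.
page = u'T\u00e9l\u00e9popmusik'.encode('utf-8')

# The return value is discarded, so page is still a byte string.
page.decode('utf-8')
unchanged = isinstance(page, bytes)

# Rebinding the name is what actually replaces the bytes with text.
page = page.decode('utf-8')
now_text = not isinstance(page, bytes)
```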
