UnicodeDecodeError Python Error

Question

I'm trying to write a Python script that queries Google, and I'm running into some Unicode issues. My really basic PoC so far is:

#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
query = "filetype%3Apdf"
url = "http://www.google.com/search?sclient=psy-ab&hl=en&site=&source=hp&q="+query+"&btnG=Search"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
data = response.read()
data = data.decode('UTF-8', 'ignore')
data = data.encode('UTF-8', 'ignore')
soup = BeautifulSoup(data)
print u""+soup.prettify('UTF-8')

My traceback is:

Traceback (most recent call last):
  File "./google.py", line 22, in <module>
    print u""+soup.prettify('UTF-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 48786: ordinal not in range(128)

Any ideas?

Martijn Pieters · Accepted Answer · 2012-09-15 13:01:01Z

You are converting your soup tree to UTF-8 (an encoded byte string), then trying to concatenate the result with an empty u"" unicode string.

Python 2 will automatically try to decode the byte string using the default encoding, which is ASCII, and that fails on the non-ASCII bytes in the UTF-8 data.
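
The implicit step can be reproduced by hand. This is a minimal sketch, assuming a small hard-coded byte string in place of the real page data; the names `encoded` and `decode_as_ascii` and the sample character are illustrative and do not come from the question. It performs explicitly the ASCII decode that Python 2 attempts behind the scenes:

```python
# Hypothetical sample data: a non-breaking space (U+00A0) encoded as UTF-8.
# The encoded form starts with byte 0xc2, the byte named in the traceback.
encoded = u"\u00a0".encode("UTF-8")


def decode_as_ascii(byte_string):
    """Do explicitly what Python 2 does implicitly for u"" + byte_string."""
    try:
        return byte_string.decode("ascii")
    except UnicodeDecodeError as exc:
        # The ascii codec rejects any byte with a value above 127.
        return "failed: " + exc.reason


result = decode_as_ascii(encoded)
```

Pure ASCII bytes pass through this decode unchanged, which is why code like this often appears to work until the first non-ASCII character shows up in the data.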

You need to explicitly decode the prettify() output:

print u"" + soup.prettify('UTF-8').decode('UTF-8')
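
The same fix in isolation, as a sketch that runs without a network connection or BeautifulSoup. The `page_bytes` value is an assumed stand-in for the output of `soup.prettify('UTF-8')`; the point is only that decoding with the codec the bytes were encoded in yields a unicode string that concatenates cleanly:

```python
# Stand-in for soup.prettify('UTF-8'): UTF-8 encoded bytes (hypothetical sample).
page_bytes = u"<p>caf\u00e9</p>".encode("UTF-8")

# Decode with the same codec the bytes were encoded with, then concatenate.
text = u"" + page_bytes.decode("UTF-8")
```

Because `decode` already returns a unicode string, the `u"" +` prefix is redundant at that point and could be dropped.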

The Python Unicode HOWTO explains this in more detail, including how default encodings work. I really, really recommend you read Joel Spolsky's article on Unicode as well.
