Error while using urllib.request.urlopen in Python

Question

What's wrong with this code?

>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
       print(line.decode("utf-8"))


<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=windows-1251"><title>Google</title><script>window.google={kEI:"XMECT7XyDcGn0AWFk7ywAQ",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},https:function(){return window.location.protocol=="https:"},kEXPI:"33492,35300",kCSI:{e:"33492,35300",ei:"XMECT7XyDcGn0AWFk7ywAQ"},authuser:0,

ml:function(){},kHL:"uk",time:function(){return(new Date).getTime()},log:function(a,b,c,e){var d=new Image,g=google,h=g.lc,f=g.li,j="";d.onerror=(d.onload=(d.onabort=function(){delete h[f]}));h[f]=d;if(!c&&b.search("&ei=")==-1)j="&ei="+google.getEI(e);var i=c||"/gen_204?atyp=i&ct="+a+"&cad="+b+j+"&zx="+google.time(),k=/^http:/i;if(k.test(i)&&google.https()){google.ml(new Error("GLMM"),false,{src:i});

delete h[f];return}d.src=i;g.li=f+1},lc:[],li:0,Toolbelt:{},y:{},x:function(a,b){google.y[a.id]=

[a,b];return false}};

window.google.sn="webhp";window.google.timers={};window.google.startTick=function(a,b){window.google.timers[a]={t:{start:(new Date).getTime()},bfr:!(!b)}};window.google.tick=function(a,b,c){if(!window.google.timers[a])google.startTick(a);window.google.timers[a].t[b]=c||(new Date).getTime()};google.startTick("load",true);try{}catch(u){}

var _gjwl=location;function _gjuc(){var e=_gjwl.href.indexOf("#");if(e>=0){var a=_gjwl.href.substring(e);if(a.indexOf("&q=")>0||a.indexOf("#q=")>=0){a=a.substring(1);if(a.indexOf("#")==-1){for(var c=0;c<a.length;){var d=c;if(a.charAt(d)=="&")++d;var b=a.indexOf("&",d);if(b==-1)b=a.length;var f=a.substring(d,b);if(f.indexOf("fp=")==0){a=a.substring(0,c)+a.substring(b,a.length);b=c}else if(f=="cad=h")return 0;c=b}_gjwl.href="/search?"+a+"&cad=h";return 1}}}return 0}function _gjp(){!(window._gjwl.hash&&

window._gjuc())&&setTimeout(_gjp,500)};

Traceback (most recent call last):
  File "<pyshell#109>", line 2, in <module>
    print(line.decode("utf-8"))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 2364: invalid continuation byte

Sergey · Accepted Answer · 2012-01-03 10:33:43Z

6

Google sends the page in windows-1251 encoding, as the charset in the meta tag says. Decoding with that codec works:

>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
       print(line.decode("cp1251"))
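Hard-coding cp1251 only fits this particular response. A minimal sketch of a more general approach, assuming the server declares the charset in its Content-Type header (the meta tag inside the HTML is not inspected here): the response headers object offers `get_content_charset()`, inherited from `email.message.Message`. The names `fetch_text` and `fallback` are hypothetical, chosen only for illustration.

```python
from urllib.request import urlopen


def fetch_text(url, fallback="utf-8"):
    """Fetch url and decode the body with the charset the response declares.

    Assumption: the charset appears in the Content-Type header. If it does
    not, `fallback` is used, and decoding may still raise UnicodeDecodeError.
    """
    with urlopen(url) as response:
        # Returns e.g. 'windows-1251', or None when no charset is declared.
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset)
```

Usage would look like `print(fetch_text("http://google.com/"))`. Reading the whole body at once, rather than decoding line by line, also avoids repeating the codec choice inside the loop.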

edited Jan 3, 2012 at 10:33

Sergey

answered Jan 3, 2012 at 9:09

demalexx


joaquin · Accepted Answer · 2012-01-03 09:29:56Z

2

That's your failing line (last part of it):

>>> line
b'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Im\xe1genes</a>'
>>> line.decode()
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    line.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 62: invalid continuation byte

The failing byte comes from a Spanish word with an accented character:

>>> bite = 0xe1
>>> bite
225
>>> chr(225)
'á'

Decoding with latin-1 instead works:

>>> line.decode('latin-1')
'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Imágenes</a>'

By the way, "Imágenes" is Spanish for "images".
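To see why the codec matters, here is a small sketch that decodes the same bytes three ways. The sample bytes are a shortened stand-in for the failing line, not the full response; the variable names are illustrative only.

```python
# Shortened stand-in for the failing line: 'Im', the byte 0xe1, then 'genes'.
raw = b"Im\xe1genes"

# In UTF-8, 0xe1 starts a three-byte sequence, but the next byte ('g') is
# not a continuation byte, so strict decoding fails.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Single-byte codecs always map 0xe1 to some character, but not the same one.
as_latin1 = raw.decode("latin-1")  # 0xe1 is 'á' in latin-1
as_cp1251 = raw.decode("cp1251")   # 0xe1 is Cyrillic 'б' in cp1251

print(error)
print(as_latin1, as_cp1251)
```

A single-byte codec never raises on this byte, so picking the wrong one silently produces wrong text instead of an error.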

edited Jan 3, 2012 at 9:29

answered Jan 3, 2012 at 9:13

joaquin

2 Comments

demalexx Over a year ago

It seems Google returns a localized page depending on the IP address. For me it's Russian with cp1251 encoding; for you it's Spanish with latin-1.

joaquin Over a year ago

@race1 Oh, I see! Interesting... I was fooled because my error was at position 2419, right after the same line the OP posted, but the OP's error is at 2364. So our answers coincide only by coincidence, don't they?
