3

What's wrong with this code?

>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
       print(line.decode("utf-8"))


<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=windows-1251"><title>Google</title><script>window.google={kEI:"XMECT7XyDcGn0AWFk7ywAQ",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},https:function(){return window.location.protocol=="https:"},kEXPI:"33492,35300",kCSI:{e:"33492,35300",ei:"XMECT7XyDcGn0AWFk7ywAQ"},authuser:0,

ml:function(){},kHL:"uk",time:function(){return(new Date).getTime()},log:function(a,b,c,e){var d=new Image,g=google,h=g.lc,f=g.li,j="";d.onerror=(d.onload=(d.onabort=function(){delete h[f]}));h[f]=d;if(!c&&b.search("&ei=")==-1)j="&ei="+google.getEI(e);var i=c||"/gen_204?atyp=i&ct="+a+"&cad="+b+j+"&zx="+google.time(),k=/^http:/i;if(k.test(i)&&google.https()){google.ml(new Error("GLMM"),false,{src:i});

delete h[f];return}d.src=i;g.li=f+1},lc:[],li:0,Toolbelt:{},y:{},x:function(a,b){google.y[a.id]=

[a,b];return false}};

window.google.sn="webhp";window.google.timers={};window.google.startTick=function(a,b){window.google.timers[a]={t:{start:(new Date).getTime()},bfr:!(!b)}};window.google.tick=function(a,b,c){if(!window.google.timers[a])google.startTick(a);window.google.timers[a].t[b]=c||(new Date).getTime()};google.startTick("load",true);try{}catch(u){}

var _gjwl=location;function _gjuc(){var e=_gjwl.href.indexOf("#");if(e>=0){var a=_gjwl.href.substring(e);if(a.indexOf("&q=")>0||a.indexOf("#q=")>=0){a=a.substring(1);if(a.indexOf("#")==-1){for(var c=0;c<a.length;){var d=c;if(a.charAt(d)=="&")++d;var b=a.indexOf("&",d);if(b==-1)b=a.length;var f=a.substring(d,b);if(f.indexOf("fp=")==0){a=a.substring(0,c)+a.substring(b,a.length);b=c}else if(f=="cad=h")return 0;c=b}_gjwl.href="/search?"+a+"&cad=h";return 1}}}return 0}function _gjp(){!(window._gjwl.hash&&

window._gjuc())&&setTimeout(_gjp,500)};

Traceback (most recent call last):
  File "<pyshell#109>", line 2, in <module>
    print(line.decode("utf-8"))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 2364: invalid continuation byte
0

2 Answers 2

6

Google sends you text in windows-1251 encoding, it says it in meta tag. This will work:

>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
       print(line.decode("cp1251"))
Sign up to request clarification or add additional context in comments.

Comments

2

That's your failing line (last part of it):

>>> line
b'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Im\xe1genes</a>'
>>> line.decode()
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    line.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 62: invalid continuation byte

The failing code is from a spanish word that has accent:

>>> bite = 0xe1
>>> bite
225
>>> chr(225)
'á'

You will be ok with latins decoding accordingly:

>>> line.decode('latin-1')
'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Imágenes</a>'

btw, Imágenes is spanish images

2 Comments

Seems Google returns localized page depending on IP. For me it's Russian and cp1251 encoding. For you it's Spanish and latin-1.
@race1 Oh I see! Interesting... I was fooled because my error was at pos 2419 after the same line the OP posted. But the one of the OP is at 2364... These are coincident answers by coincidence, arent they?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.