Python urllib2 parse html problem

Question

I am using mechanize to parse the HTML of a website, but with this website I get a strange result.

from mechanize import Browser
br = Browser()
r = br.open("http://www.heavenplaza.com")
result = r.read()

The result is something I cannot understand. You can see it here: http://paste2.org/p/1556077

Does anyone have a method to get that website's HTML, with mechanize or urllib?

Thanks

Please post the result in the answer rather than in a pastebin. Especially when the result is one-line long!
– senderle, Commented Aug 1, 2011 at 13:47

ksn · Accepted Answer · 2011-08-01 13:52:58Z

1

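import urllib2, StringIO, gzip

# The server sends a gzip-compressed body, so read the raw bytes first.
f = urllib2.urlopen("http://www.heavenplaza.com")
data = StringIO.StringIO(f.read())
# Wrap the bytes in a file-like object and let GzipFile decompress them.
gzipper = gzip.GzipFile(fileobj=data)
print gzipper.read()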
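
For anyone on Python 3, where urllib2 and StringIO no longer exist, a rough equivalent might look like the sketch below. The helper names `maybe_gunzip` and `fetch_html` are made up for illustration, and the magic-byte check is an added assumption (it only decompresses when the body actually starts with the gzip signature), not something the snippet above does.

```python
import gzip
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes

def maybe_gunzip(body):
    """Decompress body if it looks like gzip; otherwise return it unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body

def fetch_html(url):
    """Fetch url and return the (possibly decompressed) body as bytes."""
    with urllib.request.urlopen(url) as response:
        return maybe_gunzip(response.read())
```

Calling `fetch_html("http://www.heavenplaza.com")` should then return the page as bytes, assuming the site still responds the way it did when this was asked.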


Mikko Ohtamaa · Answer · 2011-08-01 13:47:30Z

1

I quickly checked the script in the console and the site was returning crap. You probably need to spoof your HTTP User-Agent so that the site doesn't think you are a robot.

The same script works with http://www.google.com.
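
A minimal sketch of what spoofing the User-Agent could look like with the Python 3 standard library. `build_request` is a hypothetical helper name, and the User-Agent value is whatever browser string you choose to pass in; nothing here is specific to this site.

```python
import urllib.request

def build_request(url, user_agent):
    """Build a request that carries a browser-like User-Agent header."""
    # Request stores header names capitalized, e.g. "User-agent".
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The returned object can be passed to `urllib.request.urlopen` in place of the bare URL. Whether this changes what the site returns depends on the site.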


2 Comments

kairyu Over a year ago

This is my User-Agent: br.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.17) Gecko/20110420 Firefox/3.6.17')] and it does not work either.

Mikko Ohtamaa Over a year ago

Based on the reply above, the site does not correctly honour/use the Accept-Encoding gzip headers.
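
One way to check for that kind of mismatch, sketched in Python 3 under some assumptions: `gzip_mismatch` is a hypothetical helper, its arguments are the Accept-Encoding value you sent, the Content-Encoding value you got back, and the raw body, and the header matching is a plain substring test that ignores q-values.

```python
GZIP_MAGIC = b"\x1f\x8b"  # signature at the start of a gzip stream

def gzip_mismatch(accept_encoding, content_encoding, body):
    """Return True if body is gzip but was not both requested and labelled."""
    is_gzip = body[:2] == GZIP_MAGIC
    # Simple substring checks; "gzip;q=0" would be misread as accepting gzip.
    asked = "gzip" in (accept_encoding or "").lower()
    labelled = "gzip" in (content_encoding or "").lower()
    return is_gzip and not (asked and labelled)
```

A True result for a request sent without Accept-Encoding would match the behaviour described here: compressed data arriving at a client that never asked for it.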
