
I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab the HTML for various websites until I tried this:

>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')

Instead of the basic HTML that is the source for this page, I get:

>>> r.text
'\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]...

>>> r.content
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...

I have tried many combinations of get/post with every syntax I could glean from the documentation, from SO, and from other examples. I don't understand what I am seeing above, haven't been able to turn it into anything I can read, and can't figure out how to get what I actually want. My question is: how do I get the HTML for the above page?

  • Seems to work here, just tried it with the exact URL on Python 2.7. Commented Jan 6, 2015 at 17:04
  • test = html.fromstring(r.text) (presumably lxml's html module; see the sketch after these comments) Commented Jan 6, 2015 at 17:04
  • 1
    Id highly recommend BeautifulSoup for web scraping beautiful-soup-4.readthedocs.org/en/latest/#. It will make your life a heck of a lot easier. Commented Jan 6, 2015 at 17:04
  • You can use urllib3 too; it is similar to requests. Commented Jan 6, 2015 at 17:12
  • @vikasdumca: requests is built on top of urllib3. The problem is the server here, however. Commented Jan 6, 2015 at 17:29
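(A minimal sketch of the html.fromstring approach from the second comment, assuming it refers to lxml's lxml.html module; example.com stands in for any page that returns plain HTML:)

import requests
from lxml import html  # pip install lxml

r = requests.get('http://example.com')
tree = html.fromstring(r.text)    # parse the HTML into an element tree
print(tree.findtext('.//title'))  # -> Example Domain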

4 Answers


The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
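As a purely hypothetical sketch of that failure mode (a guess at the script's behaviour, not its actual source), such a CGI script would look something like this:

#!/usr/bin/env python
# Hypothetical CGI script: it never prints its own headers. The blank
# line after the doctype ends what the server parses as the header
# block, so the doctype line gets mixed into the real HTTP headers.
print('<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
      '"DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">')
print()  # the stray double newline described above
print('<head>...rest of the page...</head>')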

Because the Content-Encoding header is lost that way, requests doesn't detect that the data is gzip-compressed either. The data is all there; you just have to decode it yourself (one way is sketched below). Or you could, if the response weren't also rather incomplete.
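A minimal sketch of that manual route, using the zlib call mentioned in the comments below (the 16 + zlib.MAX_WBITS argument tells zlib to expect a gzip header and trailer):

import zlib
import requests

url = 'http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F'
r = requests.get(url)

# Decompress the gzip-wrapped body ourselves; zlib raises zlib.error
# if the stream turns out to be truncated.
page = zlib.decompress(r.content, 16 + zlib.MAX_WBITS).decode('utf-8')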

The work-around is to tell the server not to bother with compression:

import requests

url = 'http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F'
headers = {'Accept-Encoding': 'identity'}  # ask for an uncompressed body
r = requests.get(url, headers=headers)

and an uncompressed response is returned.
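You can verify the workaround by inspecting the response (a quick sanity check, not specific to this server):

# With identity encoding the body is sent uncompressed, so there is
# no Content-Encoding header for the broken doctype line to swallow.
print(r.headers.get('Content-Encoding'))  # None
print(r.text[:60])                        # now starts with readable HTML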

Incidentally, on Python 2 the HTTP header parser is not so strict and happily treats the doctype line as a header:

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

and the content-encoding information survives, so on Python 2 requests decodes the content for you, as expected.


5 Comments

Yep, it is a Python 3 problem. Works perfectly every time using Python 2.
@PadraicCunningham: no, it is a server problem. Python 2 just happens to not validate the header properly. It works in Python 2 but you get the <!DOCTYPE...> line as a header.
@MartijnPieters: It turns out that when I use the workaround, the response content is corrupted by occasional extra characters starting with the data for 1934. Based on your explanation, I instead decompressed the response content with zlib.decompress(r.content, 16+zlib.MAX_WBITS), which seems to have handled all issues.
FYI, the HTTP headers have now been fixed for this URL. I apologize for the error.
@Grant: :-D No need to apologise to me though.

The HTTP headers for this URL have now been fixed, so requests sees the Content-Encoding: gzip header and transparently decompresses the body:

>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}



Here is an example using the BeautifulSoup library. It "makes it easy to scrape information from web pages."

from bs4 import BeautifulSoup
import requests

# request the web page
resp = requests.get("http://example.com")

# get the response text; in this case it is HTML
html = resp.text

# parse the HTML
soup = BeautifulSoup(html, "html.parser")

# print the page's text content
print(soup.body.get_text().strip())

and the result:

Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...
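Applied to the URL from the question (assuming the now-fixed headers mentioned in the answers above), a minimal sketch along the same lines:

import requests
from bs4 import BeautifulSoup

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py?'
       'sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# The page title seen earlier: "Monthly Average of Precipitation, Station id: 028815"
print(soup.title.get_text())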



I'd solve the problem in a simpler way: just import the html library to decode HTML special characters:

import html
import requests

r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
print(html.unescape(r.text))
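For reference, html.unescape only converts HTML character references back to literal characters; a stdlib-only example:

import html

# &lt;, &gt; and &amp; become <, > and &
print(html.unescape('&lt;b&gt;rain &amp; snow&lt;/b&gt;'))  # <b>rain & snow</b>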

