Read value from web page using python

Question

I am trying to read a value from an HTML page into a variable in a Python script. I have already figured out how to download the page to a local file using urllib, and I could extract the value with a bash script, but I would like to try it in Python.

import urllib
urllib.urlretrieve('http://url.com', 'page.htm')

The page has this in it:

<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>

I need the 17.4 value from the Wired: line

Any suggestions?

Thanks

Martijn Pieters · Accepted Answer · 2013-10-06 00:02:01Z

4

Start with not using urlretrieve(); you want the data, not a file.

Next, use an HTML parser. BeautifulSoup is great for extracting text from HTML.

Retrieving the page with urllib2 would be:

from urllib2 import urlopen

response = urlopen('http://url.com/')

then read the data into BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))

The from_encoding part there will tell BeautifulSoup what encoding the web server told you to use for the page; if the web server did not specify this then BeautifulSoup will make an educated guess for you.
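On Python 3 the headers object no longer has getparam(); urlopen() lives in urllib.request, and the response headers are an email.message.Message subclass whose get_content_charset() method plays the same role. A minimal sketch of that lookup follows. The hand-built Message and its Content-Type value are assumptions standing in for response.headers, so the snippet runs without a network call:

```python
from email.message import Message

# Stand-in for response.headers as returned by urllib.request.urlopen().
# Assumption: the server sent this Content-Type header.
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'

# Python 3 counterpart of headers.getparam('charset').
charset = headers.get_content_charset()
print(charset)

# With no charset declared the result is None, which can still be passed
# as from_encoding so that BeautifulSoup falls back to guessing.
no_charset = Message().get_content_charset()
print(no_charset)
```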

Now you can search for your data:

for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
    if 'Wired:' in line:
        value = float(line.partition('Wired:')[2])
        print value

For your demo HTML snippet that gives:

>>> for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
...     if 'Wired:' in line:
...         value = float(line.partition('Wired:')[2])
...         print value
... 
17.4
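If installing BeautifulSoup is not an option, roughly the same search can be done with the standard library's html.parser on Python 3. This is a swapped-in technique, not the approach shown above, and the class name DivTextCollector is made up for illustration. It collects the stripped text fragments inside the div named mainbody, much like stripped_strings does:

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect stripped text found inside <div name="..."> (hypothetical helper)."""

    def __init__(self, div_name):
        super().__init__()
        self.div_name = div_name
        self.depth = 0       # > 0 while inside the target div
        self.strings = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.depth > 0:
                self.depth += 1          # nested div inside the target
            elif dict(attrs).get('name') == self.div_name:
                self.depth = 1           # entering the target div

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.strings.append(text)


# The demo snippet from the question.
html = '''<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>'''

parser = DivTextCollector('mainbody')
parser.feed(html)
parser.close()

value = None
for text in parser.strings:
    if 'Wired:' in text:
        value = float(text.partition('Wired:')[2])
print(value)
```

BeautifulSoup remains the more forgiving choice for messy real-world markup; this sketch only assumes HTML as tidy as the demo snippet.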

edited Oct 6, 2013 at 0:02

answered Oct 4, 2013 at 7:05

Martijn Pieters


1 Comment

Martijn Pieters Over a year ago

@beroe: The function the OP used has the signature urlretrieve(url, filename); page.html is the filename the page was stored at, not part of the URL.

Adelmar · Accepted Answer · 2015-04-14 06:20:43Z

4

This is called web scraping, and there is a very popular Python library for it called Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you'd rather stick with urllib/urllib2 alone, you can extract the value using regular expressions:

http://docs.python.org/2/library/re.html

Using regex, you basically use the surrounding context of your desired value as the key, then strip the key away. So in this case you might match from "Wired: " to the next newline character, then strip away the "Wired: " and the newline character.
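A minimal sketch of that idea, written for Python 3 and run against the demo snippet from the question. It varies the description slightly: instead of matching through to the newline and stripping the key afterwards, it uses a capture group so that only the number is kept. The pattern itself is an assumption about what the value looks like (an optionally signed decimal number):

```python
import re

# Part of the demo snippet from the question.
html = '''<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>Rainfall: 0.2<br/>
</div>'''

# "Wired:" is the key; the group captures the number that follows it.
match = re.search(r'Wired:\s*([-+]?\d+(?:\.\d+)?)', html)

value = None
if match:
    value = float(match.group(1))
print(value)
```

If the page ever reports something like "Error" on that line, as it does for Soil Temp, the pattern simply does not match and value stays None.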

edited Apr 14, 2015 at 6:20

answered Oct 4, 2013 at 7:13

Adelmar


Steve Barnes · Accepted Answer · 2013-10-04 07:05:05Z

0

You can run through the file line by line, using find or a regular expression to check for the value(s) you need, or you can consider using Scrapy to retrieve and parse the page.
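The line-by-line approach with str.find could look roughly like the sketch below (Python 3). The function name read_labelled_value is hypothetical, and the code assumes that the number is the only thing following the label on its line; a line such as "Soil Temp : Error" would need extra handling:

```python
def read_labelled_value(lines, label):
    """Return the float that follows `label` on the first matching line, else None."""
    for line in lines:
        pos = line.find(label)          # -1 when the label is not on this line
        if pos != -1:
            return float(line[pos + len(label):].strip())
    return None


# Lines as they would come out of the saved page (taken from the demo snippet).
page_lines = [
    '<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>\n',
    '<br/> Wired: 17.4\n',
    '<br/>P10 Chard: 16.7\n',
]

wired = read_labelled_value(page_lines, 'Wired:')
print(wired)
```

Because an open file iterates over its lines, the same function can be handed the file saved by urlretrieve directly, for example read_labelled_value(open('page.htm'), 'Wired:').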

answered Oct 4, 2013 at 7:05

Steve Barnes
