I am trying to read a value from an HTML page into a variable in a Python script. I have already figured out how to download the page to a local file using urllib, and I could extract the value with a bash script, but I would like to do it in Python.

import urllib
urllib.urlretrieve('http://url.com', 'page.htm')

The page has this in it:

<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>

I need the 17.4 value from the "Wired:" line.

Any suggestions?

Thanks

3 Answers

4

Start with not using urlretrieve(); you want the data, not a file.

Next, use an HTML parser. BeautifulSoup is great for extracting text from HTML.

Retrieving the page with urllib2 would be:

from urllib2 import urlopen

response = urlopen('http://url.com/')

Then read the data into BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.read(), from_encoding=response.headers.getparam('charset'))

The from_encoding argument tells BeautifulSoup which encoding the web server declared for the page; if the server did not specify one, BeautifulSoup makes an educated guess for you.
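The getparam('charset') call above is Python 2 specific. As a rough sketch of the Python 3 equivalent: there, the headers object returned by urllib.request.urlopen() is an email.message.Message subclass, so get_content_charset() should give the same information. The declared_charset helper and the hand-built header objects below are hypothetical stand-ins so the example runs without a network connection.

```python
from email.message import Message


def declared_charset(headers):
    """Return the charset named in the Content-Type header, or None if absent."""
    return headers.get_content_charset()


# Stand-ins for response.headers, built by hand so the sketch runs offline.
with_charset = Message()
with_charset['Content-Type'] = 'text/html; charset=ISO-8859-1'

without_charset = Message()
without_charset['Content-Type'] = 'text/html'

print(declared_charset(with_charset))     # iso-8859-1 (the value is lowercased)
print(declared_charset(without_charset))  # None
```

A None result corresponds to the case where the server did not specify an encoding and BeautifulSoup is left to guess.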

Now you can search for your data:

for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
    if 'Wired:' in line:
        value = float(line.partition('Wired:')[2])
        print value

For your demo HTML snippet that gives:

>>> for line in soup.find('div', {'name': 'mainbody'}).stripped_strings:
...     if 'Wired:' in line:
...         value = float(line.partition('Wired:')[2])
...         print value
... 
17.4
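If installing BeautifulSoup is not an option, roughly the same loop can be approximated with the standard library's html.parser. This is a Python 3 sketch, not the BeautifulSoup approach itself: TextCollector and wired_value are hypothetical names, and unlike the soup.find() call it scans every text node rather than only those inside the mainbody div.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
PAGE = """<div name="mainbody" style="font-size: x-large;margin:auto;width:33;">
<b><a href="w.cgi?hsn=10543">Plateau (19:01)</a></b>
<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
<br/>P20 Pinot Noir: 15.8-
<br/>Soil Temp : Error
<br/>Rainfall: 0.2<br/>
</div>"""


class TextCollector(HTMLParser):
    """Collect non-empty, whitespace-stripped text chunks (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def wired_value(html):
    """Return the number after 'Wired:' as a float, or None if it is not found."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    for chunk in collector.chunks:
        if 'Wired:' in chunk:
            # Same partition() trick as above: keep what follows the key.
            return float(chunk.partition('Wired:')[2])
    return None


print(wired_value(PAGE))  # 17.4
```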

1 Comment

@beroe: The function the OP used has the signature urlretrieve(url, filename); page.htm is the filename the page was stored under, not part of the URL.
4

This is called web scraping, and there's a very popular Python library for it called Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you'd rather do it with just urllib/urllib2, you can extract the value using regular expressions:

http://docs.python.org/2/library/re.html

Using regex, you basically use the surrounding context of your desired value as the key, then strip the key away. So in this case you might match from "Wired: " to the next newline character, then strip away the "Wired: " and the newline character.
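A minimal Python 3 sketch of that idea, run against the sample text from the question. The pattern and the extract_wired name are assumptions for illustration. Instead of matching up to the newline and stripping the key afterwards, it uses a capture group to pull out just the number, which amounts to the same thing in one step.

```python
import re

# Sample text copied from the question's page.
PAGE = """<br/> Wired: 17.4
<br/>P10 Chard: 16.7
<br/>P1 P. Gris: 17.1
"""

# Hypothetical pattern: the literal key, optional whitespace, then a decimal number.
WIRED_RE = re.compile(r'Wired:\s*(-?\d+(?:\.\d+)?)')


def extract_wired(html):
    """Return the number after 'Wired:' as a float, or None if there is no match."""
    match = WIRED_RE.search(html)
    return float(match.group(1)) if match else None


print(extract_wired(PAGE))  # 17.4
```

Because the capture group only accepts digits, a line such as "Soil Temp : Error" would simply not match a similar pattern rather than raising an exception.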

0

You can run through the file line by line, using find or a regular expression to check for the value(s) you need, or you can consider using Scrapy to retrieve and parse the page.
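The line-by-line approach with str.find might look like the sketch below (Python 3). The find_wired name is hypothetical, and the code assumes that nothing but whitespace follows the number on the matching line; a value like "15.8-" would make float() raise ValueError.

```python
def find_wired(lines, key='Wired:'):
    """Return the number following key on the first matching line, else None."""
    for line in lines:
        pos = line.find(key)  # -1 when the key is not on this line
        if pos != -1:
            # float() ignores the surrounding spaces and the trailing newline.
            return float(line[pos + len(key):])
    return None


# Two lines from the question's page, as they would be read from the file.
sample_lines = [
    '<br/> Wired: 17.4\n',
    '<br/>P10 Chard: 16.7\n',
]
print(find_wired(sample_lines))  # 17.4
```

Since the function accepts any iterable of lines, the file saved by urlretrieve() in the question could be passed directly, for example with open('page.htm') as fh: value = find_wired(fh).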
