
I'm trying to get hold of the data in the columns whose codes start with "SVENYXX", where "XX" is the two-digit number that follows (e.g. 01, 02, etc.), on the page http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html using Python. I am currently following the approach described at http://docs.python-guide.org/en/latest/scenarios/scrape/ . However, I don't know how to determine the divs (or other elements) to target on this page, so I am unable to proceed, and I was hoping to get some help with this.

This is what I have so far:

from lxml import html
import requests
page = requests.get('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html')
tree = html.fromstring(page.text)

Thank You

5 Comments
  • What do you have so far? Commented Jun 9, 2015 at 17:12
  • @PatrickRoberts Sorry, just added that. Commented Jun 9, 2015 at 17:16
  • Does it need to be Python? The page seems to be static, and if you simply copy/paste the table into a spreadsheet, you can easily extract the columns manually. That might be easier; processing HTML with XPath is not the easiest thing to conquer. Commented Jun 9, 2015 at 17:22
  • @GerardvanHelden Thank you. However, if the page is updated, can't I then simply re-download the data through my code? Is there an easier way to process HTML than using XPath? Commented Jun 9, 2015 at 17:31
  • If it is indeed dynamic then you do need some kind of scripting :) BeautifulSoup would have been my next recommendation. Commented Jun 12, 2015 at 18:23

1 Answer


Have you tried using BeautifulSoup? I'm a pretty big fan. Using that you can easily iterate through all of the info you want, searching by tag.

Here's something I threw together that prints out the values in each of the columns you're looking at. Not sure what you want to do with the data, but hopefully it helps.

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')  # specify a parser to avoid bs4's warning

# The yield data is in the third table on the page
desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    # th.string is None when a header cell contains nested tags, so guard against it
    if th.string and 'SVENY' in th.string:
        desired_columns.append(headers.index(th))

# Iterate through each row, grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    for column in desired_columns:
        print(cells[column].text)
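
If you'd rather stay with the lxml/XPath route from your question, an equivalent sketch would look something like this. The table index mirrors the findAll('table')[2] above and is an assumption about the page's markup, so adjust it if the headers printed don't look right:

from lxml import html
import requests

page = requests.get('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html')
tree = html.fromstring(page.text)

# Assumption: the data lives in the third <table>, as in the
# BeautifulSoup version above; adjust the index if it doesn't.
table = tree.xpath('//table')[2]
header_texts = [th.text_content().strip() for th in table.xpath('.//th')]
print(header_texts)  # inspect which entries contain 'SVENY'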

In response to your second request:

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')

desired_table = soup.findAll('table')[2]
data = {}

# Find the columns you want data from, keyed by header text
headers = desired_table.findAll('th')
for th in headers:
    if th.string and 'SVENY' in th.string:
        data[th.string] = {'column': headers.index(th), 'data': []}

# Iterate through each row, grabbing the date from the row's <th>
# and the values from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    date = row.findAll('th')[0].text
    cells = row.findAll('td')

    for header, info in data.items():
        column_number = info['column']
        cell_data = [date, cells[column_number].text]
        info['data'].append(cell_data)

This returns a dictionary where each key is the header of a column, and each value is another dictionary that holds 1) the column's position on the site, and 2) the actual data you want, as a list of [date, value] pairs.

As an example:

for year_number in data['SVENY01']['data']:
    print(year_number)

['2015-06-05', '0.3487']
['2015-06-04', '0.3124']
['2015-06-03', '0.3238']
['2015-06-02', '0.3040']
['2015-06-01', '0.3009']
['2015-05-29', '0.2957']
etc.

You can fiddle around with this to get the info how and where you want it, but hopefully this is helpful.
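
For instance, if you wanted to save each series to disk rather than print it, a minimal sketch (assuming CSV output is acceptable; the filenames are just illustrative) could reuse the data dictionary from above:

import csv

# Write one CSV per SVENY column, one [date, value] row per line
for header, info in data.items():
    with open(header + '.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['date', header])
        writer.writerows(info['data'])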


4 Comments

This is great! However, do you know how I could include the header and row titles in this? And also include all the rows?
Thanks again. Unfortunately, I get an error on the line date = row.findAll('th')[0].text, namely IndexError: list index out of range. I am using Python 2.7, though, so instead of request I am using import urllib2, then content = urllib2.urlopen(url).read() and soup = BeautifulSoup(content). Could this be the issue?
Hm, that's interesting. All that date line does is, for each row of the table, pull the data from the th tag, like this one on the site: <th scope="row">2015-06-05</th>. The index-out-of-range error implies to me that row.findAll('th') is returning an empty list, which is strange. Does this occur on the first iteration? A defensive workaround is sketched after these comments.
If you continue to have the problem, I'd suggest making a new question for it with more info, since I think it would constitute a separate issue.
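
As a stopgap for the IndexError discussed above, a minimal guard (untested, since the error isn't reproducible here) is to skip any row that has no th cell before reading the date. This is a drop-in replacement for the row loop in the second snippet, reusing its rows and data variables:

for row in rows[1:]:
    row_headers = row.findAll('th')
    if not row_headers:
        continue  # skip rows without a <th> date cell (e.g. parser quirks)
    date = row_headers[0].text
    cells = row.findAll('td')

    for header, info in data.items():
        column_number = info['column']
        info['data'].append([date, cells[column_number].text])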
