
I'm trying to get hold of the data in the columns whose codes start with "SVENYXX", where "XX" is the two-digit number that follows (e.g. 01, 02, etc.), on the page http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html using Python. I am currently following the approach described at http://docs.python-guide.org/en/latest/scenarios/scrape/ . However, I don't know how to determine the divs (or other elements) to target on this page, so I am unable to proceed, and I was hoping to get some help with this.

This is what I have so far:

from lxml import html
import requests
page = requests.get('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html')
tree = html.fromstring(page.text)

Thank You

5 Comments
  • What do you have so far? Commented Jun 9, 2015 at 17:12
  • @PatrickRoberts Sorry, just added that. Commented Jun 9, 2015 at 17:16
  • Does it need to be Python? The page seems to be static, and if you simply copy/paste the table into a spreadsheet, you can easily extract the columns manually. That might be easier; processing HTML with XPath is not the easiest thing to conquer. Commented Jun 9, 2015 at 17:22
  • @GerardvanHelden Thank you. However, if the page is updated, can't I then simply re-download the data through my code? Is there an easier way to process HTML than using XPath? Commented Jun 9, 2015 at 17:31
  • If it is indeed dynamic then you do need some kind of scripting :) BeautifulSoup would have been my next recommendation. Commented Jun 12, 2015 at 18:23

1 Answer


Have you tried using BeautifulSoup? I'm a pretty big fan. Using that you can easily iterate through all of the info you want, searching by tag.

Here's something I threw together that prints out the values in each of the columns you're looking at. Not sure what you want to do with the data, but hopefully it helps.

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')  # specify a parser to avoid bs4's warning

# The yield data is in the third table on the page
desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    # th.string is None when a header cell contains nested tags, so guard against it
    if th.string and 'SVENY' in th.string:
        desired_columns.append(headers.index(th))

# Iterate through each row, grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    for column in desired_columns:
        print(cells[column].text)
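
If you'd rather stay with the lxml/XPath route from your question, an equivalent sketch would look something like this. The table index mirrors the findAll('table')[2] above and is an assumption about the page's markup, so adjust it if the headers printed don't look right:

from lxml import html
import requests

page = requests.get('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html')
tree = html.fromstring(page.text)

# Assumption: the data lives in the third <table>, as in the
# BeautifulSoup version above; adjust the index if it doesn't.
table = tree.xpath('//table')[2]
header_texts = [th.text_content().strip() for th in table.xpath('.//th')]
print(header_texts)  # inspect which entries contain 'SVENY'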

In response to your second request:

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')

desired_table = soup.findAll('table')[2]
data = {}

# Find the columns you want data from, keyed by header text
headers = desired_table.findAll('th')
for th in headers:
    if th.string and 'SVENY' in th.string:
        data[th.string] = {'column': headers.index(th), 'data': []}

# Iterate through each row, grabbing the date from the row's <th>
# and the values from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    date = row.findAll('th')[0].text
    cells = row.findAll('td')

    for header, info in data.items():
        column_number = info['column']
        cell_data = [date, cells[column_number].text]
        info['data'].append(cell_data)

This returns a dictionary where each key is the header of a column, and each value is another dictionary that holds 1) the column's position on the site, and 2) the actual data you want, as a list of [date, value] pairs.

As an example:

for year_number in data['SVENY01']['data']:
    print(year_number)

['2015-06-05', '0.3487']
['2015-06-04', '0.3124']
['2015-06-03', '0.3238']
['2015-06-02', '0.3040']
['2015-06-01', '0.3009']
['2015-05-29', '0.2957']
etc.

You can fiddle around with this to get the info how and where you want it, but hopefully this is helpful.
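
For instance, if you wanted to save each series to disk rather than print it, a minimal sketch (assuming CSV output is acceptable; the filenames are just illustrative) could reuse the data dictionary from above:

import csv

# Write one CSV per SVENY column, one [date, value] row per line
for header, info in data.items():
    with open(header + '.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['date', header])
        writer.writerows(info['data'])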


4 Comments

This is great! However, do you know how I could include the header and row titles in this? And also include all the rows?
Thanks again. Unfortunately, I get an error on the line date = row.findAll('th')[0].text, namely IndexError: list index out of range. I am using Python 2.7, though, so instead of request I am using import urllib2, then content = urllib2.urlopen(url).read() and soup = BeautifulSoup(content). Could this be the issue?
Hm, that's interesting. All that date line does is, for each row of the table, pull the data from the th tag, like this one on the site: <th scope="row">2015-06-05</th>. The index-out-of-range error implies to me that row.findAll('th') is returning an empty list, which is strange. Does this occur on the first iteration? A defensive workaround is sketched after these comments.
If you continue to have the problem, I'd suggest making a new question for it with more info, since I think it would constitute a separate issue.
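
As a stopgap for the IndexError discussed above, a minimal guard (untested, since the error isn't reproducible here) is to skip any row that has no th cell before reading the date. This is a drop-in replacement for the row loop in the second snippet, reusing its rows and data variables:

for row in rows[1:]:
    row_headers = row.findAll('th')
    if not row_headers:
        continue  # skip rows without a <th> date cell (e.g. parser quirks)
    date = row_headers[0].text
    cells = row.findAll('td')

    for header, info in data.items():
        column_number = info['column']
        info['data'].append([date, cells[column_number].text])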
