
I've used Selenium to scrape a dynamic JavaScript table of federal employee position and salary info from http://www.fedsdatacenter.com/federal-pay-rates/index.php?n=&l=&a=SECURITIES+AND+EXCHANGE+COMMISSION&o=&y=all. (Note: it's all public-domain data, so no worries about personal information.)

I'm trying to get it into a Pandas DF for analysis. My problem is that my Selenium input data is a list that prints as:

[[u'DOE,JON'], [u'14'], [u'SK'], [u'$176,571.00'], [u'$2,000.00'], [u'SECURITIES AND EXCHANGE COMMISSION'], [u'WASHINGTON'], [u'GENERAL ATTORNEY'], [u'2012'], ...]

What I want to get to is a DF that handles an arbitrary number of records as:

NAME     GRADE SCALE SALARY      BONUS     AGENCY LOCATION POSITION YEAR
Doe, Jon 14    SK    $176,571.00 $2,000.00 SEC    DC       ATTY     2012
.
.
.

I've tried converting this list to a dictionary, using the zip() function with the column names as a tuple and the data as a list, and so on, all to no avail, though it's been a good tour of Python's features. What should the next step be after getting the data, or should I be reading the data in a different way?
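For reference, one direct route, assuming the scrape really does yield a flat list of single-element lists with nine cells per record as shown above, is to flatten it and slice it into rows before building the DataFrame. A sketch with hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample mimicking the scraped structure: one single-element
# list per table cell, nine cells per record.
scraped = [
    [u'DOE,JON'], [u'14'], [u'SK'], [u'$176,571.00'], [u'$2,000.00'],
    [u'SECURITIES AND EXCHANGE COMMISSION'], [u'WASHINGTON'],
    [u'GENERAL ATTORNEY'], [u'2012'],
]

cols = ['NAME', 'GRADE', 'SCALE', 'SALARY', 'BONUS',
        'AGENCY', 'LOCATION', 'POSITION', 'YEAR']

# Flatten the one-element lists, then slice the flat list into
# consecutive rows of nine cells each.
flat = [cell[0] for cell in scraped]
rows = [flat[i:i + len(cols)] for i in range(0, len(flat), len(cols))]
df = pd.DataFrame(rows, columns=cols)
```

This handles an arbitrary number of records as long as every record contributes exactly nine cells in order.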

Currently, the scraper code is:

from selenium import webdriver

path_to_chromedriver = '/Users/xxx/Documents/webdriver/chromedriver'  # change path as needed
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

url = 'http://www.fedsdatacenter.com/federal-pay-rates/index.php'
browser.get(url)

inputAgency = browser.find_element_by_id('a')
inputYear = browser.find_element_by_id('y')

# Send data
inputAgency.send_keys('SECURITIES AND EXCHANGE COMMISSION')
inputYear.send_keys('All')

# Submit the form, then pick the largest page size from the table's
# "Show entries" dropdown
browser.find_element_by_css_selector('input[type="submit"]').click()
browser.find_element_by_xpath('//*[@id="example_length"]/label/select/option[4]').click()

SMRtable = browser.find_element_by_id('example')

scrapedData = []

for td in SMRtable.find_elements_by_xpath('.//td'):
    scrapedData.append([td.get_attribute('innerHTML')])
    print(td.get_attribute('innerHTML'))

1 Answer


You can do this with pandas alone.

First, view the page source of the results page:

http://www.fedsdatacenter.com/federal-pay-rates/index.php?n=&l=&a=SECURITIES+AND+EXCHANGE+COMMISSION&o=&y=all

and look at lines 14807-14826:

// data table initialization
$(document).ready(function() {
    $('#example').dataTable( {
        "bPaginate": true,
        "bFilter": false,
        "bProcessing": true,
        "bServerSide": true,
        "aoColumns": [
        null,
        null,
        null,
        { "sType": 'currency' }, // set currency columns to allow sorting
        { "sType": 'currency' }, // set second column to currency to allow sorting
        null,
        null,
        null,
        null
        ],
        "sAjaxSource": "output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all"
    } );
} );

This means the page uses DataTables, and the data are loaded from an AJAX source as JSON.

So instead of scraping the HTML, you can fetch clean JSON directly:

output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all

The final link (with each space encoded as %20) is:

http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all

JSON:

{"sEcho":0,"iTotalRecords":"7072900","iTotalDisplayRecords":"19919","aaData":[
["ZUVER,SHAHEEN H","14","SK","$170,960.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"],
["ZUR,MIA C.","14","SK","$164,875.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"],
["ZUNDEL,JENNET LEONG","14","SK","$204,638.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","SAN FRANCISCO","ACCOUNTING","2014"],
["ZUKOWSKI,DAVID W","04","SK","$38,382.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","BOSTON","ADMIN AND OFFICE SUPPORT STUDENT TRAINEE","2014"],
...

You can parse this JSON with pandas' read_json:

import pandas as pd

df = pd.read_json("http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all")
print(df.head())
                                              aaData  iTotalDisplayRecords  \
0  [ZUVER,SHAHEEN H, 14, SK, $170,960.00, $0.00, ...                 19919   
1  [ZUR,MIA C., 14, SK, $164,875.00, $0.00, SECUR...                 19919   
2  [ZUNDEL,JENNET LEONG, 14, SK, $204,638.00, $0....                 19919   
3  [ZUKOWSKI,DAVID W, 04, SK, $38,382.00, $0.00, ...                 19919   
4  [ZOU,FAN, 14, SK, $166,650.00, $0.00, SECURITI...                 19919   

   iTotalRecords  sEcho  
0        7072900      0  
1        7072900      0  
2        7072900      0  
3        7072900      0  
4        7072900      0  

Then build a new DataFrame from the aaData column with a list comprehension:

df1 = pd.DataFrame([x for x in df['aaData']])
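Equivalently, `pd.DataFrame(df['aaData'].tolist())` expands the list column in one step. A sketch with a small stand-in frame (the rows here are hypothetical):

```python
import pandas as pd

# Stand-in for the frame returned by read_json above.
df = pd.DataFrame({'aaData': [['DOE,JON', '14', 'SK'],
                              ['ZOU,FAN', '14', 'SK']]})

# tolist() hands pandas a plain list of row lists, the same input the
# list comprehension produces.
df1 = pd.DataFrame(df['aaData'].tolist())
```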

Set the column names:

df1.columns = ['NAME','GRADE','SCALE','SALARY','BONUS','AGENCY','LOCATION','POSITION','YEAR']

print(df1.head())
                  NAME GRADE SCALE       SALARY  BONUS  \
0      ZUVER,SHAHEEN H    14    SK  $170,960.00  $0.00   
1           ZUR,MIA C.    14    SK  $164,875.00  $0.00   
2  ZUNDEL,JENNET LEONG    14    SK  $204,638.00  $0.00   
3     ZUKOWSKI,DAVID W    04    SK   $38,382.00  $0.00   
4              ZOU,FAN    14    SK  $166,650.00  $0.00   

                               AGENCY       LOCATION  \
0  SECURITIES AND EXCHANGE COMMISSION     WASHINGTON   
1  SECURITIES AND EXCHANGE COMMISSION     WASHINGTON   
2  SECURITIES AND EXCHANGE COMMISSION  SAN FRANCISCO   
3  SECURITIES AND EXCHANGE COMMISSION         BOSTON   
4  SECURITIES AND EXCHANGE COMMISSION     WASHINGTON   

                                   POSITION  YEAR  
0                          GENERAL ATTORNEY  2014  
1                          GENERAL ATTORNEY  2014  
2                                ACCOUNTING  2014  
3  ADMIN AND OFFICE SUPPORT STUDENT TRAINEE  2014  
4         INFORMATION TECHNOLOGY MANAGEMENT  2014  
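For numeric analysis you will likely also want to strip the currency formatting from SALARY and BONUS. A sketch, assuming the string format shown above:

```python
import pandas as pd

# Hypothetical frame with the currency-formatted columns shown above.
df1 = pd.DataFrame({'SALARY': ['$170,960.00', '$38,382.00'],
                    'BONUS': ['$0.00', '$2,000.00']})

# Remove '$' and ',' and cast to float so the columns sort and aggregate
# numerically instead of lexicographically.
for col in ['SALARY', 'BONUS']:
    df1[col] = df1[col].str.replace('[$,]', '', regex=True).astype(float)
```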

3 Comments

That's great, thanks! Need to get a better grasp of Javascript there too.
Actually, I found one further limitation that implies scraping might still be necessary: while "iTotalDisplayRecords" is "19919", the dataframe that results from this contains only 100 rows, matching the maximum of 100 rows that the row-select element allows. Know of any way around this?
You can try this URL: http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all&sEcho=4&iColumns=9&sColumns=&iDisplayStart=0&iDisplayLength=100000 and maybe change the last number (100000).
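Building on that comment: iDisplayStart and iDisplayLength are the standard DataTables server-side paging parameters, so you could also page through the full result set by varying them. A sketch, where the helper name is mine:

```python
from urllib.parse import urlencode

BASE = 'http://www.fedsdatacenter.com/federal-pay-rates/output.php'

def page_url(start, length=100,
             agency='SECURITIES AND EXCHANGE COMMISSION'):
    """Hypothetical helper: build one page of the DataTables AJAX URL."""
    params = {'n': '', 'a': agency, 'l': '', 'o': '', 'y': 'all',
              'sEcho': 1, 'iColumns': 9, 'sColumns': '',
              'iDisplayStart': start, 'iDisplayLength': length}
    return BASE + '?' + urlencode(params)
```

Pages could then be fetched in a loop, e.g. pd.read_json(page_url(0)), pd.read_json(page_url(100)), and so on, and concatenated, if the server rejects very large iDisplayLength values.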
