How can I web-scrape with Python when the HTML doesn't change?

Question

I'm currently using Selenium and BeautifulSoup to try to scrape financial statement data from Google Finance. For example:

http://www.google.com/finance?q=GOOG&fstype=ii

opens to the Income Statement for Google. When I have Selenium click the "Balance Sheet" and "Cash Flow" buttons at the top of the page, the charts and tables on the page change, but the URL doesn't, and when I pull the page source I still get the original page with the Income Statement table. My code is posted below:

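from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# `ticker` is defined elsewhere (not shown); ticker[0] is a symbol such as 'GOOG'
driver = webdriver.Firefox()
driver.get("http://www.google.com/finance?q=" + ticker[0] + "&fstype=ii")

# page source right after load (Income Statement)
url1 = driver.page_source
soup1 = BeautifulSoup(url1, "html.parser")

# click the second tab
element = driver.find_element(By.XPATH, '//*[@id=":1"]/a/b/b')
element.click()

driver.implicitly_wait(3.0)
url2 = driver.page_source
soup2 = BeautifulSoup(url2, "html.parser")

# click the third tab
element = driver.find_element(By.XPATH, '//*[@id=":2"]/a/b/b')
element.click()

driver.implicitly_wait(3.0)
url3 = driver.page_source
soup3 = BeautifulSoup(url3, "html.parser")

driver.quit()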

Any help is appreciated. Thanks.

alecxe · Accepted Answer · 2014-07-14 00:05:38Z

You don't need the BeautifulSoup HTML parser here. Selenium alone is powerful enough to navigate the page and locate elements by id, tag name, XPath, CSS selector and more.

The table data you need is inside div elements with different ids. Activate each tab and get the data from the appropriate div.

Here's an example that prints out headers of the tables inside all of the tabs:

from selenium import webdriver
from selenium.webdriver.common.by import By


def print_header(element):
    """Print the text of every header cell of the table inside element."""
    table = element.find_element(By.ID, 'fs-table')
    for header in table.find_elements(By.TAG_NAME, 'th'):
        print(header.text)


driver = webdriver.Firefox()
driver.get('http://www.google.com/finance?q=GOOG&fstype=ii')

# the Income Statement tab is active when the page loads
print_header(driver.find_element(By.ID, 'incinterimdiv'))
print("----")

# activate Balance Sheet
element = driver.find_element(By.XPATH, '//*[@id=":1"]/a/b/b')
element.click()

print_header(driver.find_element(By.ID, 'balinterimdiv'))
print("----")

# activate Cash Flow
element = driver.find_element(By.XPATH, '//*[@id=":2"]/a/b/b')
element.click()

print_header(driver.find_element(By.ID, 'casinterimdiv'))

driver.quit()

Prints:

In Millions of USD (except for per share items)
3 months ending 2014-03-31
3 months ending 2013-12-31
3 months ending 2013-09-30
3 months ending 2013-06-30
3 months ending 2013-03-31
----
In Millions of USD (except for per share items)
As of 2014-03-31
As of 2013-12-31
As of 2013-09-30
As of 2013-06-30
As of 2013-03-31
----
In Millions of USD (except for per share items)
3 months ending 2014-03-31
12 months ending 2013-12-31
9 months ending 2013-09-30
6 months ending 2013-06-30
3 months ending 2013-03-31
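
A note on timing, since the question's code calls implicitly_wait after each click: that call does not pause the script, it only sets how long element lookups keep retrying. If a tab's div is not ready right after the click, an explicit wait is the usual remedy. Selenium ships one as WebDriverWait in selenium.webdriver.support.ui; the sketch below is a hand-rolled stand-in for the same polling idea, not Selenium's implementation, and the name wait_for and its parameters are made up for illustration.

```python
import time


def wait_for(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success. `wait_for` is a hypothetical helper,
    not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Usage with the driver from the example above (needs a live browser):
# element.click()
# wait_for(lambda: driver.find_element("id", "balinterimdiv").is_displayed())
# print_header(driver.find_element("id", "balinterimdiv"))
```

With Selenium's own classes the equivalent would be WebDriverWait(driver, 10).until(...) combined with a condition such as visibility_of_element_located from selenium.webdriver.support.expected_conditions.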

2 Comments

user2395969 Over a year ago

So would I add another for loop in the print_header function, something like: for col in table.find_elements_by_tag_name('td'): and then save the results in a Python object?

alecxe Over a year ago

@user2395969 You can find any elements inside the table, such as each tr and its cells; it depends on your desired output. The point here is to use Selenium only. Hope that helps.
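
To make the exchange above concrete, here is a minimal sketch of collecting a whole table into a Python list of rows using Selenium calls only. It assumes each tr holds its th/td cells as direct children, which was not checked against the real page, and table_to_rows is a hypothetical helper name. The locator strategies are written as their plain string values so the helper can be tried without a browser.

```python
# String values of Selenium's By.TAG_NAME and By.XPATH locator strategies.
TAG_NAME = "tag name"
XPATH = "xpath"


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell texts.

    `table` is expected to be a Selenium WebElement for a <table>.
    Assumption: header and data cells are direct children of each <tr>.
    """
    rows = []
    for tr in table.find_elements(TAG_NAME, "tr"):
        # th and td children of this row, in document order
        cells = tr.find_elements(XPATH, "./th | ./td")
        rows.append([cell.text.strip() for cell in cells])
    return rows


# Usage with the driver from the answer (needs a live browser session):
# div = driver.find_element("id", "balinterimdiv")
# rows = table_to_rows(div.find_element("id", "fs-table"))
```

The result is a plain nested list, so it can be handed to csv.writer or a DataFrame constructor without going through BeautifulSoup.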
