Using Python with Selenium to scrape dynamic web pages

Question

The site has a few links at the top labeled 1, 2, 3, and next. Clicking a numbered link dynamically loads data into a content div. Clicking next goes to a page with links labeled 4, 5, 6, and next, and the data for page 4 is shown.

I want to scrape the data from the content div for every link (I don't know how many there are; the site shows only three at a time, plus next).

Please give an example of how to do this. For instance, consider the site www.cnet.com.

Please show me how to download the series of pages using Selenium; I will parse them with Beautiful Soup on my own.

Selenium has good tutorials; that would be an excellent place to start.
– dm03514, Commented Dec 28, 2011 at 1:47
dm03514 is right; this may not be the right place to ask such a general question.
– Niklas B., Commented Dec 28, 2011 at 1:51

jfs · Accepted Answer · 2011-12-28 05:09:55Z

General layout (not tested):

#!/usr/bin/env python
from contextlib import closing

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Firefox  # pip install selenium
from selenium.webdriver.common.by import By


def find_link(browser, text):
    """Return the link with the given text, or None if there is none."""
    try:
        return browser.find_element(By.LINK_TEXT, text)
    except NoSuchElementException:
        return None


url = "http://example.com"

# use Firefox to get pages with JavaScript-generated content
with closing(Firefox()) as browser:
    n = 1
    while n < 10:  # safety cap; raise it for longer listings
        browser.get(url)  # load page
        link = find_link(browser, str(n))
        while link:
            browser.get(link.get_attribute("href"))  # get individual 1, 2, 3, 4 pages
            #### save(browser.page_source)
            browser.back()  # return to the page that has the 1, 2, 3, next links
            n += 1
            link = find_link(browser, str(n))

        link = find_link(browser, "next")
        if not link:
            break
        url = link.get_attribute("href")
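
The listing above leaves saving each page as a commented-out placeholder. One possible way to fill it in is sketched below, using only the standard library. The function name save, the extra page-number argument n, the output directory "pages", and the file-name pattern are all assumptions made for illustration; they are not part of the original answer.

```python
from pathlib import Path


def save(page_source, n, out_dir="pages"):
    """Write the HTML of page n to <out_dir>/page_NNNN.html and return the path.

    Hypothetical helper: the directory name and file-name pattern are
    illustrative choices, not something the original answer prescribes.
    """
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    path = directory / f"page_{n:04d}.html"
    path.write_text(page_source, encoding="utf-8")
    return path
```

Inside the loop this would be called as save(browser.page_source, n) before n is incremented, so that the saved files can be parsed later without keeping the browser open.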


9 Comments

Koushik Over a year ago

The post was helpful, but I need to find the element by its class name.

jfs Over a year ago

@user1118534: update your question and specify what the links at the top labeled "1", "2", "3", and "next" mean in your case (if you're unsure, just post the HTML of the link: <a href="..." ...>...</a>). You could use browser.find_element(By.CLASS_NAME, classname) to find an element by its class name (older Selenium releases spell this browser.find_element_by_class_name(classname)).
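
As a rough illustration of the class-name lookup mentioned above, the sketch below collects the text of every element with a given class. The helper name texts_by_class is hypothetical. The constant holds the same string as Selenium's By.CLASS_NAME, written out here only so the helper does not need to import Selenium itself; with Selenium imported, passing By.CLASS_NAME directly is the usual spelling.

```python
CLASS_NAME = "class name"  # same string as selenium.webdriver.common.by.By.CLASS_NAME


def texts_by_class(browser, classname):
    """Return the visible text of every element that has the given class.

    `browser` is assumed to be an already started Selenium WebDriver.
    find_elements returns an empty list (it does not raise) when nothing matches.
    """
    return [element.text for element in browser.find_elements(CLASS_NAME, classname)]
```

For example, texts_by_class(browser, "review") would return the text of each element whose class is "review"; that class name is made up, so check the real page's HTML for the one to use.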

Koushik Over a year ago

I am learning to scrape websites that use JavaScript. As part of that, I would like to scrape the editor reviews and user reviews for all the HP laptops on www.cnet.com. To reach the desired page: go to www.cnet.com, click Reviews, go to Laptops, then View all brands, and select the HP check box. My goal is to scrape the editor and user reviews for each laptop on all the pages (1, 2, 3, 4, ...) listed at the top. I would be very grateful if you could guide me in doing this.

jfs Over a year ago

@koushik: 1. Make sure that their TOS allows such use. 2. To go to the 3rd page you could use: link = browser.find_element(By.LINK_TEXT, "3"); link.click() (older Selenium releases spell this browser.find_element_by_link_text("3")). To get the reviews, save browser.page_source for each of the pages 1, 2, 3, 4, 5, etc. and parse them for links later. 3. If an RSS feed or an API is available, it might be simpler to use that instead of scraping.
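
The click-and-save idea in point 2 could be sketched roughly as follows. This is an untested-against-a-real-site sketch under several assumptions: the function name collect_pages, the max_pages cap, and the fixed settle_seconds pause are all hypothetical, and it assumes that clicking "next" already shows the data for the first page of the new block, as the question describes. The constant holds the same string as Selenium's By.LINK_TEXT, written out so the function works with any already started driver.

```python
import time

LINK_TEXT = "link text"  # same string as selenium.webdriver.common.by.By.LINK_TEXT


def collect_pages(browser, max_pages=50, settle_seconds=1.0):
    """Click the links labeled 1, 2, 3, ... and return each page's HTML source.

    `browser` is assumed to be a started Selenium WebDriver that is already
    showing the page with the numbered links.
    """
    pages = []
    for n in range(1, max_pages + 1):
        links = browser.find_elements(LINK_TEXT, str(n))
        if not links:
            # n is not in the visible block; "next" loads the block starting at n
            links = browser.find_elements(LINK_TEXT, "next")
        if not links:
            break  # neither the number nor "next" is present: last page reached
        links[0].click()
        time.sleep(settle_seconds)  # crude pause while the content div reloads
        pages.append(browser.page_source)
    return pages
```

A fixed pause is the simplest thing that can work; Selenium's WebDriverWait with a condition on the content div is likely to be more reliable on a slow site.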

Koushik Over a year ago

Thank you very much. I will try this out, and if I have anything else to ask I will get back to you.
