On the site, there are links at the top labeled 1, 2, 3, and next. Clicking a numbered link dynamically loads some data into a content div. Clicking next goes to a page with links labeled 4, 5, 6, and next, and shows the data for page 4.

I want to scrape the data from the content div for every numbered link (I don't know how many there are; the site shows only 3 at a time, plus next).

Please give an example of how to do it. For instance, consider the site www.cnet.com.

Please guide me to download the series of pages using Selenium, so that I can parse them with Beautiful Soup on my own.

  • Selenium has good tutorials; it would be an excellent place to start. Commented Dec 28, 2011 at 1:47
  • dm03514 is right; this is maybe not the right place to ask such a general question. Commented Dec 28, 2011 at 1:51

1 Answer

General layout (not tested):

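#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox  # pip install selenium
from selenium.webdriver.common.by import By

url = "http://example.com"
pages = []  # page source of every numbered page visited


def find_link(browser, text):
    # find_elements() returns an empty list when nothing matches,
    # whereas find_element() raises NoSuchElementException
    links = browser.find_elements(By.LINK_TEXT, text)
    return links[0] if links else None


# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    n = 1
    while url:
        browser.get(url)  # load page that has 1,2,3,next -like links
        link = find_link(browser, str(n))
        while link:
            browser.get(link.get_attribute("href"))  # get individual 1,2,3,4 pages
            pages.append(browser.page_source)  # or save it to a file instead
            browser.back()  # return to page that has 1,2,3,next -like links
            n += 1
            link = find_link(browser, str(n))

        # follow "next" to the following group of links; stop when it is gone
        link = find_link(browser, "next")
        url = link.get_attribute("href") if link else None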
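
If the numbered links load their data through JavaScript rather than through a usable href, clicking them may work better than calling browser.get(). Below is a minimal, untested-against-a-real-site sketch of that variant. It assumes that every page number other than the first of each group appears as a link, and that clicking "next" already shows the first page of the following group, as the question describes. The names first_link, collect_pages and max_pages are made up for illustration, and the functions only need an object with find_elements() and page_source, such as a Selenium WebDriver.

```python
LINK_TEXT = "link text"  # same string as selenium's By.LINK_TEXT


def first_link(browser, text):
    """Return the first link with exactly this text, or None if absent."""
    found = browser.find_elements(LINK_TEXT, text)
    return found[0] if found else None


def collect_pages(browser, max_pages=50):
    """Click 1, 2, 3, ... and use "next" whenever a number is not shown.

    max_pages is a hypothetical safety cap so the loop always terminates.
    """
    sources = []
    n = 1
    while n <= max_pages:
        link = first_link(browser, str(n))
        if link is None:
            nxt = first_link(browser, "next")
            if nxt is None:
                break  # neither the number nor "next" is present: done
            nxt.click()  # assumed to display page n, per the question
        else:
            link.click()
        sources.append(browser.page_source)
        n += 1
    return sources
```

With a real browser this would be called as collect_pages(browser) after browser.get(url). Dynamically loaded content may not be in page_source immediately after click(), so an explicit wait (for example Selenium's WebDriverWait) before reading page_source is probably needed on a real site.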

9 Comments

The post was helpful, but I need to find the element by its class name.
@user1118534: update your question and specify what links at the top labeled "1", "2", "3", and "next" means in your case (if you're unsure then just post the html of the link: <a href="..." ...>...</a>). You could use browser.find_element_by_class_name(classname) to find an element by its class name.
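
A small sketch of the class-name lookup mentioned above, in case it helps. The helper name texts_by_class and the class name in the usage note are invented for illustration; the locator string is the value of Selenium's By.CLASS_NAME, which as far as I know expects a single class name rather than several separated by spaces.

```python
CLASS_NAME = "class name"  # same string as selenium's By.CLASS_NAME


def texts_by_class(browser, class_name):
    """Return the visible text of every element that carries class_name.

    `browser` can be a WebDriver or a WebElement; both offer find_elements().
    """
    return [el.text for el in browser.find_elements(CLASS_NAME, class_name)]
```

For example, texts_by_class(browser, "content") would return the text of each element with the (hypothetical) class "content", or an empty list if there is none.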
I am learning to scrape websites that use JavaScript. As part of that, I would like to scrape the editor reviews and user reviews for all the HP laptops on www.cnet.com. To reach the desired page: go to www.cnet.com, click Reviews, go to Laptops, choose View all brands, and select the HP check box. My goal is to scrape the editor and user reviews for each laptop on all the pages (1, 2, 3, 4, ...) listed at the top. I would be very grateful if you could guide me in doing this.
@koushik: 1. Make sure that their TOS allows such use. 2. To go to the 3rd page you could use: link = browser.find_element_by_link_text("3"); link.click(). To get the reviews, save browser.page_source for each of the pages 1, 2, 3, 4, 5, etc. and parse them for links later. 3. It might be simpler to use RSS or an API instead of scraping, if one is available.
Thank you very much. I will try this out, and if I have anything else to ask I will get back to you.
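
The "save now, parse for links later" step could be sketched roughly as follows. This is only an illustration: save_pages, LinkCollector, links_in and the page_001.html naming scheme are invented here, and the standard library's html.parser stands in for Beautiful Soup so that the sketch has no third-party dependency.

```python
from html.parser import HTMLParser
from pathlib import Path


def save_pages(sources, out_dir):
    """Write each page source to out_dir/page_001.html, page_002.html, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, html in enumerate(sources, start=1):
        path = out / f"page_{i:03d}.html"
        path.write_text(html, encoding="utf-8")
        paths.append(path)
    return paths


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_in(path):
    """Return the hrefs found in one saved page, in document order."""
    parser = LinkCollector()
    parser.feed(Path(path).read_text(encoding="utf-8"))
    return parser.hrefs
```

Here sources would be the list of browser.page_source strings gathered while paging. Keeping the raw HTML on disk means the parsing can be redone later without driving the browser again.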