I want to scrape the number of participants shown on the following news page: http://news.sina.com.cn/c/2013-07-11/175827642839.shtml. The number I want is 820. It is generated by JavaScript. Is there a simple way to get that number?

  • Sending Json to a python server? Commented Jul 14, 2013 at 1:02

1 Answer

You could analyze the JavaScript code and do the same thing in Python, or you can use Selenium in Python.

Edit:

Here is the example from the Selenium page, changed to do what you need.

It opens a browser (Firefox), waits 5 seconds for the page to load, and gets the text.

#!/usr/bin/python

import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()  # Start a local Firefox session
try:
    browser.get("http://news.sina.com.cn/c/2013-07-11/175827642839.shtml")  # Load the page
    time.sleep(5)  # Give the page's JavaScript time to run
    # Find the first <span> whose class contains 'f_red'
    element = browser.find_element(By.XPATH, "//span[contains(@class,'f_red')]")
    print(element.text)  # Print the element's text
except NoSuchElementException:
    raise SystemExit("can't find f_red")
finally:
    browser.quit()  # Close the browser even if the lookup failed
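
A fixed 5-second sleep is either longer than needed or too short on a slow connection. As a rough sketch of the alternative, the helper below polls a condition until it succeeds or a timeout expires. `wait_until` is a hypothetical name introduced here for illustration, not part of Selenium; Selenium ships its own `WebDriverWait` class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success. `poll` is the pause between attempts.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Possible use with the browser session from the answer (not run here):
#   spans = wait_until(lambda: browser.find_elements(
#       By.XPATH, "//span[contains(@class,'f_red')]"))
#   print(spans[0].text)
```

The helper returns as soon as the element appears, so pages that load quickly do not pay the full delay.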

5 Comments

I added an example to my answer. It uses Firefox to get what you need.
Yesterday the page showed 820. Today it shows 823, so today my example gives 823 (print element.text). Or I'm looking in the wrong place.
Yeah, the code is excellent, but it opens the Firefox browser. If I have millions of web pages to scrape, it will not be efficient. Do you have any tips for that?
I heard that there is a `webdriver.i_dont_remember_name` which doesn't open any browser, but it still needs time.sleep to wait for the JavaScript. For scraping I use urllib + pyQuery, but that works only with HTML, so I get the JavaScript, analyze what it does step by step, and look for the source of the information. If I find a URL (mostly in AJAX calls), I can try to use it directly in Python. This way the script can work fast enough to get millions of pages (you can use threads to get more pages at the same time).
But sometimes the script works too fast and the server knows that it has to be a script or a bot. Servers don't like bots: bots don't click advertisements, so servers don't earn money :)
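
The approach from the last two comments (call the AJAX URL directly, use threads, don't go too fast) might look roughly like the sketch below. Everything specific in it is an assumption: `COUNT_URL` is a placeholder, not the real Sina endpoint, and the `total` field name and JSONP wrapper are invented for illustration. The real URL and response format have to be found by watching the page's network requests.

```python
import json
import re
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

# Placeholder endpoint (assumption): replace with the URL the page's
# JavaScript actually requests.
COUNT_URL = "http://example.com/comment/count?news_id={news_id}&callback=cb"


def parse_jsonp(body):
    """Strip a JSONP wrapper such as cb({...}); and decode the JSON inside."""
    match = re.search(r"\((.*)\)\s*;?\s*$", body, re.DOTALL)
    return json.loads(match.group(1) if match else body)


def extract_count(body, key="total"):
    """Return the integer stored under `key` (an assumed field name)."""
    return int(parse_jsonp(body)[key])


def fetch_count(news_id, delay=0.5):
    """Download one response and pull the count out of it."""
    time.sleep(delay)  # Pause so requests are not fired too fast
    request = Request(COUNT_URL.format(news_id=news_id),
                      headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
    return extract_count(body)


def fetch_many(news_ids, workers=4, fetch=fetch_count):
    """Fetch several counts in parallel threads; returns {news_id: count}."""
    ids = list(news_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(ids, pool.map(fetch, ids)))
```

No browser is started, so each page costs one small HTTP request. The `delay` and `workers` values are arbitrary starting points and should be tuned to what the server tolerates.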
