I want to scrape the number of participants from the following news article. The URL is http://news.sina.com.cn/c/2013-07-11/175827642839.shtml and I want to get the number 820. It is generated by JavaScript. How can I get that number in a simple way?
1 Answer
You could analyze the JavaScript code and do the same in Python, or you can use Selenium from Python.
edit:
Here is an example from the Selenium page, changed to do what you need.
It opens a browser (Firefox), waits 5 seconds (for the page to load), and gets the text.
#!/usr/bin/python
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()  # Start a local Firefox session
browser.get("http://news.sina.com.cn/c/2013-07-11/175827642839.shtml")  # Load the page
time.sleep(5)  # Give the page's JavaScript time to run
try:
    # Find the first <span> whose class attribute contains "f_red"
    element = browser.find_element(By.XPATH, "//span[contains(@class,'f_red')]")
    print(element.text)  # Print the element's text
except NoSuchElementException:
    print("can't find f_red")
finally:
    browser.quit()  # Close the browser even if the lookup failed
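For the first option mentioned above (analyzing the JavaScript and doing the same in Python), the number is often loaded by the page from a separate URL that returns JSON, sometimes wrapped in a callback (JSONP). Here is a minimal sketch of unwrapping such a response, written for Python 3. The sample payload, the callback name cb, and the field names result/count/total are made up for illustration; they are not Sina's actual response, which you would have to find yourself (for example in the browser's network tab).

```python
import json
import re


def parse_jsonp(text):
    """Strip an optional JSONP wrapper such as cb({...}); and parse the JSON inside."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    payload = match.group(1) if match else text  # plain JSON passes through unchanged
    return json.loads(payload)


# Hypothetical response body, NOT the real one from the site
sample = 'cb({"result": {"count": {"total": 820}}});'
data = parse_jsonp(sample)
print(data["result"]["count"]["total"])
```

Once you know the real URL and field names, the response text would come from urllib instead of the sample string.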
5 Comments
furas
I added an example to my answer. It uses Firefox to get what you need.
furas
Yesterday the page showed 820. Today it shows 823, so today my example gives 823 (the value of element.text). Or I'm looking in the wrong place.
mjc
Yeah, the code is excellent, but it opens the Firefox browser. If I have millions of web pages to scrape, it will not be efficient. Do you have any tips for that?
furas
I heard that there is webdriver.i_dont_remember_name, which doesn't open any browser, but it still needs time.sleep to wait for the JavaScript. For scraping I use urllib + pyQuery, but that works only with HTML. So I get the JavaScript, analyze what it does step by step, and look for the source of the information. If I find some URL (mostly in an AJAX call), I can try to use it directly in Python. This way the script can work fast enough to get millions of pages (you can use threads to get more pages at the same time).
furas
But sometimes a script works too fast and the server knows that it must be a script or a bot. Servers don't like bots: bots don't click advertisements, so servers don't earn money :)
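To illustrate the two points above (threads for speed, but not so fast that the server notices), here is a rough sketch for Python 3 using only the standard library. The names Throttle, fetch, and fetch_all, as well as the default of 4 workers and 0.5 seconds between requests, are my own assumptions for illustration, not something from the site or from any library.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


class Throttle:
    """Let at most one request start every `min_interval` seconds, across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_time = 0.0

    def wait(self):
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_time - now)
            # Reserve the next start slot before sleeping, so threads queue up in order
            self._next_time = max(now, self._next_time) + self.min_interval
        if delay:
            time.sleep(delay)


def fetch(url, timeout=10):
    """Download one URL and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, workers=4, min_interval=0.5):
    """Fetch many URLs with a small thread pool, pacing how often requests start."""
    throttle = Throttle(min_interval)

    def paced(url):
        throttle.wait()
        return fetcher(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the same order as the input URLs
        return dict(zip(urls, pool.map(paced, urls)))
```

Calling fetch_all with a list of URLs returns a dict mapping each URL to its downloaded bytes. The fetcher parameter is there so you can swap in your own function (for example one that also parses the response), and min_interval is the knob to turn up if the server starts blocking you.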