I am trying to scrape data from a Highcharts chart. I looked at similar questions, but I didn't understand how execute_script works or how I could find the relevant JavaScript using my browser. Here is my current code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Core settings
chrome_path = r"C:\Users\X\Y\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.implicitly_wait(15)

stats_url = 'https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/'

driver.get(stats_url)
driver.find_element_by_link_text('by Source').click()
driver.find_element_by_id('custom-date-range').click()
year = driver.find_element_by_id('date-range-start')
year.click()
for i in range(5): # goes back 5 years
    year.send_keys(Keys.ARROW_DOWN)
driver.find_element_by_id('date-range-submit').click()

I want to scrape the "download" data from the graph (not only for this page, but for many pages). When I use the custom date range option, the CSV file that the website generates automatically is not updated, so the only way seems to be scraping the data from the graph. How can I do it?

  • Could the person who gave the negative vote please provide an answer? Commented Oct 20, 2017 at 16:15
  • Oh, I am not really accustomed to Selenium, but let me edit it. But I am not sure whether the sleep time or find_element_by_xpath has anything to do with the question that I asked. Commented Oct 20, 2017 at 16:34
  • Also, could you please tell me the 101 on how not to use xpath? :D I am willing to learn. Commented Oct 20, 2017 at 16:37
  • As for xpath, the 101 is: 1 - avoid using //* (it doesn't help with readability, and it is also the least performant of all options); 2 - avoid position-based location (e.g. [1], [2], etc.); 3 - limit xpath to meaningful elements that represent something important, and don't try to follow the whole subtree; 4 - don't use xpath if the element can be identified in a better way. So, for example, find_element_by_xpath('//*[@id="side-nav"]/ul/li[2]/a') becomes find_element_by_partial_link_text("Downloads"). Commented Oct 20, 2017 at 17:43
  • Selenium has a method called find_element_by_id(). Use that instead. Commented Oct 20, 2017 at 17:48

2 Answers


Mozilla provides a simple REST API to get the stats, so you don't need to use Selenium.

With the requests module:

import requests

url = "https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20170823-20171023.json"
data = requests.get(url).json()

To select the range, simply update the dates in the URL.
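
As a small sketch of that, the URL can be built from two dates. The helper name `stats_url` and the `BASE` constant are made up for illustration; the URL pattern is just the one shown above, with both dates formatted as YYYYMMDD.

```python
from datetime import date

# Everything in the stats URL above that comes before the date range.
BASE = ("https://addons.mozilla.org/en-US/firefox/addon/"
        "adblock-plus/statistics/downloads-day-")


def stats_url(start: date, end: date, ext: str = "json") -> str:
    """Build the stats URL for the range start..end (hypothetical helper)."""
    return f"{BASE}{start:%Y%m%d}-{end:%Y%m%d}.{ext}"


print(stats_url(date(2017, 8, 23), date(2017, 10, 23)))
```

The resulting string can be passed to requests.get() exactly like the hard-coded URL above.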

But if you still want to scrape the chart with Selenium:

dates = driver.execute_script("return Highcharts.charts[0].series[0].xData")
users = driver.execute_script("return Highcharts.charts[0].series[0].yData")
downloads = driver.execute_script("return Highcharts.charts[0].series[1].yData")
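
On a datetime axis, Highcharts keeps the x values as milliseconds since the Unix epoch, so the returned lists can be paired up as dates and counts. A minimal sketch, assuming this chart uses a datetime x axis (not verified for this particular page); `pair_series` is a hypothetical helper name:

```python
from datetime import datetime, timezone


def pair_series(x_data, y_data):
    """Pair Highcharts xData (epoch milliseconds) with the yData values."""
    return [
        (datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date(), value)
        for ms, value in zip(x_data, y_data)
    ]


# With Selenium, the inputs would be the `dates` and `downloads` lists above.
rows = pair_series([0, 86400000], [10, 20])
```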

2 Comments

Actually, I figured out yesterday that I could use the REST API, but I think there is not enough documentation about scraping Highcharts. Could you please explain how you know that the chart actually corresponds to Highcharts.charts[0]? I couldn't find it by viewing the page source.
Highcharts.charts[0] is the first chart in the page. To get the chart related to a DOM element: Highcharts.charts[document.querySelector("#head-chart").dataset["highchartsChart"]]
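
Put together in Python, that lookup might look like the sketch below. It assumes the chart container matches the CSS selector passed in and carries the data-highcharts-chart attribute that the expression above reads; `series_data` is a hypothetical helper, and `driver` is any Selenium WebDriver.

```python
# JavaScript run in the page: find the chart rendered in the container
# and return the name and x/y data of one of its series.
JS = """
var el = document.querySelector(arguments[0]);
var chart = Highcharts.charts[el.dataset["highchartsChart"]];
var s = chart.series[arguments[1]];
return {name: s.name, x: s.xData, y: s.yData};
"""


def series_data(driver, selector, index=0):
    """Return name, xData and yData of series `index` of the chart in `selector`."""
    # Extra arguments to execute_script are exposed to the script as `arguments`.
    return driver.execute_script(JS, selector, index)
```

For example, series_data(driver, "#head-chart", 1) would read the same series as Highcharts.charts[0].series[1] if that container holds the first chart.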

I noticed one thing. The question says:

"when I use custom search option, csv file that automatically generated by the website is not updated".

It looks that way, but it is not actually true. The file is updated, but the maximum custom date range seems to be 1 year.

For example, if you set the range from 2013-09-23 to 2017-10-23, the generated .csv (or .json) contains at most 1 year of data (in this example, from 22/10/2016 to 21/10/2017).

This is easier to see if you play with the boundaries of the range.

For example with:

https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20131023-20141023.json
  • first element: {"date": "2014-10-23", "count": 212730, "end": "2014-10-23"}
  • last element: {"date": "2013-10-24", "count": 163094, "end": "2013-10-24"}

If you move the end date forward by one day:

https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20131023-20141024.json
  • first element: {"date": "2014-10-24", "count": 215105, "end": "2014-10-24"}
  • last element: {"date": "2013-10-25", "count": 168018, "end": "2013-10-25"}

Or with:

https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20131022-20141023.json

the result is again the same as in the first case:

  • first element: {"date": "2014-10-23", "count": 212730, "end": "2014-10-23"}
  • last element: {"date": "2013-10-24", "count": 163094, "end": "2013-10-24"}

So, to get the data for the last 5 years, you could request one year at a time:

import subprocess

interested_years = 5
today = "2017-10-23"
year_str, month, day = today.split("-")
date_end = year_str + month + day
url = "https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-"

for year in range(1, interested_years + 1):
    # Each request covers one year, because longer ranges are truncated.
    date_start = str(int(year_str) - year) + month + day
    tmp_url = url + date_start + "-" + date_end + ".csv"
    args = ["curl", "-O", tmp_url]
    print(" ".join(args))
    # Download the CSV for this window into the current directory.
    subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # The next window ends where this one starts.
    date_end = date_start
    print("-----------------------------")
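
The same one-year windows can also be computed with the datetime module instead of string slicing. This is only a sketch under the same assumption as above (each request returns at most one year of data); `yearly_urls` and `BASE` are hypothetical names, and downloading the URLs is left to curl or requests.

```python
from datetime import date

BASE = ("https://addons.mozilla.org/en-US/firefox/addon/"
        "adblock-plus/statistics/downloads-day-")


def yearly_urls(today: date, years: int, ext: str = "csv"):
    """Return one URL per 1-year window, newest window first."""
    urls = []
    end = today
    for _ in range(years):
        try:
            start = end.replace(year=end.year - 1)
        except ValueError:
            # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
            start = end.replace(year=end.year - 1, day=28)
        urls.append(f"{BASE}{start:%Y%m%d}-{end:%Y%m%d}.{ext}")
        end = start  # the next window ends where this one starts
    return urls


urls = yearly_urls(date(2017, 10, 23), 5)
```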

1 Comment

Thank you very much for the answer, I somehow figured it out before I set a bounty for this question but also wanted to learn about how to scrape data from highcharts
