Scraping data from Highcharts using selenium

Question

I am trying to scrape data from a Highcharts chart. I looked at similar questions, but I didn't understand how execute_script works or how to inspect the page's JavaScript in my browser. Here is my current code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Core settings
chrome_path = r"C:\Users\X\Y\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.implicitly_wait(15)

stats_url = 'https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/'

driver.get(stats_url)
driver.find_element_by_link_text('by Source').click()
driver.find_element_by_id('custom-date-range').click()
year = driver.find_element_by_id('date-range-start')
year.click()
for i in range(5): # goes back 5 years
    year.send_keys(Keys.ARROW_DOWN)
driver.find_element_by_id('date-range-submit').click()

I want to scrape the "download" data from the graph (not only for this page, but for many pages). When I use the custom date range option, the CSV file that the website generates automatically is not updated, so the only way seems to be to scrape the data from the graph. How could I do it?

Could the person who gave the negative vote please provide an answer?
– edyvedy13, Commented Oct 20, 2017 at 16:15
Oh, I am not really familiar with Selenium, but let me edit it. I am not sure, though, whether the sleep time or find_element_by_xpath has anything to do with the question I asked.
– edyvedy13, Commented Oct 20, 2017 at 16:34
Also, could you please tell me why it is a 101 on how not to use XPath? :D I am willing to learn.
– edyvedy13, Commented Oct 20, 2017 at 16:37
As for XPath, the 101 is: 1 - avoid using //* (it doesn't help readability and is the least performant of all options); 2 - avoid position-based location (e.g. [1], [2], etc.); 3 - limit XPath to meaningful elements that represent something important, and don't try to follow the whole subtree; 4 - don't use XPath if the element can be identified in a better way. So for example, find_element_by_xpath('//*[@id="side-nav"]/ul/li[2]/a') becomes find_element_by_partial_link_text("Downloads").
– timbre timbre, Commented Oct 20, 2017 at 17:43
Selenium has a method called find_element_by_id(). Use that instead.
– Mangohero1, Commented Oct 20, 2017 at 17:48

Florent B. · Accepted Answer · 2017-10-23 13:32:22Z

Mozilla provides a simple REST API to get the stats, so you don't need to use Selenium.

With the requests module:

import requests

url = "https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20170823-20171023.json"
data = requests.get(url).json()

To select the range, simply update the dates in the URL.
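
As a minimal sketch of updating the dates, the URL can be built from two date objects. The helper name stats_url and its parameters are illustrative and not part of any Mozilla API; the URL pattern is only inferred from the example URL in this answer, and the endpoint may have changed since it was written.

```python
from datetime import date

# URL pattern inferred from the example URL; not an official, documented template.
BASE = (
    "https://addons.mozilla.org/en-US/firefox/addon/{addon}/statistics/"
    "{metric}-day-{start}-{end}.json"
)


def stats_url(addon, start, end, metric="downloads"):
    """Build a daily statistics URL for an add-on slug and a date range."""
    return BASE.format(
        addon=addon,
        metric=metric,
        start=start.strftime("%Y%m%d"),
        end=end.strftime("%Y%m%d"),
    )


url = stats_url("adblock-plus", date(2017, 8, 23), date(2017, 10, 23))
print(url)
```

The resulting string can then be passed to requests.get(url).json() in the same way as the hard-coded URL.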

But if you still want to scrape the chart with Selenium:

dates = driver.execute_script("return Highcharts.charts[0].series[0].xData")
users = driver.execute_script("return Highcharts.charts[0].series[0].yData")
downloads = driver.execute_script("return Highcharts.charts[0].series[1].yData")
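
The values returned by execute_script arrive as plain Python lists. On a chart with a datetime x-axis, Highcharts stores the x values as milliseconds since the Unix epoch, so a small helper can pair them with the y values. This is a sketch assuming a datetime axis and lists of equal length; to_rows is a hypothetical helper name, and the sample numbers are made up for illustration.

```python
from datetime import datetime, timezone


def to_rows(x_data, y_data):
    """Pair Highcharts x values (assumed ms since epoch, UTC) with y values."""
    rows = []
    for ms, value in zip(x_data, y_data):
        day = datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()
        rows.append((day.isoformat(), value))
    return rows


# Made-up sample values standing in for `dates` and `downloads`.
sample_dates = [1508630400000, 1508716800000]
sample_downloads = [100, 120]
rows = to_rows(sample_dates, sample_downloads)
print(rows)
```

With a live driver, the call would be to_rows(dates, downloads).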

2 Comments

edyvedy13 Over a year ago

Actually, I figured out yesterday that I could use the REST API, but I think there is not enough documentation about scraping Highcharts. Could you please explain how you know that the chart actually corresponds to Highcharts.charts[0]? I couldn't find it by viewing the page source.

Florent B. Over a year ago

Highcharts.charts[0] is the first chart in the page. To get the chart related to a DOM element: Highcharts.charts[document.querySelector("#head-chart").dataset["highchartsChart"]]
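
The lookup in this comment can be wrapped in a Python helper that passes the CSS selector as a script argument. This is a sketch, not a recipe verified against this page: chart_series is a hypothetical name, it assumes the element carries the data-highcharts-chart attribute that Highcharts sets on its container, and it relies on the same xData/yData properties used in the answer.

```python
# JavaScript run in the page; arguments[0] is the CSS selector passed from Python.
SCRIPT = """
var el = document.querySelector(arguments[0]);
if (!el || !window.Highcharts) { return null; }
var chart = Highcharts.charts[el.dataset.highchartsChart];
if (!chart) { return null; }
return chart.series.map(function (s) {
    return {name: s.name, x: s.xData, y: s.yData};
});
"""


def chart_series(driver, css_selector):
    """Return a list of {name, x, y} dicts for the chart in the given element.

    Returns None if the element, Highcharts, or the chart cannot be found.
    """
    return driver.execute_script(SCRIPT, css_selector)
```

For example, series = chart_series(driver, "#head-chart") would return one dict per series, or None if nothing matched.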

Davide Patti · Accepted Answer · 2018-12-21 08:34:02Z

I noticed one thing.

At first glance, this seems true:

"when I use custom search option, csv file that automatically generated by the website is not updated".

But it is not. The file is updated; the maximum custom date range just seems to be 1 year.

For example, if you set the range from 2013-09-23 to 2017-10-23, the generated .csv (or .json) contains at most 1 year of data (in this example, from 2016-10-22 to 2017-10-21).

You can see this more clearly if you play with the endpoints of the range.

For example, with:

https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20131023-20141023.json

first element: {"date": "2014-10-23", "count": 212730, "end": "2014-10-23"}
last element: {"date": "2013-10-24", "count": 163094, "end": "2013-10-24"}

If you move the end date forward by one day:

https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20131023-20141024.json

first element: {"date": "2014-10-24", "count": 215105, "end": "2014-10-24"}
last element: {"date": "2013-10-25", "count": 168018, "end": "2013-10-25"}

Or, if you move the start date back by one day instead:

https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-20131022-20141023.json

the result is again:

first element: {"date": "2014-10-23", "count": 212730, "end": "2014-10-23"}
last element: {"date": "2013-10-24", "count": 163094, "end": "2013-10-24"}

In other words, the returned range is anchored to the end date and reaches back at most one year. So, to get the data for the last 5 years, you could do:

import subprocess

interested_years = 5
today = "2017-10-23"
year_str, month, day = today.split("-")
date_end = year_str + month + day
url = "https://addons.mozilla.org/en-US/firefox/addon/adblock-plus/statistics/downloads-day-"

# Download one CSV per one-year window, walking backwards from today.
for year in range(1, interested_years + 1):
    date_start = str(int(year_str) - year) + month + day
    tmp_url = url + date_start + "-" + date_end + ".csv"
    cmd = ["curl", "-O", tmp_url]
    print(" ".join(cmd))
    # Requires curl on the PATH; each file is saved in the current directory.
    subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # The next window ends where this one starts.
    date_end = date_start
    print("-----------------------------")
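
The one-year windows used by the loop can also be computed with datetime.date, which avoids string slicing. This is a minimal sketch assuming the one-year cap observed in this answer still holds; yearly_windows and minus_one_year are hypothetical helpers, and February 29 is mapped to February 28 in non-leap years.

```python
from datetime import date


def minus_one_year(d):
    """Return the same calendar day one year earlier (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def yearly_windows(end, years):
    """Return (start, end) date pairs, newest first, each spanning one year."""
    windows = []
    for _ in range(years):
        start = minus_one_year(end)
        windows.append((start, end))
        end = start
    return windows


windows = yearly_windows(date(2017, 10, 23), 5)
for window_start, window_end in windows:
    # Same "YYYYMMDD-YYYYMMDD" form that appears in the statistics URLs.
    print(window_start.strftime("%Y%m%d") + "-" + window_end.strftime("%Y%m%d"))
```

Each printed range can be appended to the base URL, followed by .csv or .json.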

Thank you very much for the answer. I had somehow figured it out before I set a bounty for this question, but I also wanted to learn how to scrape data from Highcharts.
