I'm trying to scrape data from a React chart on a website using Selenium. I can locate the element, but I can't get the data out of it. The data I need from the chart sits in a nested series:
"data":[{"name":"December 2019",
"....",
"coverage":107.9}
within the element <script id="react_5X8YGgN8H0GoMMQ4RLqjrQ">...</script>
The final result should look like this, extracted from data.name and data.coverage:
months = [December 2019, Januari 2020, Februari 2020, etc.]
coverages = [107.9, 107.8, 107.2, etc.]
Some code so far:
import time

from selenium import webdriver

url = 'https://www.aholddelhaizepensioen.nl/over-ons/financiele-situatie/beleidsdekkingsgraad'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(4)  # crude wait for the page to finish loading
element = driver.find_element_by_id("react_5X8YGgN8H0GoMMQ4RLqjrQ")  # Selenium 3 API; removed in Selenium 4.3+
Solution 2
chitown88 points out that the script tag is static, so Selenium isn't needed and requests can do the job. Here's another solution that got me the data I need.
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Fetch site data
url = 'https://www.aholddelhaizepensioen.nl/over-ons/financiele-situatie/beleidsdekkingsgraad'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
# Find script
script_data = soup.find('script', attrs={'id': 'react_5X8YGgN8H0GoMMQ4RLqjrQ'})
script_to_string = str(script_data)  # cast to string for regex
# Regex
coverage_pattern = r'(?<="coverage":)\d{2,3}\.\d'  # positive lookbehind: 2 or 3 digits, a literal dot, and one more digit right after "coverage":
months_pattern = r'(?<="name":")\w+\s\d{4}'  # same idea: a word followed by a four-digit year right after "name":"
# Data
coverages = [float(c) for c in re.findall(coverage_pattern, script_to_string)]  # convert the matched strings to numbers
months = re.findall(months_pattern, script_to_string)
frame = pd.DataFrame({'months': months, 'coverages': coverages})
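As an alternative to the regexes, and only as a sketch: if the series inside the script tag is valid JSON (which the snippet in the question suggests, but I haven't verified for the whole tag), the array after "data": can be decoded directly with the standard json module. The sample string, the ReactDOM wrapper, the hypothetical_field key, and the extract_series helper below are all made up for illustration; only the "name" and "coverage" fields and their values come from the question.

```python
import json

# Made-up stand-in for the script tag's text; only "name" and "coverage" come from the question.
sample_script = (
    'ReactDOM.hydrate(React.createElement(Chart, {"data":['
    '{"name":"December 2019","hypothetical_field":1,"coverage":107.9},'
    '{"name":"Januari 2020","hypothetical_field":2,"coverage":107.8}]}))'
)


def extract_series(script_text, key='"data":'):
    """Decode the JSON value that directly follows `key` in script_text."""
    start = script_text.index(key) + len(key)
    # raw_decode does not skip leading whitespace, so step over it by hand.
    while script_text[start].isspace():
        start += 1
    # raw_decode parses one JSON value starting at `start` and ignores whatever follows it.
    series, _end = json.JSONDecoder().raw_decode(script_text, start)
    return series


series = extract_series(sample_script)
months = [point["name"] for point in series]
coverages = [point["coverage"] for point in series]
```

With the real page, str(script_data) from the code above would take the place of sample_script. The coverages come back as floats already, so no conversion is needed, and months and coverages can't get out of step the way two separate findall calls could.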
id. You select div. Try driver.find_element_by_xpath("//script[@id='react_5X8YGgN8H0GoMMQ4RLqjrQ']")
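Locating the script element is only half of it. Selenium's .text property returns rendered text only, and a script tag renders nothing, so it comes back empty; that may be why the element is found but no data shows up. A minimal sketch of reading the tag's contents instead, where read_script_text is a hypothetical helper name and the driver is assumed to be any Selenium WebDriver:

```python
def read_script_text(driver, script_id):
    """Return the raw text inside <script id=script_id> via a Selenium-style driver."""
    # "xpath" is the string value behind Selenium's By.XPATH locator constant.
    element = driver.find_element("xpath", f"//script[@id='{script_id}']")
    # .text would be empty for a <script> tag, so read the innerHTML attribute instead.
    return element.get_attribute("innerHTML")
```

Calling read_script_text(driver, "react_5X8YGgN8H0GoMMQ4RLqjrQ") after driver.get(url) should give the same string the requests-based solution works with, ready for the regexes or a JSON parser.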