
I'm trying to scrape data from a React chart on a website using Selenium. I'm able to locate the element, but I'm not getting the data. The specific data I need from that chart resides within a nested series:

"data":[{"name":"December 2019",
            "....",
            "coverage":107.9}

within the element <script id="react_5X8YGgN8H0GoMMQ4RLqjrQ">…</script>

The final result should look like this, extracted from data.name and data.coverage:

months = [December 2019, Januari 2020, Februari 2020, etc.]
coverages = [107.9, 107.8, 107.2, etc.]

Some code so far:

import time
from selenium import webdriver

url = 'https://www.aholddelhaizepensioen.nl/over-ons/financiele-situatie/beleidsdekkingsgraad'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(4)  # wait for the page to load
driver.find_element_by_id("react_5X8YGgN8H0GoMMQ4RLqjrQ")

Solution 2

Since chitown88 pointed out that the script tag is static (i.e. no need for Selenium, as requests can do the trick), here's another solution that gets the data I need.

import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch site data
url = 'https://www.aholddelhaizepensioen.nl/over-ons/financiele-situatie/beleidsdekkingsgraad'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

# Find script
script_data = soup.find('script', attrs={'id': 'react_5X8YGgN8H0GoMMQ4RLqjrQ'})
script_to_string = str(script_data)  # cast to string for regex

# Regex
coverage_pattern = r'(?<="coverage":)\d{2,3}\.\d{1}'  # positive lookbehind: everything after "coverage": with 2 or 3 digits, a dot, and one more digit
months_pattern = r'(?<="name":")\w+\s\d{4}'  # same idea as coverage_pattern, now a word followed by four digits

# Data
coverages = re.findall(coverage_pattern, script_to_string)
months = re.findall(months_pattern, script_to_string)
frame = pd.DataFrame({'months': months, 'coverages': coverages})
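As a quick sanity check, the two patterns can be tried against a made-up snippet in the same shape as the embedded chart data (the `sample` string below is hypothetical, not the site's actual payload):

```python
import re

# Hypothetical sample mirroring the shape of the embedded series
sample = '{"data":[{"name":"December 2019","coverage":107.9},{"name":"Januari 2020","coverage":107.8}]}'

coverage_pattern = r'(?<="coverage":)\d{2,3}\.\d{1}'
months_pattern = r'(?<="name":")\w+\s\d{4}'

print(re.findall(months_pattern, sample))    # ['December 2019', 'Januari 2020']
print(re.findall(coverage_pattern, sample))  # ['107.9', '107.8']
```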
  • There are a couple of elements with the same id, so you're selecting a div. Try driver.find_element_by_xpath("//script[@id='react_5X8YGgN8H0GoMMQ4RLqjrQ']") – Commented Feb 5, 2021 at 11:27

1 Answer

Actually there's no need to use Selenium, as the data is embedded in the script tags of the static response. You just need to pull it out, manipulate the string a bit to get it into JSON format, and then read that in. After that it's just a matter of iterating through it:

import json

import requests
from bs4 import BeautifulSoup

url = 'https://www.aholddelhaizepensioen.nl/over-ons/financiele-situatie/beleidsdekkingsgraad'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the script tag that contains the chart data
scripts = soup.find_all('script')
for script in scripts:
    if 'coverage' in script.text:
        jsonStr = script.text
        break

# Drop the JavaScript prefix before the JSON payload
jsonStr = jsonStr.split('Section, ')[-1]

# Trim trailing characters until the string parses as valid JSON
while True:
    try:
        jsonData = json.loads(jsonStr + '}')
        break
    except json.JSONDecodeError:
        jsonStr = jsonStr.rsplit('}', 1)[0]

data = jsonData['data']['data']
months = []
coverages = []

for each in data:
    months.append(each['name'])
    coverages.append(each['coverage'])

Output:

print(months)
['December 2019', 'Januari 2020', 'Februari 2020', 'Maart 2020', 'April 2020', 'Mei 2020', 'Juni 2020', 'Juli 2020', 'Augustus 2020', 'September 2020', 'Oktober 2020', 'November 2020']

and

print(coverages)
[107.9, 107.8, 107.2, 106.1, 105.1, 104.3, 103.7, 103.0, 102.8, 102.3, 101.9, 101.6]
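For what it's worth, the trim-until-it-parses loop above can also be replaced by json.JSONDecoder.raw_decode, which parses the first complete JSON value in a string and ignores any trailing text. A minimal sketch, using a made-up blob in the same nested shape as the real payload:

```python
import json

# Hypothetical blob: valid JSON followed by leftover JavaScript
blob = '{"data": {"data": [{"name": "December 2019", "coverage": 107.9}]}} trailing);junk'

# raw_decode returns the parsed object plus the index where parsing stopped
obj, end = json.JSONDecoder().raw_decode(blob)
print(obj['data']['data'][0]['coverage'])  # 107.9
```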

1 Comment

I accepted your answer as the solution since it guided me in the right direction. Thank you for that. However, in my original question I added another solution, which is less extensive in code and produces the same outcome, so others can pick whichever solution they prefer if they encounter a similar problem.
