I am trying to scrape the data from https://www.similarweb.com/website/zalando.de/#overview using Python and Selenium. The difficult part is that the data only appears when a point on the graph is hovered over.

Here's my code.

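    import time

    import numpy as np
    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.by import By
    from webdriver_manager.chrome import ChromeDriverManager

    websites = ['https://www.similarweb.com/website/zalando.de/#overview']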

    options = webdriver.ChromeOptions()
    options.add_argument('start-maximized')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)

    browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    delays = [7, 4, 6, 2, 10, 19]
    delay = np.random.choice(delays)
    for crawler in websites:
        browser.get(crawler)
        time.sleep(2)

        time.sleep(delay)
        
        tooltip = browser.find_element(By.XPATH, "//*[local-name() = 'svg']/*[local-name()='g'][8]/*[local-name()='text']")
        ActionChains(browser).move_to_element(tooltip).perform()
        month_value = browser.find_element(By.XPATH, "//*[local-name() = 'svg']/*[local-name()='g' and @class='highcharts-tooltip']/*[local-name()='text']")
        print('Are they here?', month_value.text)
        months = browser.find_elements(By.XPATH, "//*[local-name() = 'svg']/*[local-name()='g'][6]/*/*")
        for date in months:
            print(date.text)

I can print the months data as:

Nov '20
Dec '20
Jan '21
Feb '21
Mar '21
Apr '21

But I am not able to print the value for each month: the 'Are they here?' line prints an empty string.

How do I make sure the point is hovered over first and only then scraped? Please help.

EDIT: Here's the updated code:

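from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


def website_monitoring():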
    websites = ['https://www.similarweb.com/website/zalando.de/#overview']

    options = webdriver.ChromeOptions()
    options.add_argument('start-maximized')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)

    browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)
    for crawler in websites:
        browser.get(crawler)
        wait = WebDriverWait(browser, 10)
        months = []
        monthly_values = []
        charts = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="highcharts-0"]')))
        highchart = browser.find_elements_by_xpath('//*[@id="highcharts-0"]/svg/g[4]/g[1]')
        for elements in highchart:
            hover = ActionChains(browser).move_to_element(elements)
            hover.perform()
            month = browser.find_elements_by_css_selector('#highcharts-0 > svg > g.highcharts-tooltip > text > tspan:nth-child(1)')
            month_values = browser.find_elements_by_css_selector('#highcharts-0 > svg > g.highcharts-tooltip > text > tspan:nth-child(3)')
            months.append(month[0].text)
            monthly_values.append(month_values[0].text)
        print('Months', months)
        print('Monthly Values', monthly_values)


if __name__ == "__main__":
    website_monitoring()

The output that I get is:

Months []
Monthly Values []

2 Answers

When a site displays dynamic charts, it retrieves the underlying data from its own databases or from external APIs. The server then sends this data, or makes it available (JSON, XML, plain text, CSV), to the charting framework (d3.js, Highcharts, ...). Sometimes the data is embedded in the HTML by a template engine or hard-coded in JavaScript files.

After some investigation, we see that here the data is stored in a script tag at the end of the HTML (see F12 -> Inspector). The variable that holds it is Sw.preloadedData. It seems to contain all the data used in the page's animations, including the data you are interested in.

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time
import json
import re

driver = webdriver.Firefox()
driver.get("https://www.similarweb.com/website/zalando.de")

html = driver.page_source

soup = bs(html, "html.parser")

# get all script tags and select the one of interest
balises_script = soup.find_all("script")
target_balise = [str(el) for el in balises_script if "Sw.preloadedData" in str(el)][0]

# use a regex to extract the JSON string assigned to Sw.preloadedData
# (the dot is escaped so that it only matches a literal ".")
m = re.findall(r"Sw\.preloadedData = (.+);", target_balise)[0]

# parse the JSON string into a dict
data = json.loads(m)

# explore data to find where the data of interest lives
sub_data_of_interest = data['overview']['EngagementsSimilarweb']['WeeklyTrafficNumbers']

for items in sub_data_of_interest.items():
    print(items)

# quit() ends the whole browser session, not just the current window
driver.quit()

which results in:

('2020-11-01', 29914593)
('2020-12-01', 27141507)
('2021-01-01', 26863605)
('2021-02-01', 22589520)
('2021-03-01', 24745220)
('2021-04-01', 26249414)
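
The extraction step can also be tried without a browser, on a hand-written HTML string. The sketch below is only an illustration: SAMPLE_HTML and the extract_preloaded_data helper are made up for this example, and it swaps the regex for json.JSONDecoder.raw_decode, which stops at the end of the first JSON value, so a later semicolon in the same script cannot confuse it.

```python
import json

# Hypothetical page fragment; the real page is far larger.
SAMPLE_HTML = """
<html><body>
<script>
Sw.preloadedData = {"overview": {"EngagementsSimilarweb": {"WeeklyTrafficNumbers": {"2020-11-01": 29914593, "2020-12-01": 27141507}}}}; Sw.other = 1;
</script>
</body></html>
"""

MARKER = "Sw.preloadedData = "


def extract_preloaded_data(html):
    """Return the object assigned to Sw.preloadedData in html."""
    start = html.find(MARKER)
    if start == -1:
        raise ValueError("Sw.preloadedData not found in page source")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here: "; Sw.other = 1;").
    data, _end = json.JSONDecoder().raw_decode(html, start + len(MARKER))
    return data


data = extract_preloaded_data(SAMPLE_HTML)
traffic = data["overview"]["EngagementsSimilarweb"]["WeeklyTrafficNumbers"]
for month, visits in traffic.items():
    print(month, visits)
```

With Selenium, the same helper could be fed driver.page_source directly, which would also make BeautifulSoup unnecessary for this step.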

Note 1: Selenium is often misused: it is designed to test web pages, not to retrieve data. However, it is sometimes the easier tool to use.

Note 2: I tried the classic requests + bs method, but it is more complicated: the script tag that contains the data is generated by another JavaScript that relies on a whole string of cookies.

Note 3: Be careful, the site detects requests that are likely to be non-human (too fast). Remember to put a time.sleep in your for loop if you iterate over several URLs.
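
As a rough illustration of Note 3, the loop over several URLs could pause for a random time between requests. This is a minimal sketch: crawl_politely, its fetch argument and the 2 to 10 second bounds are assumptions made up for this example, not thresholds documented by the site.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=2.0, max_delay=10.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between calls."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause so that requests are not evenly spaced.
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results
```

With Selenium, fetch could be a small function that calls driver.get(url) and returns driver.page_source. The sleep parameter exists only so that the pause can be replaced when testing.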

This is a bit tricky, but I noticed something that I think will help: the info is present in the DOM regardless of whether it is visible on the page, and there is a unique CSS selector for it ('tspan:nth-child(3)'). The catch is that it is a single element whose value changes dynamically as you move the mouse. So you need to identify the points you want to scrape values from and hover over each one in turn. Here's a quick way to print just the value I believe you want:

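for point in points_to_hover:
    # hover first so that the tooltip shows this point's value
    ActionChains(driver).move_to_element(point).perform()
    print(driver.find_element(By.CSS_SELECTOR, 'tspan:nth-child(3)').get_attribute("innerText"))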
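
Because the tooltip is filled in by JavaScript after the mouse moves, reading it immediately after the hover may return an empty string. One possible way around this is to poll until the text is non-empty. The sketch below is written against plain callables so it runs without a browser; wait_for_text, scrape_points, hover and read_text are hypothetical names introduced for this example.

```python
import time


def wait_for_text(read_text, timeout=5.0, poll=0.2):
    """Call read_text() until it returns a non-empty string or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        text = read_text()
        if text:
            return text
        if time.monotonic() >= deadline:
            raise TimeoutError("tooltip text stayed empty")
        time.sleep(poll)


def scrape_points(points, hover, read_text, timeout=5.0):
    """Hover each point in turn and collect the tooltip text shown for it."""
    values = []
    for point in points:
        hover(point)
        values.append(wait_for_text(read_text, timeout=timeout))
    return values
```

With Selenium, hover could wrap ActionChains(driver).move_to_element(point).perform() and read_text could return the innerText of the tooltip element; neither has been checked against the live page. Note that this only waits for non-empty text, so a tooltip still showing the previous point's value would be accepted as is.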

15 Comments

what exactly is 'points_to_hover' in for loop?
An array of webelements you've defined consisting of each point on which you want to hover.
oh! the tooltip.!! got it
it says element not found...it is not working
@technophile_3 can you post the code from your latest attempt, and the error message you get?