I am webscraping Glassdoor.com for company reviews using Python.
Currently, I am using Beautiful Soup and grequests. This works fine for all the fields I need, except for the "Advice to Management" section, which only loads once the "Continue Reading" button is pressed. See the example below for this page of reviews:
[Screenshots: "Continue Reading" button; expanded review]
There are no changes to the URL as far as I can tell, but there is a JS click-event being fired in the console:
Event: EiReviews: Click [continueReading-71858088]
I found online tutorials for Selenium WebDriver, such as this one, and wrote this code:
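from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service("C:\\chromedriver.exe"))
driver.get("https://www.glassdoor.com/Reviews/Alteryx-Reviews-E351220.htm")
# click() returns None, so keep the elements and click each one via JavaScript
buttons = driver.find_elements(By.CLASS_NAME, "v2__EIReviewDetailsV2__continueReading")
for btn in buttons:
    driver.execute_script("arguments[0].click();", btn)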
I need something that scales better, as this takes ~20 seconds to open Chrome and click a single button. I need to be able to click every "Continue Reading" button on the page, as my end goal is to scrape every review for ~1,000 companies.
In the <div id="Container"> object, there is a script object starting with window.appCache={.... which contains the complete reviews, but in a somewhat strange dictionary/JSON format. For example, it contains the text which appears when you click on Continue Reading: "summary":"Great place to work, been here 4+ years","summaryOriginal":null,"advice":"Don't rush too finish a project". Maybe you can extract everything from there.

The window.appCache dict has all the information I need.
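Reading window.appCache straight from the page source could look roughly like the sketch below, which avoids opening a browser at all. It assumes the script text contains the literal marker window.appCache= followed by valid JSON, which may not hold for the real page. The function names, the "reviews" key, and the sample HTML are hypothetical and only there for illustration; the real page HTML would come from the existing grequests response.

```python
import json

MARKER = "window.appCache="  # assumed to appear exactly like this in the page source


def extract_app_cache(html):
    """Return the object assigned to window.appCache, or None if the marker is absent."""
    start = html.find(MARKER)
    if start == -1:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (a trailing ';', the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(html, start + len(MARKER))
    return obj


def find_reviews(node):
    """Yield every dict that has an "advice" key, wherever it is nested."""
    if isinstance(node, dict):
        if "advice" in node:
            yield node
        for value in node.values():
            yield from find_reviews(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_reviews(item)


# Hypothetical stand-in for the real page; the "reviews" key is made up.
sample_html = (
    '<div id="Container"><script>window.appCache='
    '{"reviews":[{"summary":"Great place to work, been here 4+ years",'
    '"summaryOriginal":null,"advice":"Don\'t rush too finish a project"}]};'
    '</script></div>'
)

cache = extract_app_cache(sample_html)
advice = [review["advice"] for review in find_reviews(cache)]
```

Searching recursively for the "advice" key sidesteps having to know the exact nesting of the blob. If the blob turns out not to be strict JSON, json.JSONDecodeError will be raised and a different parser would be needed.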