
Bottom line up front: I want to scrape the jobs from this website: https://www.gdit.com/careers/search/?q=bossier%20city, but I keep getting the base page before the JavaScript runs. If you inspect the page, you can see the jobs are listed in h3 tags, but no matter what I do, the jobs don't pull up.

  1. I tried the following Beautiful Soup code:

import requests
from bs4 import BeautifulSoup

url = "https://www.gdit.com/careers/search/?q=bossier%20city"
html_text = requests.get(url).text  # plain GET; no JavaScript is executed
soup = BeautifulSoup(html_text, "html.parser")
print(soup)  # for testing purposes
for job in soup.find_all('h3'):
    print(job)
  2. I tried ScraperAPI, which I thought was supposed to render the JavaScript for you:

import requests

url = "https://www.gdit.com/careers/search/?q=bossier%20city"
params = {'api_key': "MY-API-KEY", 'url': url}
response = requests.get('http://api.scraperapi.com/', params=params)
print(response.text)  # no h3 tags of any kind
  3. I tried requests-html:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.gdit.com/careers/search/?q=bossier%20city")
r.html.render()  # render() returns None; the rendered HTML lives on r.html
print(r.html.html)
  4. I tried Selenium first and then parsing the result with Beautiful Soup:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common import exceptions

url = "https://www.gdit.com/careers/search/?q=bossier%20city"
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("detach", True)
options.add_experimental_option('useAutomationExtension', False)
try:
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Users\Notebook\Documents\chromedriver.exe')
    driver.get(url)
    page_source = driver.page_source  # captured immediately after get()
    soup = BeautifulSoup(page_source, "html.parser")
    time.sleep(2)
    print(soup)
except exceptions.WebDriverException:
    print("You need to download a new version of the Chromedriver.")

Nothing works. Do I have to mimic a user typing "Bossier City" into the search box first and then retrieve the response? Anyway, any help would be appreciated.

2 Comments

  • Can you please edit in the results of the code? Commented Oct 17, 2021 at 15:10
  • I can't. Stack Overflow limits posts to 30,000 characters and the full DOM is 477K lines of code. It pulls in all the HTML, JavaScript, and CSS for the whole site. Commented Oct 17, 2021 at 15:16

2 Answers


I would suggest switching from BeautifulSoup (a purely Python-based parser that only sees the static HTML you give it) to Selenium (a dynamic loader that drives real web browsers such as Chrome and Firefox).


Selenium is mainly used for automated testing of websites, but it can also be used to scrape advanced, dynamic sites.

It provides many features, from reading DOM values to adding, removing, or editing DOM elements. You can also wait for an element to come into existence, i.e. for it to appear or finish rendering.

driver.page_source only returns the base HTML that has loaded so far, not the content the page's JavaScript injects afterwards. If you just print(driver.page_source) you will see what data is actually available; adding a time.sleep(10) before reading it gives the scripts time to finish.
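Rather than a fixed sleep, Selenium's explicit waits can block until the job headings actually exist. A minimal sketch of that approach, assuming chromedriver is resolvable on your PATH and that the listings really do render as h3 tags (the 30-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.gdit.com/careers/search/?q=bossier%20city")

# Block (up to 30 seconds) until at least one h3 is present in the DOM,
# i.e. until the JavaScript has rendered the job listings.
WebDriverWait(driver, 30).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, "h3"))
)

for job in driver.find_elements(By.TAG_NAME, "h3"):
    print(job.text)

driver.quit()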


7 Comments

  • I did try Selenium and it didn't work. Please see #4 in my post.
  • @BrandonJacobson Don't use BeautifulSoup with Selenium; use one or the other, not both.
  • @BrandonJacobson driver.page_source only loads the base HTML and not the dynamically rendered JavaScript content. If you just print(driver.page_source) you will see what data is available.
  • Perhaps driver.page_source isn't fully loaded yet; when using Selenium, wait until the document-ready event has happened.
  • You're right. I put in a 10-second time.sleep(10) and it worked. Thanks!
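The "wait for the document-ready event" idea from the comments can be approximated by polling document.readyState through execute_script. A small sketch with a hypothetical helper (note that readyState reaching "complete" does not guarantee that content injected later by AJAX has arrived, which is why the explicit wait above is more reliable):

import time

def wait_for_document_ready(driver, timeout=10):
    # Hypothetical helper: poll the browser until the document reports
    # it has finished loading, or give up after `timeout` seconds.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        time.sleep(0.5)
    return False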

I think your problem is simple: as you said, this page loads its elements dynamically using JS.

Selenium simply waits for the HTML to load; it does not wait for any scripts to finish running.

In order to wait for a specific element, all you have to do is add that functionality to your code (Selenium supports this out of the box). Here's a great post explaining it. That post covers waiting for a specific element to become interactable, which is one step further than what (I'm guessing) you require.
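A hedged sketch of that pattern; the "h3 a" selector is a guess at where the job links live (it is not taken from the question or the linked post), and the 20-second timeout is arbitrary:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.gdit.com/careers/search/?q=bossier%20city")

# Wait until the first job link is not merely present but interactable
# (visible and enabled), then read every job heading.
WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "h3 a"))
)
for heading in driver.find_elements(By.TAG_NAME, "h3"):
    print(heading.text)
driver.quit()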

