
I am developing a script to scrape well-known ecommerce websites. The code works, but it can't read the JS scripts: I only ever see up to 10 products when I know there are more than 40. I need a Python request that waits for the page to render before scraping. I don't know English; this was translated with Google Translate. I apologize.

Code I have tried:

    import requests, random
    from django.shortcuts import render
    from bs4 import BeautifulSoup
    from requests_html import HTMLSession, AsyncHTMLSession


    # Walmart - Create your views here.
    def wlista(request):

        buscarprods = request.COOKIES['buscarprod']
        url = 'https://www.walmart.com/search?q=hp+printers'  
        url = url.replace(" ", "%20")
        proveedor = request.COOKIES['proveedor'] 

        HDRS = {
           'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
           'Accept-Language': 'es-ES,es;q=0.8',
           'DNT': '1',
           'Connection': 'keep-alive',
           'Upgrade-Insecure-Requests': '1',
        }

        # This works but doesn't wait for javascript
        session_object = requests.Session()
        r = session_object.get(url, headers=HDRS).text

        # This works too, but doesn't wait for javascript
        #r = requests.get(url, headers=HDRS, timeout=(8.05, 35)).content




        #Here: I have error with "r.html.render(sleep=2)":
        # Error:
        # "There is no current event loop in thread 'Thread-1 (process_request_thread)'."

        #s  = HTMLSession()
        #r = s.get(url, headers=HDRS)
        #r.html.render(sleep=2)




        soup = BeautifulSoup(r, "html.parser")
        rows = soup.find_all(attrs={"data-item-id": True})
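The `render()` error happens because requests_html drives an asyncio event loop, and Django serves this view from a worker thread that has no loop of its own. A minimal stdlib sketch of the usual workaround (create and register a loop in the thread before calling `render()`; the worker function and the `asyncio.sleep` stand-in are illustrative, not part of requests_html):

```python
import asyncio
import threading

def render_in_thread(results):
    # Worker threads have no event loop by default, which is exactly what
    # triggers "There is no current event loop in thread ...".
    # Create and register one explicitly before any asyncio-backed call.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        # Stand-in for r.html.render(sleep=2): run a coroutine on the new loop.
        results.append(loop.run_until_complete(asyncio.sleep(0, result="rendered")))
    finally:
        loop.close()

results = []
t = threading.Thread(target=render_in_thread, args=(results,))
t.start()
t.join()
print(results[0])  # → rendered
```

In the Django view, that means calling `asyncio.set_event_loop(asyncio.new_event_loop())` before `r.html.render(sleep=2)`.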

Thanks for the help!

  • Requests fetches the HTML as initially served; none of the client-side JavaScript will run, so you'll never get the content that's loaded after the fact. You'll need something like Selenium to automate a full browser instead Commented Sep 15, 2022 at 13:55
  • I tried, but with Selenium you have to choose a browser, don't you? The goal is that users can run the search Commented Sep 15, 2022 at 14:58
  • I'll add an answer with an example Commented Sep 15, 2022 at 15:07

1 Answer


You'll need something besides requests if you want to access JavaScript-generated content. I'd suggest Selenium/Chromedriver from https://chromedriver.chromium.org/downloads

This example finds 56 printers:

import os
from bs4 import BeautifulSoup
from selenium import webdriver

dirname, scriptname = os.path.split(os.path.abspath(__file__))
THIS_DIRECTORY = f'{dirname}{os.sep}'
HEADLESS = False
DRIVER = None

def load_page(url):
    '''Load the specified page with an automated browser'''
    global DRIVER
    if DRIVER is None:
        options = webdriver.ChromeOptions()
        options.headless = HEADLESS
        options.add_experimental_option('excludeSwitches', ['enable-automation'])
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        options.add_experimental_option('useAutomationExtension', False)

        DRIVER = webdriver.Chrome(options=options, executable_path=f'{THIS_DIRECTORY}chromedriver.exe')
    DRIVER.get(url)


def main(url):
    load_page(url)
    soup = BeautifulSoup(DRIVER.page_source, 'html.parser')
    rows = soup.find_all(attrs={"data-item-id": True})
    print(len(rows))
    for row in rows:
        # Do stuff
        pass

if __name__ == '__main__':
    main('https://www.walmart.com/search?q=hp+printers')

Output:

56
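As a sanity check on the parsing step, the `data-item-id` filter used in both snippets matches any tag carrying that attribute, regardless of tag name. Here it is against a hypothetical inline fragment mimicking the product tiles:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for Walmart's rendered product grid.
html = """
<div data-item-id="111"><span>HP DeskJet</span></div>
<div class="ad">sponsored</div>
<div data-item-id="222"><span>HP Envy</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
# attrs={"data-item-id": True} keeps only tags that have the attribute at all.
rows = soup.find_all(attrs={"data-item-id": True})
print(len(rows))                            # → 2 (the ad div is skipped)
print([r["data-item-id"] for r in rows])    # → ['111', '222']
```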

2 Comments

Thanks, it looks fine, but when I tried I got these errors: "DeprecationWarning: executable_path has been deprecated, please pass in a Service object", and this: "Message: 'chromedriver.exe' executable needs to be in PATH. Please see chromedriver.chromium.org/home"
You'll need a copy of the chromedriver executable from the link at the top of my answer; save it in the same directory as your Python script.
