
I am trying to write code that can scrape the reviews (a JavaScript-generated component) on Urban Outfitters. Below is my scraping code for a specific shoe on the website. However, the downloaded page-source HTML does not contain the reviews. Does anyone know how to make Selenium download the HTML with the reviews included?

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import codecs
import os
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

br = webdriver.Safari()
br.maximize_window()
br.get('https://www.urbanoutfitters.com/shop/converse-chuck-taylor-all-star-canvas-platform-high-top-sneaker?category=SEARCHRESULTS&color=015&searchparams=q%3Dsneaker&type=REGULAR&quantity=1')
try:
    myElem = WebDriverWait(br, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'c-pwa-product-reviews__items-outer')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
n = os.path.join(os.path.sep, "Users", "jenniferzhou", "Downloads", "Page.html")
# open file in write mode with encoding (note: "utf-8" with an ASCII hyphen)
f = codecs.open(n, "w", "utf-8")
h = br.page_source
f.write(h)
f.close()
br.quit()

1 Answer


Disclaimer: respect the website, don't bombard the site with requests ;)

As the link provided by Swaroop Humane indicates, Selenium is mainly useful for testing the mechanics of a website and is not very effective for gathering data.

However, most of the time you don't need to run JavaScript at all: the page's JavaScript simply fetches data from the server, and you, as a client, can make those same requests yourself.

Without going into details, you must explore the data that passes between the client (you) and the server (F12 -> Network tab -> HTML, XHR, etc.).

So here is the code (commented) :

## import useful libraries
import requests as rq
from urllib.parse import unquote
import json

## set up initial header and initial request
headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"}
url_base = "https://www.urbanoutfitters.com/shop/converse-chuck-taylor-all-star-canvas-platform-high-top-sneaker?category=SEARCHRESULTS&color=015&searchparams=q%3Dsneaker&type=REGULAR&quantity=1&reviewPage=13"
s = rq.session()
q_base = s.get(url_base, headers=headers)

## get the cookies from the last request;
## since no request cookies were sent, the website sets them and returns a bunch of interesting elements
d = q_base.cookies.get_dict()
d2 = d['urbn_auth_payload'] # this element (a URL-encoded JSON string) contains the element of interest => "authToken"
d3 = unquote(d2)
d4 = json.loads(d3)
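To make the decoding step above concrete, here is a minimal, self-contained sketch of the same `unquote` + `json.loads` dance on a made-up cookie value (the real `urbn_auth_payload` cookie carries more fields than this):

```python
from urllib.parse import unquote
import json

# Made-up, URL-encoded cookie value standing in for 'urbn_auth_payload':
# it decodes to the JSON string '{"authToken": "abc123"}'
raw_cookie = '%7B%22authToken%22%3A%20%22abc123%22%7D'

decoded = unquote(raw_cookie)   # percent-decoding -> '{"authToken": "abc123"}'
payload = json.loads(decoded)   # parse the JSON string into a dict
token = payload["authToken"]    # the value used for the Bearer header
print(token)
```

The real cookie works the same way; only the surrounding fields differ.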

# We rebuild here a second header, like the one observed in F12 Network tab
headers_2 = {
    "Host": "www.urbanoutfitters.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "x-urbn-site-id": "uo-us",
    "x-urbn-channel": "web",
    "x-urbn-country": "FR",
    "x-urbn-currency": "USD",
    "x-urbn-language": "en-US",
    "x-urbn-experience": "ss",
    "x-urbn-primary-data-center-id": "US-PA",
    "x-urbn-geo-region": "EU-LN",
    "Connection": "keep-alive",
    "authorization": "Bearer " + d4["authToken"],
}

## the "url_target" returns the data you want (again found in F12 -> Network) ONLY when we
## pass the correct set of header arguments (=> "headers_2").
## Note that I set offset=3 & limit=100 at the end of the URL string. I tried a greater limit
## but the server returned:
## b'{"code": "ERROR_PARAM_INVALID_LIMIT", "message": "Invalid limit value: 300, limit cannot be greater than 100"}'
url_target = "https://www.urbanoutfitters.com/api/catalog/v0/uo-us/product/converse-chuck-taylor-all-star-canvas-platform-high-top-sneaker/reviews?projection-slug=reviews&offset=3&limit=100"
q_target = s.get(url_target, headers=headers_2) # return json data
data = q_target.json() # parse json

count_review = data["product"]["reviewStatistics"]["totalReviewCount"] # number of reviews
data_reviews = data["results"] # reviews list


def extract_data(data_reviews):
    # choose here which elements to extract
    res = []
    for el in data_reviews:
        res.append([el['submissionTime'], el['userNickname'], el['title'], el['reviewText']])
    return res


resultat = extract_data(data_reviews) # the data you want

"resultat" contains :

[['2021-04-11T14:13:18.000+00:00',
  'Ainsleyb',
  'Definitely recommend',
  'I love these shoes so much, you should definitely order these in your normal size. They go with everything and they make you outfit look even better. I’m also short so they definitely make me look taller.'],
 ['2021-04-10T22:27:00.000+00:00',
  'Marisol B',
  'Love them',
  'I love these shoes I just don’t know what to wear with them lol.'],
......
......

Note 1: I have not taken into account the case where there are more than 100 reviews on a product.
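One way to handle that case (a sketch of my own, not part of the answer above) is to page through the API with increasing offsets, since the server caps `limit` at 100. The helper below only computes the `(offset, limit)` pairs you would request; the actual request loop is left as comments because it needs the live session and headers from the answer:

```python
# Hypothetical pagination helper: given the totalReviewCount the API
# reports, yield the (offset, limit) pairs needed to fetch every review
# in chunks of at most `limit` (the server's maximum is 100).
def review_pages(total_count, limit=100):
    pages = []
    offset = 0
    while offset < total_count:
        pages.append((offset, min(limit, total_count - offset)))
        offset += limit
    return pages

# e.g. with 250 reviews: [(0, 100), (100, 100), (200, 50)]
#
# The fetch loop would then look roughly like:
# for offset, limit in review_pages(count_review):
#     q = s.get(f"{url_target_base}&offset={offset}&limit={limit}", headers=headers_2)
#     all_reviews.extend(q.json()["results"])
```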

Note 2: the central part of the "url_target" must be adapted to extract the reviews of other products.
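That adaptation can be wrapped in a small helper. The URL pattern below is my assumption, generalized from the single example "url_target" shown above (the product slug is the path segment from the product page URL):

```python
# Sketch: build the reviews endpoint for an arbitrary product slug,
# following the pattern of the one example URL above (pattern assumed,
# not documented by the site).
def reviews_url(slug, offset=0, limit=100):
    return (
        "https://www.urbanoutfitters.com/api/catalog/v0/uo-us/product/"
        f"{slug}/reviews?projection-slug=reviews&offset={offset}&limit={limit}"
    )

url = reviews_url("converse-chuck-taylor-all-star-canvas-platform-high-top-sneaker", offset=3)
```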
