I am trying to scrape JavaScript-rendered content with Python, Selenium and BeautifulSoup. After a lot of reading I came up with the code below, which tries to scrape a price that is rendered by JavaScript on a well-known website:

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.nespresso.com/fr/fr/order/capsules/original/"

browser = webdriver.PhantomJS(executable_path = "C:/phantomjs-2.1.1-windows/bin/phantomjs.exe")
browser.get(url)
html = browser.page_source

soup = BeautifulSoup(html, 'lxml')

soup.find("span", {'class':'ProductListElement__price'}).text

But the only result I get is '\xa0', which is the value in the page source rather than the value rendered by JavaScript, and I don't know what I did wrong.

Best regards

2 Answers

You don't need the overhead of a browser. The data is in a script tag, so you can extract it with a regex and parse it with the json library:

import requests, re, json

r = requests.get('https://www.nespresso.com/fr/fr/order/capsules/original/')
# Capture the JSON argument of the window.ui.push(...) call that mentions ProductList
p = re.compile(r'window\.ui\.push\((.*ProductList.*)\)')
data = json.loads(p.findall(r.text)[0])
# Map each product name to its price
products = {product['name']:product['price'] for product in data['configuration']['eCommerceData']['products']}
print(products)
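
The greedy regex above depends on the exact layout of the script tag. As a possible alternative, here is a minimal sketch that lets json.JSONDecoder.raw_decode work out where the JSON argument ends. The function names extract_pushed_json and prices_by_name are made up for illustration, and the key path configuration / eCommerceData / products is simply taken from the snippet above, not checked against the live site.

```python
import json
import re


def extract_pushed_json(html, marker="ProductList"):
    """Return the first JSON value passed to window.ui.push(...) whose text mentions marker."""
    decoder = json.JSONDecoder()
    # \s* skips whitespace after the parenthesis, since raw_decode does not
    for match in re.finditer(r"window\.ui\.push\(\s*", html):
        start = match.end()
        try:
            # raw_decode parses one JSON value starting at `start` and
            # returns the index where that value ends
            obj, end = decoder.raw_decode(html, start)
        except json.JSONDecodeError:
            continue  # this push(...) argument is not plain JSON
        if marker in html[start:end]:
            return obj
    return None


def prices_by_name(data):
    """Map product names to prices, assuming the structure used in the answer above."""
    products = data["configuration"]["eCommerceData"]["products"]
    return {product["name"]: product["price"] for product in products}
```

With the response from the snippet above, usage would be prices_by_name(extract_pushed_json(r.text)), after checking that extract_pushed_json did not return None.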

2 Comments

Hello, thanks for this information. Can you explain how you found the script tag? I suspected the data was in one, but I couldn't find it by inspecting elements.
I fetched the response with jsoup, which doesn't run JavaScript, then searched that response for a product name and price taken from the rendered page.
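
The same search can be sketched in Python with only the standard library: collect the contents of every script tag in the raw (non-rendered) HTML and keep the ones that contain a string you can see on the rendered page. ScriptCollector and scripts_containing are hypothetical names, and the product name "Ristretto" in the usage note is only an assumed example.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> tag in a document."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last entry
        if self._in_script:
            self.scripts[-1] += data


def scripts_containing(html, needle):
    """Return (index, content) for each script tag whose content contains needle."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [(i, s) for i, s in enumerate(collector.scripts) if needle in s]
```

For example, scripts_containing(requests.get(url).text, "Ristretto") would list the script tags worth reading, given a name that actually appears on the page.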

Here are two ways to get the prices:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.nespresso.com/fr/fr/order/capsules/original/"

# Chrome replaces PhantomJS, which Selenium no longer supports
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source

# Getting the prices using bs4
soup = BeautifulSoup(html, 'lxml')
prices = soup.select('.ProductListElement__price')
print([p.text for p in prices])

# Getting the prices using selenium
# (Selenium 4 removed find_elements_by_class_name; use find_elements with By)
prices = browser.find_elements(By.CLASS_NAME, "ProductListElement__price")
print([p.text for p in prices])

browser.quit()
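
If the prices still come back as '\xa0' with Chrome, one possible cause (an assumption, not something confirmed in this thread) is that the elements are read before the page's JavaScript has filled them in. A minimal sketch of an explicit wait follows; has_rendered_text and wait_for_prices are hypothetical helper names, and the 10 second timeout is arbitrary.

```python
def has_rendered_text(text):
    # str.strip() also removes the non-breaking space '\xa0', so a
    # placeholder-only value counts as "not rendered yet"
    return bool(text.strip())


def wait_for_prices(browser, class_name="ProductListElement__price", timeout=10):
    """Poll until at least one price element has non-blank text, then return all texts."""
    # Imported here so has_rendered_text can be used without Selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    def rendered(driver):
        texts = [e.text for e in driver.find_elements(By.CLASS_NAME, class_name)]
        # A falsy return value tells WebDriverWait to keep polling
        return texts if any(has_rendered_text(t) for t in texts) else False

    # Raises selenium.common.exceptions.TimeoutException if nothing renders in time
    return WebDriverWait(browser, timeout).until(rendered)
```

Calling wait_for_prices(browser) after browser.get(url) would then stand in for reading the elements immediately.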

1 Comment

Oh thanks, so PhantomJS was the problem from the beginning... shame on me. Big thanks!
