
I am trying to crawl the reviews on several websites. For one website it runs fine; however, when I loop over many websites, it throws an error:

raise TimeoutException(message, screen, stacktrace)
TimeoutException

I tried increasing the wait time from 30 to 50 seconds, but it still fails. Here is my code:

import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from datetime import datetime

start_time = datetime.now()

result = pd.DataFrame()
df = pd.read_excel(r'D:\check_bols.xlsx')
ids = df['ids'].values.tolist() 

link = "https://www.bol.com/nl/ajax/dataLayerEndpoint.html?product_id="

for i in ids:
    
    link3 = link + str(i[-17:].replace("/",""))
    op = webdriver.ChromeOptions()
    op.add_argument('--ignore-certificate-errors')
    op.add_argument('--incognito')
    op.add_argument('--headless')
    driver = webdriver.Chrome(executable_path='D:/chromedriver.exe',options=op)
    driver.get(i)
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()

    soup = BeautifulSoup(driver.page_source, 'lxml')
    driver.quit()  # close this Chrome instance before starting the next one

    product_attributes = requests.get(link3).json()

    reviewtitle = [i.get_text() for i in soup.find_all("strong", class_="review__title") ]

    url = [i]*len(reviewtitle)

    productid = [product_attributes["dmp"]["productId"]]*len(reviewtitle)
  
    content= [i.get_text().strip()  for i in soup.find_all("div",attrs={"class":"review__body"})]
    
    author = [i.get_text() for i in soup.find_all("li",attrs={"data-test":"review-author-name"})]

    date  = [i.get_text() for i in soup.find_all("li",attrs={"data-test":"review-author-date"})]

    output = pd.DataFrame(list(zip(url, productid, reviewtitle, author, content, date)),
                          columns=["url", "product_id", "title", "author", "content", "date"])
    
    result = pd.concat([result, output])  # DataFrame.append does not modify in place; reassign the result
    
    result.to_excel(r'D:\bols.xlsx', index=False)
    
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Here are some links that I tried to crawl :

link1 link2

  • Which line errors? The timeout occurs when the WebDriverWait fails; you get it instead of a NoSuchElementException. Validate that the object exists on the link that fails. For example, your second wait is "a.review-load-more__button.js-review-load-more-button", the "load more" button. But what if there is no button on that page? What if there are no reviews, or no more reviews to load? (It will time out trying to find it.) Commented Dec 11, 2020 at 10:18
  • The line that errors is the 2nd wait, as you mentioned. How can I fix this? Some links have more than 5 reviews, so the button needs to be clicked; some do not. How should I adjust the code so it works in both situations? Commented Dec 11, 2020 at 10:29
  • 1
    Catch the timeout error with a Try and except -w3schools.com/python/python_try_except.asp - You'll also need to import that error at the start of your script. Commented Dec 11, 2020 at 10:34

2 Answers


As mentioned in the comments, you're timing out because you're looking for a button that does not exist.

You need to catch the error(s) and skip those failing lines. You can do this with a try and except.

I've put together an example for you. It's hard-coded to one URL (as I don't have your data sheet), and it uses a fixed loop whose purpose is to keep TRYING to click the "show more" button, even after it's gone.

With this solution, be careful of your sync (wait) time. EACH TIME WebDriverWait is called, it will wait the full duration if the element does not exist. You'll need to exit the expand loop when done (the first time you trip the error) and keep your sync time tight, or it will be a slow script.

First, add these to your imports:

from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException

Then this will run and not error:

# hard-coded to one url for this example:
driver.get('https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/')

#accept the cookie once
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
   
for i in range(10):
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("I pressed load more")
    except (TimeoutException, StaleElementReferenceException):
        print("No more to load - but i didn't fail")

The output to the console is this:

DevTools listening on ws://127.0.0.1:51223/devtools/browser/4b1a0033-8294-428d-802a-d0d2127c4b6f

I pressed load more

I pressed load more

No more to load - but i didn't fail

No more to load - but i didn't fail

No more to load - but i didn't fail

No more to load - but i didn't fail (and so on).
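The advice above (break out of the expand loop the first time the wait trips, rather than paying the full timeout on every remaining pass) can be illustrated without a browser. `expand_all_reviews` and `fake_click` below are hypothetical stand-ins: in the real script the callable would be the `WebDriverWait(...).until(...).click()` call with a short timeout.

```python
class TimeoutException(Exception):
    """Stand-in for selenium.common.exceptions.TimeoutException."""

def expand_all_reviews(click_load_more, max_clicks=50):
    """Click 'load more' until it times out; return how many clicks landed.

    click_load_more is any callable that raises TimeoutException once the
    button is no longer present.
    """
    clicks = 0
    for _ in range(max_clicks):
        try:
            click_load_more()
            clicks += 1
        except TimeoutException:
            break  # exit on the first miss, so we pay the wait only once
    return clicks

# Simulate a page whose button disappears after 3 successful clicks.
remaining = [3]
def fake_click():
    if remaining[0] == 0:
        raise TimeoutException("no load-more button")
    remaining[0] -= 1

print(expand_all_reviews(fake_click))  # → 3
```

With a wait of a few seconds instead of 50, the single timeout you pay at the end becomes cheap.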

This is how my browser looks. Note the size of the scroll bar for the link I used: it looks like it's got all the reviews. (screenshot)




I would suggest using an infinite while loop with a try...except block. If the element is found, it will be clicked; otherwise control passes to the except block and exits the while loop.

driver.get("https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/")
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
while True:
    try:
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("Load more button found and clicked")
    except TimeoutException:  # requires: from selenium.common.exceptions import TimeoutException
        print("No more load more button available on the page. Please exit...")
        break

Your console output will look like this:

Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
No more load more button available on the page. Please exit...
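This while-True control flow can also be checked in isolation. `click_while_present` and `fake_click` below are hypothetical stand-ins for the WebDriverWait-plus-click call in the answer:

```python
class TimeoutException(Exception):
    """Stand-in for selenium.common.exceptions.TimeoutException."""

def click_while_present(try_click):
    """Click until the element stops appearing; return the click count."""
    clicks = 0
    while True:
        try:
            try_click()  # raises TimeoutException once the button is gone
            clicks += 1
            print("Load more button found and clicked")
        except TimeoutException:
            print("No more load more button available on the page. Please exit...")
            break
    return clicks

# Simulate a page whose button survives 4 clicks.
budget = [4]
def fake_click():
    if budget[0] == 0:
        raise TimeoutException("button gone")
    budget[0] -= 1

assert click_while_present(fake_click) == 4
```

Catching the specific TimeoutException (rather than a bare except) keeps real errors, such as a crashed driver, visible instead of silently breaking the loop.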

