Problem with scraping multiple pages with selenium webdriver - python

Question

I am trying to scrape a webpage and the links within it. The webpage is https://webgate.ec.europa.eu/rasff-window/screen/list . It lists 6000+ notifications, and each notification has its own link. I want to store all the links in a list. I am doing this with the following code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import time

from webdriver_manager.chrome import ChromeDriverManager


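d = webdriver.Chrome(ChromeDriverManager().install())
# open the listing page (URL taken from the question text above)
d.get("https://webgate.ec.europa.eu/rasff-window/screen/list")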

#trying this scraping for multiple pages
links = []
i = 1
elems = d.find_elements_by_xpath("//a[@href]")
for elem in elems:
    link_list = elem.get_attribute("href")
    links.append(link_list)

while True:

  print("This is the now the {} page".format(i))
  i +=1
  time.sleep(1)
  try:
    time.sleep(0.5)
    WebDriverWait(d, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='Next page']"))).click()
    print("we have clicked it once")
    time.sleep(0.9)
    
    elems2 = d.find_elements_by_xpath("//a[@href]")
    for elem2 in elems2:
        link_list = elem2.get_attribute("href")
        links.append(link_list)
    print("The button is clickable")
    time.sleep(1)
  except:
    print("The button is now not clickable, we have collected all the links")
    break

The idea is to use Selenium to collect all the href links on the page, click the next-page button, and repeat, which is what my while loop does. But when I run this code, it does not complete the entire loop. For example, with about 6400 notifications I expect it to run until the 64th page, but it stops partway through and reports that the next button is not clickable (the except branch), even though the button is in fact clickable. This happens on random pages, and I have tried changing the time.sleep values as well. Is there something wrong with what I am doing?

You should check the HTML you get in the browser. An element can be non-clickable when it is hidden by another element (like a popup message) or when it is not visible in the window (you then need to scroll the window using JavaScript or ActionChains). Watch for the moment the error occurs and see what the browser shows at that point.
– furas, Commented Aug 4, 2021 at 13:19
The biggest mistake is the bare except:. Use except Exception as ex: print(ex) instead to see what the problem really is.
– furas, Commented Aug 4, 2021 at 13:21

furas · Accepted Answer · 2021-08-04 13:54:02Z

1

I checked the exception message with

except Exception as ex:
    print(ex)

and it shows that the problem is not the button but the href.

It seems that the code sometimes grabs references to the <a> elements before JavaScript has finished updating the page. When it then tries to read href from one of those <a> elements, the error says the element no longer exists on the page, because in the meantime JavaScript removed it and inserted a new <a>.
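One way to tolerate this race is to re-locate the links and read them again whenever a read fails. The following is a minimal sketch, not code from the question: read_hrefs_with_retry, find_links, and retry_on are hypothetical names introduced here, and the helper is written against any object that exposes get_attribute("href"), so it does not import Selenium itself.

```python
import time


def read_hrefs_with_retry(find_links, retry_on=(Exception,), attempts=5, delay=0.5):
    """Return the href of every link, re-querying the page if a read fails.

    find_links: zero-argument callable returning fresh link elements,
                each exposing get_attribute("href") (hypothetical name).
    retry_on:   exception classes that signal a stale reference.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            # Re-locate the elements on every attempt so that no old
            # reference is reused after the page has been redrawn.
            return [link.get_attribute("href") for link in find_links()]
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            time.sleep(delay)
```

With Selenium, find_links could be something like lambda: d.find_elements(By.XPATH, "//a[@href]"), and retry_on could be (StaleElementReferenceException,) imported from selenium.common.exceptions, so that only the stale-element error is retried and anything else is still reported.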

Checking whether the button is clickable may also be useless, because the button exists all the time.

You should rather sleep longer before getting the <a> elements, or find a better way to detect whether you got new references or the same ones as before.
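As an alternative to a longer fixed sleep, the "detect whether you got new references" idea could be sketched as a small polling loop that compares the hrefs read after the click with those from the previous page. This is an illustration under stated assumptions, not the method used in the question: wait_for_new_links, get_hrefs, and ignore are hypothetical names, and the helper is plain Python so it can wrap whatever Selenium call reads the links.

```python
import time


def wait_for_new_links(get_hrefs, previous, timeout=10.0, poll=0.25, ignore=()):
    """Poll until the page's hrefs differ from the previous page's.

    get_hrefs: zero-argument callable returning the current list of hrefs.
    previous:  hrefs collected before clicking "Next page".
    ignore:    exception classes meaning "page is still updating".
    Returns the new list, or None if nothing changed before the timeout.
    """
    old = set(previous)
    deadline = time.monotonic() + timeout
    while True:
        try:
            current = get_hrefs()
        except ignore:
            current = None  # mid-update read failed; try again on the next poll
        if current and set(current) != old:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

A None result then means either the last page was reached or the page never refreshed in time. Note that this check only proves the list changed, not that the update has finished, so a stricter variant might require two identical consecutive reads before accepting the result.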

1 Comment

Dawn.Sahil Over a year ago

Thanks @furas. After seeing the exception, I understood what my actual problem was: the stale element issue that you rightly pointed out in the answer. Increasing the time.sleep did help me solve it. I will accept the answer.
