Python Web Scraping - handling page 404 errors

Question

I am scraping a website with Python, Selenium, and a headless Chrome driver. The scraping runs in a loop:

# perform loop

CustId = 2000
while CustId <= 3000:

    # Part 1: Customer REST call:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)

    soup = BeautifulSoup(driver.page_source, "lxml")

    dict_from_json = json.loads(soup.find("body").text)

    # logic for webscraping is here......

    CustId = CustId + 1

# close driver at end of everything
driver.close()

However, for some customer IDs the page does not exist. I have no control over this, and the code stops with a 404 "page not found" error. How do I ignore this and just move on with the loop?

I'm guessing I need a try...except?

KunduK · Accepted Answer · 2022-04-21 17:00:56Z

Check what text the page body (for example, its h1 tag) shows when a 404 error occurs, then test for that text in an if clause and only parse the page when it is absent.

CustId = 2000
while CustId <= 3000:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    if "Page not found" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        # logic for webscraping is here......

    CustId = CustId + 1

Or

CustId = 2000
while CustId <= 3000:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source, "lxml")
    if "404" not in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        # logic for webscraping is here......

    CustId = CustId + 1
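
Matching on a substring such as "404" can misfire when a valid customer record happens to contain that text. A minimal sketch of an alternative check, assuming the endpoint returns a JSON object for existing customers and an HTML or plain-text error page otherwise (the question's variable name `dict_from_json` suggests the former, but both are assumptions). The helper name `parse_customer` is hypothetical.

```python
import json
from typing import Optional


def parse_customer(body_text: str) -> Optional[dict]:
    """Return the decoded customer record, or None if the body is not a JSON object."""
    try:
        data = json.loads(body_text)
    except json.JSONDecodeError:
        # Assumption: an error page is not valid JSON, so parsing fails.
        return None
    # A bare "404" is valid JSON (a number), so also require an object.
    return data if isinstance(data, dict) else None


# A valid record containing "404" is still accepted.
print(parse_customer('{"id": 2404, "name": "Ann"}'))
# Error-page text and a bare status number are both rejected.
print(parse_customer("Page not found"))
print(parse_customer("404"))
```

Inside the loop, `parse_customer(soup.find("body").text)` followed by a check for `None` would take the place of the substring test.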

Ethan Turnbull · Accepted Answer · 2022-04-21 14:50:39Z


Maybe a way to do this would be to try:

try:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)

    soup = BeautifulSoup(driver.page_source, "lxml")

    dict_from_json = json.loads(soup.find("body").text)

    # logic for webscraping is here......

    CustId = CustId + 1
except Exception:
    # "except Exception" instead of a bare "except", so Ctrl+C still stops the loop
    print("404 error found, moving on")
    CustId = CustId + 1

Sorry if this doesn't work, I haven't tested it out.


1 Comment

HedgeHog Over a year ago

If you are not sure that your answer is correct, it would be better to hold it back for the moment and post it once you are reasonably sure. In addition, how can you be sure that a 404 error is what triggered the except branch? Couldn't something completely different be happening, which you wouldn't even notice without an explicit check?
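
One way to address this concern is to catch only the exception you expect and to record which IDs were skipped, so that any other failure still surfaces. A minimal sketch, assuming a missing page shows up as a body that is not valid JSON. `scrape_customers` and `fake_fetch` are hypothetical names; `fetch_body` stands in for `driver.get(...)` plus reading `driver.page_source`, so the loop can be exercised without a browser.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple


def scrape_customers(
    fetch_body: Callable[[int], str], cust_ids: Iterable[int]
) -> Tuple[Dict[int, object], List[int]]:
    """Parse each customer's body; return the parsed records and the skipped IDs."""
    records: Dict[int, object] = {}
    skipped: List[int] = []
    for cust_id in cust_ids:
        body = fetch_body(cust_id)
        try:
            records[cust_id] = json.loads(body)
        except json.JSONDecodeError as exc:
            # Only a parse failure is treated as "page missing";
            # any other exception propagates and stops the run.
            print(f"Customer {cust_id}: body is not JSON ({exc.msg}), skipping")
            skipped.append(cust_id)
    return records, skipped


def fake_fetch(cust_id: int) -> str:
    # Hypothetical stand-in for the browser: ID 2001 simulates a missing page.
    return "Page not found" if cust_id == 2001 else json.dumps({"id": cust_id})


records, skipped = scrape_customers(fake_fetch, range(2000, 2003))
print(sorted(records))
print(skipped)
```

Keeping the list of skipped IDs makes it possible to review afterwards whether each one was really a missing page.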

undetected Selenium · Accepted Answer · 2022-04-21 22:00:44Z


An ideal approach would be to use the range() function and driver.quit() at the end as follows:

for CustId in range(2000, 3001):  # the upper bound is exclusive, so 3001 includes ID 3000
    try:
        urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
        driver.get(urlg)
        if "404" not in driver.page_source:
            soup = BeautifulSoup(driver.page_source, "lxml")
            dict_from_json = json.loads(soup.find("body").text)
            # logic for webscraping is here......
    except Exception:
        continue
driver.quit()

edited Apr 21, 2022 at 22:00

answered Apr 21, 2022 at 21:46
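
Selenium's WebDriver API does not directly expose the HTTP status code of a page load, which is why the answers in this thread inspect the page text. If the endpoint can be reached without the browser session (an assumption; it may require login cookies), the status can instead be read with the standard library. A minimal sketch: `get_status` and its `opener` parameter are hypothetical names, and `fake_opener` exists only to exercise the function offline.

```python
import io
import urllib.error
import urllib.request
from email.message import Message


def get_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, including error codes such as 404."""
    try:
        with opener(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code


def fake_opener(url):
    # Hypothetical offline stand-in that always answers 404.
    raise urllib.error.HTTPError(url, 404, "Not Found", Message(), io.BytesIO())


print(get_status("https://mywebsite.com/customerRest/show/?id=2001", fake_opener))
```

In the loop, a check such as `if get_status(urlg) == 404: continue` before `driver.get(urlg)` would skip missing customers, at the cost of one extra request per ID.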

