
Can anybody help me with this? I have written code to scrape articles from a Chinese news site using Selenium. As many of the URLs do not load, I included code to catch timeout exceptions, which works, but then the browser seems to stay on the page that timed out while loading, rather than moving on to try the next URL.

I've tried adding driver.quit() and driver.close() after handling the error, but then the driver no longer works when the loop continues to the next URL.

import os
import re

from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException

# driver is an already-initialised Selenium WebDriver; results collects [url, tb_link] pairs
results = []

with open('url_list_XB.txt', 'r') as f:
    url_list = f.readlines()

for idx, url in enumerate(url_list):
    status = str(idx)+" "+str(url)
    print(status)

    try:
        driver.get(url)
        try:
            tblnks = driver.find_elements_by_class_name("post_topshare_wrap")
            for a in tblnks:
                html = a.get_attribute('innerHTML')
                try:
                    link = re.findall('href="http://comment(.+?)" title', str(html))[0]
                    tb_link = 'http://comment' + link
                    print(tb_link)
                    ID = tb_link.replace("http://comment.tie.163.com/","").replace(".html","")
                    print(ID)
                    with open('tb_links.txt', 'a') as p:
                        p.write(tb_link + '\n')
                    try:
                        text = str(driver.find_element_by_class_name("post_text").text)
                        headline = driver.find_element_by_tag_name('h1').text
                        date = driver.find_elements_by_class_name("post_time_source")
                        for a in date:
                            date = str(a.text)
                            dt = date.split(" 来源")[0]
                            dt2 = dt.replace(":", "_").replace("-", "_").replace(" ", "_")

                        count = driver.find_element_by_class_name("post_tie_top").text

                        with open('SCS_DATA/' + dt2 + '_' + ID + '_INT_' + count + '_WY.txt', 'w') as d:
                            d.write(headline)
                            d.write(text + '\n')
                        path = 'SCS_DATA/' + ID
                        os.mkdir(path)

                    except NoSuchElementException as exception:
                        print("Element not found ")
                except IndexError as g:
                    print("Index Error")


            node = [url, tb_link]
            results.append(node)

        except NoSuchElementException as exception:
            print("TB link not found ")
        continue


    except TimeoutException as ex:
        print("Page load time out")

    except WebDriverException:
        print('WD Exception')

I want the code to move through a list of URLs, calling each one and grabbing the article text as well as a link to the discussion page. It works until a page times out while loading; then the programme will not move on.
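Aside: the link/ID extraction in the middle of the loop can be exercised on its own, without a browser. A minimal sketch of that step as a function (the sample HTML below is invented purely to match the pattern the regex looks for):

```python
import re

def extract_comment_link(html):
    """Pull the comment-page URL and its ID out of a share widget's innerHTML,
    the way the inner loop above does."""
    matches = re.findall('href="http://comment(.+?)" title', html)
    if not matches:
        return None, None
    tb_link = 'http://comment' + matches[0]
    ID = tb_link.replace("http://comment.tie.163.com/", "").replace(".html", "")
    return tb_link, ID

# invented sample markup for illustration
sample = '<a href="http://comment.tie.163.com/ABC123.html" title="share">tb</a>'
print(extract_comment_link(sample))
# → ('http://comment.tie.163.com/ABC123.html', 'ABC123')
```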

  • You need to catch the timeout on the .get() call. pageLoadTimeout is very long by default in some drivers (I think it was around 20 minutes for gecko); after that it throws a timeout, but I'm not sure you caught that one on the .get(). You can set the timeout period: driver.manage().timeouts().pageLoadTimeout(20, TimeUnit.SECONDS); or you can set the driver not to wait for page load at all (or to wait only for the local DOM). See PageLoadStrategy: selenium.dev/selenium/docs/api/java/org/openqa/selenium/… (some drivers take an enum for the value and some a string). Commented Nov 7, 2019 at 23:43

1 Answer


I can't exactly understand what your code is doing because I have no context for the page you are automating, but I can provide a general structure for how you would accomplish something like this. Here's a simplified version of how I would handle your scenario:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# iterate URL list
for url in url_list:

    # navigate to a URL
    driver.get(url)

    # check something here to test if a link is 'broken' or not:
    # wait briefly for an element that only exists on a valid page
    # (the locator is just an example -- substitute one that fits your pages)
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "post_topshare_wrap"))
        )

    # if link is broken, go back
    except TimeoutException:
        driver.back()
        # continue so we can return to beginning of loop
        continue

    # if you reach this point, the link is valid, and you can 'do stuff' on the page
This code navigates to the URL and performs some check (that you specify) to see whether the link is 'broken' or not. We check for a broken link by catching the TimeoutException that gets thrown. If the exception is thrown, we navigate back to the previous page, then call continue to return to the beginning of the loop and start over with the next URL.

If we make it through the try / except block, then the URL is valid and we are on the correct page. In this place, you can write your code to scrape the articles or whatever you need to do.

The code that appears after the try / except block will ONLY be hit if TimeoutException is NOT encountered -- meaning the URL is valid.
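The same idea, applied to the asker's case where get() itself raises the timeout, can be factored into a small function so the control flow is testable without a live browser. The try/except around the import is only there so this sketch runs even where selenium isn't installed; the scrape callback is a placeholder for whatever per-page work you need.

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:  # stand-in so the sketch runs without selenium installed
    class TimeoutException(Exception):
        pass

def visit_urls(driver, urls, scrape):
    """Visit each URL in turn; on a load timeout, go back and move on to the next."""
    results = []
    for url in urls:
        try:
            driver.get(url)
        except TimeoutException:
            driver.back()   # return to the previous page
            continue        # start over with the next URL
        results.append(scrape(driver))
    return results
```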


4 Comments

Thank you very much, that's helpful. I wasn't aware of the driver.back() command, and will try including that in the handling of the exception.
driver.back() usually works, but every now and then it doesn't actually take you back to the correct page -- this line can work in its place if not: driver.execute_script("window.history.go(-1)")
I think this would only work if PageLoadStrategy is set to none and a WebDriverWait is used on the findElement call... otherwise the .get() call would wait until the page loads to readyState, or the timeout period is reached and the exception is thrown.
It kind of depends on what a "broken" URL means. Does "broken" mean no website exists, and nothing will ever load? If so, I agree with what you say about the PageLoadStrategy -- but, does "broken" just mean the website is not the web site that was expected? In that case, you'd take the findElement approach, because some page will load, just not the right one. Some more info from the original asker on how broken links behave would be helpful.
