
I have a problem with identifying an exception.

I'm writing a scraper that scrapes a lot of different websites. Some errors I want to handle, and others I just want to ignore.

I catch my exceptions like this:

except Exception as e:

Most of the exceptions I can identify like this:

type(e).__name__ == "IOError"

But I have one exception, "[Errno 10054] An existing connection was forcibly closed by the remote host", that has the name "error". That is too vague, and I'm guessing other errors also have that name. I'm guessing I can somehow get the errno number from the exception and identify it that way, but I don't know how.

1 Answer


First, you should not rely on the exception's class name, but on the class itself - two classes from two different modules can have the same value for the __name__ attribute while being different exceptions. So what you want is:

try:
    something_that_may_raise()
except IOError as e:
    handle_io_error(e)
except SomeOtherError as e:
    handle_some_other_error(e)

etc...
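To illustrate why comparing names is fragile, here is a small sketch using two standard-library exceptions that happen to share a class name. The choice of binascii.Error and shutil.Error, and the classify helper, are my own illustration and not something from the question:

```python
import binascii
import shutil


def classify(exc):
    """Dispatch on the exception class itself, not on its name."""
    if isinstance(exc, binascii.Error):
        return "bad encoded data"
    if isinstance(exc, shutil.Error):
        return "file copy problem"
    return "unknown"


# Both classes report the same __name__, so a name check cannot tell them apart.
same_name = binascii.Error.__name__ == shutil.Error.__name__

# isinstance() (or a typed except clause) still distinguishes them.
first = classify(binascii.Error("odd-length string"))
second = classify(shutil.Error("could not copy"))
```

A typed except clause does the same isinstance check for you, which is why catching by class is the safer habit.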

Then you have two kinds of exceptions: the ones you can actually handle one way or another, and the rest. If the program is only for your personal use, the best way to handle "the rest" is usually not to handle them at all - the Python runtime will catch them and display a traceback with all the relevant information (so you know what happened and where, and can later add some handling for that case).

If it's a "public" program and/or you have some things to clean up before the program crashes, you can add a last "catch all" except clause at the program's top level that logs the error and traceback somewhere so they aren't lost (logging.exception is your friend), cleans up what has to be cleaned up, and terminates with a friendlier error message.

There are very few cases where one would really want to just ignore an exception (I mean pretending nothing wrong or unexpected happened and happily continuing). At the very least you will want to notify the user that one of the actions failed and why - in your case that might be a top-level loop iterating over a set of sites to scrape, with an inner try/except block catching "expected" error cases, i.e.:

# config: 
config = [
   # ('url', {params})
   ('some.site.tld', {"param1" : value1, "param2" : value2}),
   ('some.other.tld', {"param1" : value1, "answer" : 42}),
   # etc
   ]

def run():
    for url, params in config:
        try:
            results = scrap(url, **params)

        except (SomeKnownError, SomeOtherExpectedException) as e:
            # things that are to be expected and mostly harmless
            #
            # you configured your logger so that warnings only
            # go to stderr
            logger.warning("failed to scrape %s : %s - skipping", url, e)
        except (MoreSeriousError, SomethingIWannaKnowAbout) as e:
            # things that are more annoying and that you want to know
            # about, but that shouldn't prevent continuing
            # with the remaining sites
            #
            # you configured your logger so that exceptions go
            # to both stderr and your email.
            logger.exception("failed to scrape %s : %s - skipping", url, e)
        else:
            do_something_with(results)
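For the connection-reset error from the question, one option is to catch by class and then look at the exception's errno attribute. What follows is a minimal sketch, assuming Python 3.3+ (where IOError and socket.error are aliases of OSError). IGNORABLE_ERRNOS, is_ignorable and fetch_or_skip are hypothetical names, and which codes belong in the set is a judgment call:

```python
import errno

# Hypothetical set of errno codes treated as "the remote side went away".
IGNORABLE_ERRNOS = {errno.ECONNRESET, errno.ECONNABORTED, errno.ETIMEDOUT}


def is_ignorable(exc):
    """Return True for OS-level errors whose errno is in the ignorable set."""
    return isinstance(exc, OSError) and exc.errno in IGNORABLE_ERRNOS


def fetch_or_skip(fetch, url):
    """Call fetch(url); swallow ignorable OS errors, re-raise everything else."""
    try:
        return fetch(url)
    except OSError as e:
        if is_ignorable(e):
            return None  # expected and harmless: skip this site
        raise  # anything else should reach the outer handlers
```

On Windows, error 10054 is WSAECONNRESET, which I would expect to compare equal to errno.ECONNRESET there, but it is worth verifying on your platform. In Python 2 the same attribute should be available on socket.error (I believe since 2.6).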

Then have a top-level handler around the call to run() that takes care of unexpected errors:

import sys

def main(argv):
    parse_args(argv)
    try:
        set_up_everything()
        run()
        return 0
    except Exception as e:
        logger.exception("oops, something unexpected happened : %s", e)
        return 1
    finally:
        do_some_cleanup()

if __name__ == "__main__":
    sys.exit(main(sys.argv))

Note that the logging module has an SMTPHandler - but since mail can easily fail too, you'd better still keep a reliable local log (stderr, tee'd to a file?). The logging module takes some time to learn, but it really pays off in the long run.
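The logger setup assumed in the comments above could look roughly like this. It is only a sketch: build_logger, the format string and the shape of the mail dict are my own assumptions, while StreamHandler, FileHandler and SMTPHandler are the standard logging handlers. Warnings and above go to stderr and a local file, and only errors (which is the level logger.exception uses) are mailed:

```python
import logging
import logging.handlers
import sys


def build_logger(name, logfile, mail=None):
    """Log warnings to stderr and a file; optionally mail errors as well.

    `mail` is a hypothetical dict of SMTPHandler keyword arguments
    (mailhost, fromaddr, toaddrs, subject).
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    # Local, reliable destinations: stderr plus a plain log file.
    handlers = [logging.StreamHandler(sys.stderr), logging.FileHandler(logfile)]

    if mail is not None:
        # Mail only ERROR and above, so warnings never reach the inbox.
        mail_handler = logging.handlers.SMTPHandler(**mail)
        mail_handler.setLevel(logging.ERROR)
        handlers.append(mail_handler)

    for handler in handlers:
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

Calling build_logger("scraper", "scraper.log") without the mail argument gives a local-only setup, which is a reasonable starting point before wiring up email.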


1 Comment

When an exception occurs I have two cases: either I want to ignore it, or I want to send an email to myself with all the info (stack trace etc.). I have lots of exceptions I ignore: IOError, timeout, URLError, HTTPError, SSLError, TimeoutException. I would like to add this exception to that list. My program is basically a loop that scrapes around 15 webpages every 5 minutes. The errors above I just want to ignore (since a page might be down, for example), but I want to be notified about the others.
