Python web scraping, skip url if error

Question

I'm trying to scrape one site (about 7000 links, all in a list), and because of my method, it is taking a LONG time and I guess that I'm ok with that (since that implies staying undetected). But if I do get any kind of error in trying to retrieve a page, can I just skip it?? Right now, if there's an error, the code breaks and gives me a bunch of error messages. Here's my code:

Collection is a list of lists and the resultant file. Basically, I'm trying to run a loop with get_url_data() (which I have a previous question to thank for) with all my url's in urllist. I have something called HTTPError but that doesn't seem to handle all the errors, hence this post. In a related side-quest, it would also be nice to get a list of the url's that couldn't process, but that's not my main concern (but it would be cool if someone could show me how).

Collection=[]
def get_url_data(url):

    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()

    except HTTPError:
        return None

    site = bs4.BeautifulSoup(r.text)
    groups=site.select('div.filters')
    word=url.split("/")[-1]

    B=[]
    for x in groups:
        B.append(word)
        T=[a.get_text() for a in x.select('div.blahblah [class=txt]')]
        A1=[a.get_text() for a in site.select('div.blah [class=txt]')]
        if len(T)==1 and len(A1)>0 and T[0]=='verb' and A1[0]!='as in':
            B.append(T)
            B.append([a.get_text() for a in x.select('div.blahblah [class=ttl]')])
            B.append([a.get_text() for a in x.select('div.blah [class=text]')])
            Collection.append(B)
        B=[]

for url in urllist:
    get_url_data(url)

I think the main error code was this, which triggered other ones Because there were a bunch of errors starting with During handling of the above exception, another exception occurred.

Traceback (most recent call last):
  File "C:\Python34\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 319, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

What sort of errors are you usually getting? Are they all related to web scraping or are some from something else? Where are the errors being thrown? If you just expand your try except blocks you can probably fix the issue — Dan Oberlam
– Dan Oberlam, Commented Aug 15, 2014 at 1:36
@Dannnno I'll paste the error codes, but it's sort of long. I'll post what I think was the main one. — thatandrey
– thatandrey, Commented Aug 15, 2014 at 1:58
If you look at that you see that there is only a TypeError, not an HTTPError. If you expand your except block (such as except Exception) you should be able to catch all of those (although this is generally not great practice) — Dan Oberlam
– Dan Oberlam, Commented Aug 15, 2014 at 2:02
@Dannnno Thanks, then what's the best practice? Also, I didn't write some of this code, so what does return None mean? Is it stopping everything, or just skipping? If it's skipping, can I somehow store the url that caused the skip? It also seems like if I get an HTTPError, it's still running the rest of the code even though that won't do any good? — thatandrey
– thatandrey, Commented Aug 15, 2014 at 2:07
Best practice would be explicitly catching every exception you would expect to occur so anything unexpected is still noticed (except HTTPError, TypeError). return None returns a value of None to where the function was called - the function is not evaluated past that point. If you want to get the url in question you'd have to change your return statement to something like return url or you'd have to change the logic in your for loop — Dan Oberlam
– Dan Oberlam, Commented Aug 15, 2014 at 2:34

salmanwahed · Accepted Answer · 2014-08-15 07:15:13Z

5

You can make your try-catch block look like this,

try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()

except Exception:
    return

The Exception class will handle all the errors and exception.

If you want to get the exception message you can print this in your except block. You have then instantiate exception first before raising it.

except Exception as e:
    print(e.message)
    return

edited Aug 15, 2014 at 7:15

answered Aug 15, 2014 at 5:41

salmanwahed

9,6878 gold badges37 silver badges56 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

thatandrey Over a year ago

Thanks, @Dannno pretty much said this too, but I like how you added the option to see the error message. When it prints the message, will it quit the program or will it continue?

salmanwahed Over a year ago

no it won't. it will print the error message in the console and the program will go on. you can also write it in a log file.

Sid Over a year ago

@salmanwahed is there a direct way to write it into a log file?

salmanwahed Over a year ago

Of course. Chek this: Logging to a file

Collectives™ on Stack Overflow

Python web scraping, skip url if error

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related