
I'm trying to scrape one site (about 7,000 links, all in a list), and because of my method it's taking a long time. I guess I'm OK with that, since it helps me stay undetected. But if I get any kind of error while trying to retrieve a page, can I just skip it? Right now, if there's an error, the code breaks and prints a bunch of error messages. Here's my code:

Collection is a list of lists and the resultant file. Basically, I'm running a loop with get_url_data() (which I have a previous question to thank for) over all the URLs in urllist. I catch HTTPError, but that doesn't seem to handle all the errors, hence this post. As a related side quest, it would also be nice to get a list of the URLs that couldn't be processed, but that's not my main concern (though it would be cool if someone could show me how).

import requests
import bs4
from requests.exceptions import HTTPError

Collection = []

def get_url_data(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
    except HTTPError:
        # Only HTTP errors are caught here -- anything else still crashes
        return None

    site = bs4.BeautifulSoup(r.text, "html.parser")
    groups = site.select('div.filters')
    word = url.split("/")[-1]

    B = []
    for x in groups:
        B.append(word)
        T = [a.get_text() for a in x.select('div.blahblah [class=txt]')]
        A1 = [a.get_text() for a in site.select('div.blah [class=txt]')]
        if len(T) == 1 and len(A1) > 0 and T[0] == 'verb' and A1[0] != 'as in':
            B.append(T)
            B.append([a.get_text() for a in x.select('div.blahblah [class=ttl]')])
            B.append([a.get_text() for a in x.select('div.blah [class=text]')])
            Collection.append(B)
        B = []

for url in urllist:
    get_url_data(url)

I think the main error was the one below, which triggered the others, because there were a bunch of errors starting with "During handling of the above exception, another exception occurred".

Traceback (most recent call last):
  File "C:\Python34\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 319, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'
  • What sort of errors are you usually getting? Are they all related to web scraping or are some from something else? Where are the errors being thrown? If you just expand your try except blocks you can probably fix the issue Commented Aug 15, 2014 at 1:36
  • @Dannnno I'll paste the error codes, but it's sort of long. I'll post what I think was the main one. Commented Aug 15, 2014 at 1:58
  • If you look at that you see that there is only a TypeError, not an HTTPError. If you expand your except block (such as except Exception) you should be able to catch all of those (although this is generally not great practice) Commented Aug 15, 2014 at 2:02
  • @Dannnno Thanks, then what's the best practice? Also, I didn't write some of this code, so what does return None mean? Is it stopping everything, or just skipping? If it's skipping, can I somehow store the url that caused the skip? It also seems like if I get an HTTPError, it's still running the rest of the code even though that won't do any good? Commented Aug 15, 2014 at 2:07
  • Best practice would be explicitly catching every exception you would expect to occur so anything unexpected is still noticed (e.g. except (HTTPError, TypeError)). return None returns a value of None to where the function was called - the function is not evaluated past that point. If you want to get the url in question you'd have to change your return statement to something like return url, or you'd have to change the logic in your for loop. Commented Aug 15, 2014 at 2:34
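
Putting the comments above together, here is a hedged sketch of that approach. The names `fetch`, `run`, and `failed_urls` are illustrative, not from the question; the idea is to catch `requests.exceptions.RequestException` (the base class for all errors that requests itself raises) and record the URLs that fail so they can be retried later.

```python
import requests
from requests.exceptions import RequestException

failed_urls = []  # URLs that could not be fetched, for later inspection/retry

def fetch(url):
    """Return the response body, or None if anything requests-related fails."""
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # turn 4xx/5xx responses into exceptions too
        return r.text
    except RequestException:  # covers HTTPError, timeouts, bad URLs, ...
        return None

def run(urllist):
    for url in urllist:
        if fetch(url) is None:
            failed_urls.append(url)  # remember which URL was skipped
```

This only covers errors raised by requests; a `TypeError` from elsewhere in the parsing code would still surface, which is usually what you want.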

1 Answer


You can make your try/except block look like this:

try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()

except Exception:
    return

The Exception class will catch all errors and exceptions.

If you want to see the exception message, you can print it in your except block. Bind the exception to a name first with as:

except Exception as e:
    print(e)
    return

4 Comments

Thanks, @Dannno pretty much said this too, but I like how you added the option to see the error message. When it prints the message, will it quit the program or will it continue?
No, it won't. It will print the error message in the console and the program will go on. You can also write it to a log file.
@salmanwahed is there a direct way to write it into a log file?
Of course. Check this: Logging to a file
