
I'm crawling several URLs to find specific keywords in their source code. However, partway through the list of websites my spider suddenly stops because some sites return HTTP errors such as 404 or 503.

My crawler:

import urllib2

keyword = ['viewport']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            response = urllib2.urlopen(req)
            html_content = response.read()

            for searchstring in keyword:
                if searchstring.lower() in str(html_content).lower():
                    print (strdomain, keyword, 'found')

f.close()

What code should I add so that bad URLs returning HTTP errors are skipped and the crawler continues crawling?

  • You can call getcode() on the response object and use a conditional to check for a 200 status (a rough sketch of this idea follows after these comments). Commented Feb 20, 2017 at 23:33
  • @tobassist Can you tell me what line of code I specifically need? Commented Feb 21, 2017 at 20:18
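
To make that comment concrete, here is a rough sketch only (the fetch helper is just an illustrative name, not code from the question or answers). Note that urllib2.urlopen() raises urllib2.HTTPError for 4xx/5xx responses, so in practice the status usually has to be read from the exception's err.code rather than from response.getcode(), which only runs for successful requests:

import urllib2

def fetch(url):
    try:
        response = urllib2.urlopen(url)
        if response.getcode() == 200:
            # successful request: return the page source
            return response.read()
    except urllib2.HTTPError as err:
        # bad status such as 404 or 503: report it and move on
        print (url, 'skipped, HTTP error', err.code)
    except urllib2.URLError as err:
        # DNS failures, refused connections, etc.
        print (url, 'skipped, connection error', err.reason)
    return None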

2 Answers


You can use a try-except block, as demonstrated here. This lets you apply your normal logic to the valid URLs and different logic to the URLs that raise HTTP errors.

Applying that approach to your code gives:

import urllib2

keyword = ['viewport']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            try:
                response = urllib2.urlopen(req)
                html_content = response.read()

                for searchstring in keyword:
                    if searchstring.lower() in str(html_content).lower():
                        print (strdomain, keyword, 'found')

            except urllib2.HTTPError as err:
                # skip this URL and keep crawling; err.code holds the status (e.g. 404 or 503)
                print (strdomain, 'skipped, HTTP error', err.code)

This is the right solution for the code you've provided. However, eLRuLL makes a great point that you really should look at using scrapy for your web crawling needs.


2 Comments

Thank you! Why is scrapy so much better than my code?
@jakeT888 scrapy contains all the tools and mechanisms to deal with most of the web crawling problems. In your case, it already handles bad response statuses without breaking your web crawler.

I would recommend using the Scrapy framework for crawling purposes.
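
For a rough idea of what that could look like, here is a sketch only (it assumes Scrapy is installed, reuses the listofURLs.csv file and 'viewport' keyword from the question, and the spider name is made up). By default Scrapy's HttpError middleware filters out non-2xx responses, so a 404 or 503 just skips that URL instead of stopping the crawl:

import scrapy

class KeywordSpider(scrapy.Spider):
    name = 'keyword_spider'
    keywords = ['viewport']

    def start_requests(self):
        # read the URLs from the same CSV file used in the question
        with open('listofURLs.csv') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # only successful responses reach this callback; errors are logged and skipped
        body = response.text.lower()
        for keyword in self.keywords:
            if keyword in body:
                self.logger.info('%s %s found', response.url, keyword)

Run it with scrapy runspider yourspider.py.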

