
I'm crawling several URLs to find specific keywords in their source code. However, partway through the list of websites my spider suddenly stops because some sites return HTTP errors such as 404 or 503.

My crawler:

import urllib2

keyword = ['viewport']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            response = urllib2.urlopen(req)
            html_content = response.read()

            for searchstring in keyword:
                if searchstring.lower() in str(html_content).lower():
                    print (strdomain, keyword, 'found')

f.close()

What code should I add so that bad URLs returning HTTP errors are skipped and the crawler continues crawling?

  • You can call getcode() on the response object and use a conditional to check for a 200 status (a rough sketch of this idea follows after these comments). Commented Feb 20, 2017 at 23:33
  • @tobassist Can you tell me what line of code I specifically need? Commented Feb 21, 2017 at 20:18
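
To make that comment concrete, here is a rough sketch only (the fetch helper is just an illustrative name, not code from the question or answers). Note that urllib2.urlopen() raises urllib2.HTTPError for 4xx/5xx responses, so in practice the status usually has to be read from the exception's err.code rather than from response.getcode(), which only runs for successful requests:

import urllib2

def fetch(url):
    try:
        response = urllib2.urlopen(url)
        if response.getcode() == 200:
            # successful request: return the page source
            return response.read()
    except urllib2.HTTPError as err:
        # bad status such as 404 or 503: report it and move on
        print (url, 'skipped, HTTP error', err.code)
    except urllib2.URLError as err:
        # DNS failures, refused connections, etc.
        print (url, 'skipped, connection error', err.reason)
    return None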

2 Answers


You can use a try-except block, as demonstrated here. This lets you apply your normal logic to the valid URLs and different logic to the URLs that raise HTTP errors.

Applying that approach to your code gives:

import urllib2

keyword = ['viewport']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            try:
                response = urllib2.urlopen(req)
                html_content = response.read()

                for searchstring in keyword:
                    if searchstring.lower() in str(html_content).lower():
                        print (strdomain, keyword, 'found')

            except urllib2.HTTPError as err:
                # skip this URL and keep crawling; err.code holds the status (e.g. 404 or 503)
                print (strdomain, 'skipped, HTTP error', err.code)

This is the right solution for the code you've provided. However, eLRuLL makes a great point that you really should look at using scrapy for your web crawling needs.


2 Comments

Thank you! Why is scrapy so much better than my code?
@jakeT888 scrapy contains all the tools and mechanisms to deal with most of the web crawling problems. In your case, it already handles bad response statuses without breaking your web crawler.

I would recommend using the Scrapy framework for crawling purposes.
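
For a rough idea of what that could look like, here is a sketch only (it assumes Scrapy is installed, reuses the listofURLs.csv file and 'viewport' keyword from the question, and the spider name is made up). By default Scrapy's HttpError middleware filters out non-2xx responses, so a 404 or 503 just skips that URL instead of stopping the crawl:

import scrapy

class KeywordSpider(scrapy.Spider):
    name = 'keyword_spider'
    keywords = ['viewport']

    def start_requests(self):
        # read the URLs from the same CSV file used in the question
        with open('listofURLs.csv') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # only successful responses reach this callback; errors are logged and skipped
        body = response.text.lower()
        for keyword in self.keywords:
            if keyword in body:
                self.logger.info('%s %s found', response.url, keyword)

Run it with scrapy runspider yourspider.py.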

