
I'm trying to scrape a website for broken links. So far I have this code, which successfully logs in and crawls the site, but it only records HTTP 200 status codes:

import requests
import scrapy
from lxml import html
from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor

from myproject.items import HttpResponseItem  # path assumed; the Item subclass lives in the project's items module


class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    handle_httpstatus_all = True

    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value

        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item

        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r

The docs and these answers led me to believe that handle_httpstatus_all = True should cause Scrapy to pass failed requests through to my parse method, but so far I've not been able to capture any.

I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.

What do I need to change to capture the HTTP error codes Scrapy is encountering?

4 Comments
  • Please remove the allowed_domains argument; it isn't needed, and it could also be filtering your requests. Maybe that's the problem. Commented Dec 17, 2018 at 19:13
  • I removed the allowed_domains = ['localhost'] with no change in behaviour Commented Dec 17, 2018 at 19:21
  • I put the allowed_domains = ['localhost'] back in, after the spider ended up finding its way onto tripadvisor: 2018-12-17 19:29:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186364-c31-zfp5-Sheffield_South_Yorkshire_England.html> Commented Dec 17, 2018 at 19:30
  • OK, so now we are facing another problem? Please check my answer. Commented Dec 17, 2018 at 19:32

2 Answers


handle_httpstatus_list can be defined at the spider level, but handle_httpstatus_all can only be defined at the Request level, by including it in the meta argument.
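
For illustration, here is a minimal sketch of that meta-based approach applied to the spider from the question (the callback and item fields are taken from the question's code):

def parse(self, response):
    item = HttpResponseItem()
    item['url'] = response.url
    item['status'] = response.status
    item['referer'] = response.request.headers.get('Referer', '')
    yield item

    for link in self.link_extractor.extract_links(response):
        # handle_httpstatus_all is only honoured as a Request.meta key,
        # not as a spider attribute.
        yield Request(link.url, callback=self.parse,
                      meta={'handle_httpstatus_all': True,
                            'link_text': link.text})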

I would still recommend using an errback for these cases, but if everything is under control, it shouldn't create new problems.
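
As a sketch of the errback approach (based on Scrapy's documented HttpError handling; the method name here is illustrative), define a handler on the spider and attach it with errback= when building requests:

from scrapy.spidermiddlewares.httperror import HttpError

# Attach via: Request(link.url, callback=self.parse, errback=self.errback_capture)
def errback_capture(self, failure):
    # Called when a request fails, including non-2xx responses
    # rejected by HttpErrorMiddleware.
    if failure.check(HttpError):
        response = failure.value.response
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item
    else:
        # Non-HTTP failures (DNS lookup errors, timeouts, ...) carry no response.
        self.logger.error(repr(failure))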


2 Comments

Ah, very interesting. That's an easily overlooked difference, and I can now see 4xx codes being captured. I'm not sure the 5xx codes are getting captured, though. Next step is to try an errback.
Glad I helped you get the HTTP requests you needed.

So, I don't know if this is the proper Scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).

I disabled the HttpErrorMiddleware by adding this snippet to my Scrapy project's settings.py:

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}

1 Comment

Sure, I would say it is a good solution, but only for your project; I don't think this could be recommended in a project with a lot of spiders where we only need to disable it for some spiders, or even for individual requests.
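
For reference, a per-spider variant of this (a sketch using Scrapy's custom_settings class attribute) keeps HttpErrorMiddleware enabled for the rest of the project:

class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'

    # Disable HttpErrorMiddleware for this spider only; other spiders
    # in the project keep the default behaviour.
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None,
        }
    }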
