
I'm trying to scrape a website for broken links. So far I have this code, which successfully logs in and crawls the site, but it only records HTTP 200 status codes:

import requests
import scrapy
from lxml import html
from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor

from myproject.items import HttpResponseItem  # path assumed; the Item subclass lives in the project's items module


class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'
    handle_httpstatus_all = True

    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value

        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item

        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r

The docs and these answers led me to believe that handle_httpstatus_all = True should cause Scrapy to pass failed requests through to my parse method, but so far I've not been able to capture any.

I've also experimented with handle_httpstatus_list and a custom errback handler in a different iteration of the code.

What do I need to change to capture the HTTP error codes Scrapy is encountering?

4 Comments
  • Please remove the allowed_domains argument; it isn't needed, and it could also be filtering your requests. Maybe that's the problem. Commented Dec 17, 2018 at 19:13
  • I removed the allowed_domains = ['localhost'] with no change in behaviour Commented Dec 17, 2018 at 19:21
  • I put the allowed_domains = ['localhost'] back in, after the spider ended up finding its way onto tripadvisor: 2018-12-17 19:29:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186364-c31-zfp5-Sheffield_South_Yorkshire_England.html> Commented Dec 17, 2018 at 19:30
  • OK, so now we are facing another problem? Please check my answer. Commented Dec 17, 2018 at 19:32

2 Answers


handle_httpstatus_list can be defined at the spider level, but handle_httpstatus_all can only be defined at the Request level, by including it in the meta argument.
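
For illustration, here is a minimal sketch of that meta-based approach applied to the spider from the question (the callback and item fields are taken from the question's code):

def parse(self, response):
    item = HttpResponseItem()
    item['url'] = response.url
    item['status'] = response.status
    item['referer'] = response.request.headers.get('Referer', '')
    yield item

    for link in self.link_extractor.extract_links(response):
        # handle_httpstatus_all is only honoured as a Request.meta key,
        # not as a spider attribute.
        yield Request(link.url, callback=self.parse,
                      meta={'handle_httpstatus_all': True,
                            'link_text': link.text})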

I would still recommend using an errback for these cases, but if everything is under control, it shouldn't create new problems.
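
As a sketch of the errback approach (based on Scrapy's documented HttpError handling; the method name here is illustrative), define a handler on the spider and attach it with errback= when building requests:

from scrapy.spidermiddlewares.httperror import HttpError

# Attach via: Request(link.url, callback=self.parse, errback=self.errback_capture)
def errback_capture(self, failure):
    # Called when a request fails, including non-2xx responses
    # rejected by HttpErrorMiddleware.
    if failure.check(HttpError):
        response = failure.value.response
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item
    else:
        # Non-HTTP failures (DNS lookup errors, timeouts, ...) carry no response.
        self.logger.error(repr(failure))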


2 Comments

Ah, very interesting. That's an easily overlooked difference, and I can now see 4xx codes being captured. I'm not sure the 5xx codes are getting captured, though. Next step is to try an errback.
Glad I helped you get the HTTP requests you needed.

So, I don't know if this is the proper Scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).

I disabled the HttpErrorMiddleware by adding this snippet to my Scrapy project's settings.py:

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}

1 Comment

Sure, I would say it is a good solution, but only for your project; I don't think this could be recommended in a project with a lot of spiders where we only need to disable it for some spiders, or even for individual requests.
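
For reference, a per-spider variant of this (a sketch using Scrapy's custom_settings class attribute) keeps HttpErrorMiddleware enabled for the rest of the project:

class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'

    # Disable HttpErrorMiddleware for this spider only; other spiders
    # in the project keep the default behaviour.
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None,
        }
    }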
