
I have been struggling with this for the past two days. I need to scrape data from this website for all "Cadres" (categories). Unfortunately, the site exposes this data only through a "Select Cadre" dropdown menu, which has no "All Categories" option. To work around this, I am submitting the form with Scrapy's FormRequest.from_response method, but the spider produces an empty output file. Any help is appreciated. Here's the code:

import scrapy

class IASWinnerSpider(scrapy.Spider):

    name = 'iaswinner_list'
    allowed_domains = ['http://civillist.ias.nic.in']

    def start_requests(self):
        urls = [ 'http://civillist.ias.nic.in/UpdateCL/DraftCL.asp' ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        return scrapy.FormRequest.from_response(response, method='POST',
                    formdata={'cboCadre': 'UT'}, dont_click=True, callback=self.after_post)

    def after_post(self, response):

        table      = response.xpath('/html/body/div/table//tr')

        for t in table:

            yield {
                'serial': t.xpath('td[1]/text()').extract(),
                'name': t.xpath('td[2]/text()').extract(),
                'qual': t.xpath('td[3]/text()').extract(),
                'dob': t.xpath('td[4]/text()').extract(),
                'post': t.xpath('td[5]/text()').extract(),
                'rem': t.xpath('td[6]/text()').extract(),
            }
  • The code given is not yet "complete" (cf. minimal reproducible example). It might be kindly advised to add a __main__ section exhibiting the issue. Commented Aug 19, 2017 at 13:28
  • If Linhart's answer does what you needed please don't forget to mark it 'accepted'. Commented Aug 19, 2017 at 16:12
  • Yes, done it. Thanks. Commented Aug 20, 2017 at 4:19

1 Answer


When I run your code, I see this in the log:

2017-08-19 15:52:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'civillist.ias.nic.in': <POST http://civillist.ias.nic.in/UpdateCL/DraftCL.asp>

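The entries in allowed_domains must be bare domain names, not URLs. With the http:// prefix included, the host of the request never matches the entry, so the offsite middleware drops the POST before it is sent. Change allowed_domains to this: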

allowed_domains = ['civillist.ias.nic.in']

and it works.
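
To illustrate why the original value gets filtered, here is a simplified sketch of a host-against-allowed-domains check. It is not Scrapy's actual implementation: the helper name is_onsite is hypothetical, and the sketch only assumes that a request counts as on-site when its host equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one (simplified offsite check)."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


url = "http://civillist.ias.nic.in/UpdateCL/DraftCL.asp"

# Entry written as a URL: the host "civillist.ias.nic.in" can never
# equal "http://civillist.ias.nic.in", so the request looks offsite.
print(is_onsite(url, ["http://civillist.ias.nic.in"]))  # False

# Entry written as a bare domain name: the host matches.
print(is_onsite(url, ["civillist.ias.nic.in"]))  # True
```

The comparison is made on the host alone, so any scheme or path in an allowed_domains entry prevents a match.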
