
I want to scrape data from http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=1&Loc=Backlog. The mid parameter in the URL increments to give the 2nd, 3rd, ... URL, up to 1000 URLs. How should I handle this? (I am new to Python and Scrapy, so please excuse the basic question.)

Please also check the XPath expressions I have used to extract the information: they return no output. Is there an elementary error in the spider?

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from movie.items import MovieItem

class MySpider(BaseSpider):
    name = 'movie'
    allowed_domains= ["http://cbfcindia.gov.in/"]
    start_urls = ["http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=1&Loc=Backlog"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//body")    #Check
        print titles
        items = []
        for title in titles:
          print "in FOR loop"
          item = MovieItem()
          item ["movie_name"] = hxs.xpath('//TABLE[@id="Table2"]/TR[2]/TD[2]/text()').extract()
          print "XXXXXXXXXXXXXXXXXXXXXXXXX  movie name:", item["movie_name"]
          item ["movie_language"] = hxs.xpath('//*[@id="lblLanguage"]/text()').extract()
          item ["movie_category"] = hxs.xpath('//*[@id="lblRegion"]/text()').extract()
          item ["regional_office"] = hxs.xpath('//*[@id="lblCertNo"]/text()').extract()
          item ["certificate_no"] = hxs.xpath('//*[@id="Label1"]/text()').extract()
          item ["certificate_date"] = hxs.xpath('//*[@id="lblCertificateLength"]/text()').extract()
          item ["length"] = hxs.xpath('//*[@id="lblProducer"]/text()').extract()
          item ["producer_name"] = hxs.xpath('//*[@id="lblProducer"]/text()').extract()

          items.append(item)

          print "this is ITEMS"
        return items

Below is the log:

    {'certificate_date': [],
     'certificate_no': [],
     'length': [],
     'movie_category': [],
     'movie_language': [],
     'movie_name': [],
     'producer_name': [],
     'regional_office': []}
2014-06-11 23:20:44+0530 [movie] INFO: Closing spider (finished)
2014-06-11 23:20:44+0530 [movie] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 256,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 6638,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 6, 11, 17, 50, 44, 54000),
     'item_scraped_count': 1,
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 6, 11, 17, 50, 43, 681000)}
I can use the code below to create the list of start_urls, but I want to do it for i in range(1,1000). Will that create any issues? Also, I am still unable to scrape the data; the items are still empty.

start_urls = []
for i in range(1,10):
    url = 'cbfcindia.gov.in/html/SearchDetails.aspx?mid=' + str(i) + '&Loc=Backlog'
    start_urls.append(url)

Commented Jun 11, 2014 at 18:59
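On the range question raised in this comment, here is a minimal sketch (Python 3) of building the URL list up front. The name build_start_urls and the URL_TEMPLATE constant are made up for illustration; the URL pattern itself is copied from the question. Two details are worth noting: range(1, 1000) stops at 999, and the snippet in the comment has no http:// scheme, which Scrapy requires in request URLs.

```python
# URL pattern copied from the question. The http:// scheme is kept because
# Scrapy rejects request URLs that lack one.
URL_TEMPLATE = "http://cbfcindia.gov.in/html/SearchDetails.aspx?mid={mid}&Loc=Backlog"


def build_start_urls(first=1, last=1000):
    """Return one URL per mid value, from first to last inclusive.

    Hypothetical helper for illustration, not part of Scrapy.
    """
    # range() excludes its upper bound, so add 1 to include `last`.
    return [URL_TEMPLATE.format(mid=mid) for mid in range(first, last + 1)]


# Inside the spider class this would be assigned to the start_urls attribute.
start_urls = build_start_urls()
```

A list of 1000 short strings is small, so building it eagerly like this should not be a problem in itself; how the site reacts to 1000 requests is a separate matter.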

2 Answers


In addition to @Talvalin's answer, the correct XPath should be of the form:

item["movie_name"] = hxs.xpath("//*[@id='lblMovieName']/font/text()").extract()

For some reason, when the page loads, the <font> tag gets separated from the <span> tag (or whichever tag carries the id), so the text sits inside <font> rather than directly under the element with the id. I've tested this and it works.

Word of warning, though: the site appears to be protected against scraping. When I tried running a second scrape, it immediately threw a Runtime Error.
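To illustrate why the extra /font step matters, here is a small sketch that runs without Scrapy. It uses Python's standard xml.etree.ElementTree (which supports only a subset of XPath) in place of Scrapy's selectors, and the markup in SAMPLE_HTML is a hypothetical, simplified stand-in for the real page, so treat it as an illustration of the idea rather than a reproduction of the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is not reproduced here.
SAMPLE_HTML = (
    "<html><body>"
    "<span id='lblMovieName'><font>EXAMPLE TITLE</font></span>"
    "</body></html>"
)

root = ET.fromstring(SAMPLE_HTML)

# Text directly under the element with the id: there is none here, which
# mirrors //*[@id='lblMovieName']/text() coming back empty.
span = root.find(".//*[@id='lblMovieName']")
direct_text = span.text

# Text under the nested <font> child, mirroring
# //*[@id='lblMovieName']/font/text().
font_texts = [font.text for font in root.findall(".//*[@id='lblMovieName']/font")]
```

In this sample, the element with the id has no text of its own; the title only shows up once the path descends into <font>.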


Allowed domains should be defined without the http://. For example:

allowed_domains= ["cbfcindia.gov.in/"]

If any issues persist, then please show the full log that includes details of the pages crawled and any redirects that may have occurred.

1 Comment

Domains should also be listed without the trailing slash, i.e. "cbfcindia.gov.in".
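Taking the answer and this comment together, the value in allowed_domains should be a bare host name. As a rough sketch (Python 3), the hypothetical helper to_allowed_domain below strips the scheme and trailing slash using urllib.parse from the standard library; it is not part of Scrapy.

```python
from urllib.parse import urlparse


def to_allowed_domain(url_or_domain):
    """Reduce a URL or domain string to a bare host name.

    Hypothetical helper for illustration, not part of Scrapy.
    """
    # urlparse only fills in the network location when the string contains
    # "//", so prefix bare domains before parsing.
    if "://" not in url_or_domain:
        url_or_domain = "//" + url_or_domain
    return urlparse(url_or_domain).hostname


# The value from the question, reduced to the form the spider expects.
allowed_domains = [to_allowed_domain("http://cbfcindia.gov.in/")]
```

For a single fixed site, writing the host name by hand is simpler; the helper is only meant to show what has to be removed.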
