Scraping data from multiple URLs

Question

I want to scrape data from http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=1&Loc=Backlog. The mid parameter in the URL is incremental: it increases by one to give the 2nd, 3rd, ... URL, up to 1000 URLs. How should I handle this? (I am new to Python and Scrapy, so please excuse the basic question.)

Please also check the XPath expressions I have used to extract the information. They return no output. Is there an elementary error in the spider?

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from movie.items import MovieItem

class MySpider(BaseSpider):
    name = 'movie'
    allowed_domains= ["http://cbfcindia.gov.in/"]
    start_urls = ["http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=1&Loc=Backlog"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//body")    #Check
        print titles
        items = []
        for titles in titles:
          print "in FOR loop"
          item = MovieItem()
          item ["movie_name"] = hxs.xpath('//TABLE[@id="Table2"]/TR[2]/TD[2]/text()').extract()
          print "XXXXXXXXXXXXXXXXXXXXXXXXX  movie name:", item["movie_name"]
          item ["movie_language"] = hxs.xpath('//*[@id="lblLanguage"]/text()').extract()
          item ["movie_category"] = hxs.xpath('//*[@id="lblRegion"]/text()').extract()
          item ["regional_office"] = hxs.xpath('//*[@id="lblCertNo"]/text()').extract()
          item ["certificate_no"] = hxs.xpath('//*[@id="Label1"]/text()').extract()
          item ["certificate_date"] = hxs.xpath('//*[@id="lblCertificateLength"]/text()').extract()
          item ["length"] = hxs.xpath('//*[@id="lblProducer"]/text()').extract()
          item ["producer_name"] = hxs.xpath('//*[@id="lblProducer"]/text()').extract()

          items.append(item)

          print "this is ITEMS"
        return items

Below is the log:

    {'certificate_date': [],
     'certificate_no': [],
     'length': [],
     'movie_category': [],
     'movie_language': [],
     'movie_name': [],
     'producer_name': [],
     'regional_office': []}
2014-06-11 23:20:44+0530 [movie] INFO: Closing spider (finished)
2014-06-11 23:20:44+0530 [movie] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 256,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 6638,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 6, 11, 17, 50, 44, 54000),
     'item_scraped_count': 1,
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 6, 11, 17, 50, 43, 681000)}

I can use the code below to build the list of start_urls, but I want to do it for i in range(1,1000). Will that create any issues? Also, I am still unable to scrape the data; the items are still empty.

    start_urls = []
    for i in range(1,10):
        url = 'cbfcindia.gov.in/html/SearchDetails.aspx?mid=' + str(i) + '&Loc=Backlog'
        start_urls.append(url)

– OSK, Commented Jun 11, 2014 at 18:59
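Building the URL list in a loop is a reasonable approach. Two details are worth checking: range(1, 1000) stops at 999, and each URL needs its http:// scheme for Scrapy to request it. A minimal sketch, assuming the pages really are numbered mid=1 through mid=1000 (BASE_URL and build_start_urls are illustrative names, not part of the original spider):

```python
BASE_URL = "http://cbfcindia.gov.in/html/SearchDetails.aspx"


def build_start_urls(first_mid=1, last_mid=1000):
    """Return one search-details URL per mid value, including last_mid."""
    # range() excludes its upper bound, so add 1 to include last_mid.
    return [
        "{0}?mid={1}&Loc=Backlog".format(BASE_URL, mid)
        for mid in range(first_mid, last_mid + 1)
    ]


# In the spider this list would be assigned to the start_urls attribute.
start_urls = build_start_urls()
```

The resulting list can be assigned to the spider's start_urls attribute in place of the single hard-coded URL.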

WGS · Accepted Answer · 2014-06-12 15:36:27Z

2

In addition to @Talvalin's answer, the correct XPath should be of the form:

item["movie_name"] = hxs.xpath("//*[@id='lblMovieName']/font/text()").extract()

For some reason, when the page loads, the <font> tag gets separated from the <span> tag (or whatever tag the id is on), so the text has to be read from the <font> element. I've tested this and it works.

Word of warning, though: the site is pretty much protected from scraping. I've tried running a second scrape and it immediately threw a Runtime Error.
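To illustrate why the extra /font step can matter, here is a small sketch using only the standard library. The markup in SAMPLE is a hypothetical simplification (the real page may be structured differently); it assumes the title text sits inside a <font> element nested under the element that carries the id:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not copied from the real page.
SAMPLE = (
    "<html><body>"
    "<span id='lblMovieName'><font>Example Title</font></span>"
    "</body></html>"
)

root = ET.fromstring(SAMPLE)
span = root.find(".//*[@id='lblMovieName']")

# The span has no text of its own, so reading it directly yields nothing.
direct_text = span.text

# The text lives one level down, inside the nested <font> element.
nested_text = span.find("font").text
```

Under that assumption, an XPath ending in the id match followed by /text() finds nothing, while one that descends into font first reaches the title.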


Talvalin · Accepted Answer · 2014-06-12 14:02:11Z

1

Allowed domains should be defined without the http://. For example:

allowed_domains= ["cbfcindia.gov.in/"]

If any issues persist, then please show the full log that includes details of the pages crawled and any redirects that may have occurred.


Domains should be written without the trailing slash.
– warvariuc, Commented over a year ago
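One way to avoid both the scheme and the trailing slash is to derive the bare host name from the full URL instead of typing it by hand. A minimal sketch using the standard library (to_allowed_domain is an illustrative helper name, not a Scrapy function):

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def to_allowed_domain(url):
    """Return only the host name of a full URL (no scheme, path, or slash)."""
    return urlparse(url).hostname


# In the spider this list would be assigned to allowed_domains.
allowed_domains = [to_allowed_domain("http://cbfcindia.gov.in/")]
```

The value produced here is the plain domain, which is the form allowed_domains expects.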
