I want to scrape data about all movies from cbfcindia.
1) In the search box, entering Title = "a" lists all movies whose titles begin with "a" (the URL carries va=a&Type=search): http://cbfcindia.gov.in/html/uniquepage.aspx?va=a&Type=search
2) The movies are listed in a table, and this is where JavaScript comes in: clicking the first movie opens its details page. I want to scrape these details for every movie, but I cannot do it even for a single movie.
3) My observation: the page source contains the function below:
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
So the request has to carry the same parameters that this JavaScript function would set, but I have no idea how to do that.
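To make the observation above concrete, here is a minimal sketch of what `__doPostBack` amounts to on the wire: the form is re-submitted with its hidden fields, plus `__EVENTTARGET` and `__EVENTARGUMENT` filled in from the link that was clicked. The sample HTML, the control name `GridView1`, the argument `Select$0` and the `dummy-viewstate` value are all hypothetical stand-ins, not values taken from the cbfcindia page, and the helper names are made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# Matches href values such as: javascript:__doPostBack('some$control','Select$0')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_form(html, href):
    """Return the form fields that __doPostBack would submit for `href`."""
    match = POSTBACK_RE.search(href)
    if match is None:
        raise ValueError("not a __doPostBack link: %r" % href)
    parser = HiddenInputParser()
    parser.feed(html)
    form = dict(parser.fields)  # carries over fields such as __VIEWSTATE
    form["__EVENTTARGET"], form["__EVENTARGUMENT"] = match.groups()
    return form


# Hypothetical page fragment and link, for illustration only.
SAMPLE_HTML = """
<form method="post" action="uniquepage.aspx?va=a&amp;Type=search">
<input type="hidden" name="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" value="dummy-viewstate" />
</form>
"""
SAMPLE_HREF = "javascript:__doPostBack('GridView1','Select$0')"

form = build_postback_form(SAMPLE_HTML, SAMPLE_HREF)
body = urlencode(form)  # URL-encoded POST body for the form's action URL
```

In a Scrapy spider, a dictionary like `form` could presumably be passed as `formdata` to a `FormRequest` aimed at the same page URL, with a callback that parses the details page. Whether the site needs further hidden fields or cookies is something only the real page source can tell.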
items.py
from scrapy.item import Item, Field


class CbfcItem(Item):
    MovieName = Field()
    MovieLanguage = Field()
    Roffice = Field()
    CertificateNo = Field()
    CertificateDate = Field()
    Length = Field()
    NameofProducer = Field()
cbfcspider.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from cbfc.items import CbfcItem


class MySpider(BaseSpider):
    name = 'cbfc'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["cbfcindia.gov.in"]
    start_urls = ["http://cbfcindia.gov.in/html/uniquepage.aspx?va=a&Type=search"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//tbody")  # Check
        print titles
        items = []
        for title in titles:
            print "in FOR loop"
            item = CbfcItem()
            # HtmlXPathSelector exposes select(), not path()
            item["MovieName"] = hxs.select('//*[@id="lblMovieName"]/text()').extract()
            item["MovieLanguage"] = hxs.select('//*[@id="lblLanguage"]/text()').extract()
            item["Roffice"] = hxs.select('//*[@id="lblRegion"]/text()').extract()
            item["CertificateNo"] = hxs.select('//*[@id="lblCertNo"]/text()').extract()
            item["CertificateDate"] = hxs.select('//*[@id="Label1"]/text()').extract()
            item["Length"] = hxs.select('//*[@id="lblCertificateLength"]/text()').extract()
            item["NameofProducer"] = hxs.select('//*[@id="lblProducer"]/text()').extract()
            items.append(item)
        print "End of FOR"
        print "this is ITEMS"
        # Return only after the loop has collected every item
        return items