Scraping data from JavaScript using Scrapy and Python

Question

I want to scrape data about all the movies listed on cbfcindia.

1) In the search box, if Title = "a", all movies beginning with "a" are listed (in the URL, va=a&Type=search): http://cbfcindia.gov.in/html/uniquepage.aspx?va=a&Type=search

2) The list of movies is shown in a table, and this is where JavaScript comes in: if I click on the first movie, I get to its details page. I want to scrape these details for every movie, but I am unable to do it even for a single movie.

3) My observation: the page source contains the following function:

function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}

So we need to pass the parameters the way this JavaScript does, but I have no idea how that can be done.

items.py

from scrapy.item import Item, Field

class CbfcItem(Item):
    MovieName = Field()
    MovieLanguage = Field()
    Roffice = Field()
    CertificateNo = Field()
    CertificateDate = Field()
    Length = Field()
    NameofProducer = Field()
    #pass

cbfcspider.py

import scrapy

from cbfc.items import CbfcItem

class MySpider(scrapy.Spider):
    name = 'cbfc'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["cbfcindia.gov.in"]
    start_urls = ["http://cbfcindia.gov.in/html/uniquepage.aspx?va=a&Type=search"]

    def parse(self, response):
        rows = response.xpath("//tbody")    # Check
        print(rows)
        items = []
        for row in rows:
            print("in FOR loop")
            item = CbfcItem()
            # These label ids belong to the details page, which is never reached from here
            item["MovieName"] = response.xpath('//*[@id="lblMovieName"]/text()').extract()
            item["MovieLanguage"] = response.xpath('//*[@id="lblLanguage"]').extract()
            item["Roffice"] = response.xpath('//*[@id="lblRegion"]').extract()
            item["CertificateNo"] = response.xpath('//*[@id="lblCertNo"]').extract()
            item["CertificateDate"] = response.xpath('//*[@id="Label1"]').extract()
            item["Length"] = response.xpath('//*[@id="lblCertificateLength"]').extract()
            item["NameofProducer"] = response.xpath('//*[@id="lblProducer"]').extract()
            items.append(item)
            print("this is ITEMS")
        print("End of FOR")
        return items

Pawel Miech · Accepted Answer · 2014-06-02 20:24:54Z

2

If you look deeper into the source, each link has the following markup:

<a id="DGMovie_ctl03_lnk" href="javascript:__doPostBack('DGMovie$ctl03$lnk','')">AGNI PARIKSHAYA</a>

Now you know how this JavaScript function is actually called: you have the values of the event target and the event argument. To make sure you are on the right track, you can also inspect the page with developer tools; if you are using Chrome, remember to check the "Preserve log" box. In the logged POST request, the first argument of __doPostBack from the href shows up as __EVENTTARGET.

The following XPath combined with a regular expression will give you all the postback targets:

sel.xpath("//*[contains(@id,'DGMovie')]/@href").re(r"doPostBack\('([^']+)")
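To see what that regular expression captures without touching the site, here is a minimal sketch that applies the same pattern to the sample link shown earlier, using only the standard `re` module. The names `POSTBACK_RE` and `targets` are illustrative and not part of Scrapy.

```python
import re

# Sample anchor copied from the markup quoted above; the real page has one per movie.
html = """<a id="DGMovie_ctl03_lnk" href="javascript:__doPostBack('DGMovie$ctl03$lnk','')">AGNI PARIKSHAYA</a>"""

# Capture everything between the first pair of single quotes after "doPostBack(".
POSTBACK_RE = re.compile(r"doPostBack\('([^']+)")

targets = POSTBACK_RE.findall(html)
print(targets)  # ['DGMovie$ctl03$lnk']
```

The captured string is what later goes into the `__EVENTTARGET` form field.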

You need to make a POST request with each parameter to get your information. Note that the web page uses an iframe, so you need to get the iframe source first.

pawel@stack:~/stack$ scrapy shell "http://cbfcindia.gov.in/html/uniquepage.aspx?va=a&Type=search"
In [31]: url = sel.xpath("//iframe/@src").extract()[0]

In [33]: url
Out[33]: u'searchresults.aspx?va=a&Type=search'

In [35]: from urlparse import urljoin

In [36]: url = urljoin(response.url, url) 

In [39]: from scrapy.http import Request

In [40]: req = Request(url)
In [41]: fetch(req)

# after fetching request..
In [48]: js_links = sel.xpath("//*[contains(@id,'DGMovie')]/@href").re("doPostBack\(\'([^']+)")
In [49]: param = js_links[0]

In [50]: param
Out[50]: u'DGMovie$ctl03$lnk'

In [51]: from scrapy.http import FormRequest

In [52]: fr = FormRequest.from_response(response, formdata={"__EVENTTARGET":param})

In [53]: fetch(fr)
2014-06-02 21:09:09+0100 [default] DEBUG: Redirecting (302) to <GET http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=15&Loc=Backlog> from <POST http://cbfcindia.gov.in/html/searchresults.aspx?va=a&Type=search>
2014-06-02 21:09:10+0100 [default] DEBUG: Crawled (200) <GET http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=15&Loc=Backlog> (referer: None)
In [54]: view(response)
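For readers wondering what `FormRequest.from_response` does in the session above: it pre-fills the request with the fields already present in the page's form, and the `formdata` argument overrides or adds to them. The standard-library sketch below imitates that idea on a made-up form. The form markup, the `dummy-viewstate` value, and the `HiddenFieldParser` class are assumptions for illustration only; the real page's hidden fields will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.fields[attrs["name"]] = attrs.get("value") or ""


# Hypothetical, heavily shortened form; real ASP.NET state values are long opaque strings.
sample_form = """
<form method="post" action="searchresults.aspx?va=a&amp;Type=search" id="form1">
  <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dummy-viewstate" />
</form>
"""

parser = HiddenFieldParser()
parser.feed(sample_form)

# Keep the existing hidden fields and set the event target, as __doPostBack would.
payload = dict(parser.fields)
payload["__EVENTTARGET"] = "DGMovie$ctl03$lnk"

body = urlencode(payload)
print(body)
```

This is why passing only `__EVENTTARGET` in `formdata` is enough in the session: the remaining hidden fields are carried over from the response.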

In the spider, you need to refactor your parse method so that it yields a FormRequest with a callback to parse_items, then move your parsing logic from parse to parse_items.

Don't forget about pagination; it is done with postbacks as well!

ASP.NET pages that rely on postbacks are usually among the most difficult to scrape. Read more about them if you are interested.

3 Comments

OSK Over a year ago

Where did you find "<a id="DGMovie_ctl03_lnk" href="javascript:__doPostBack('DGMovie$ctl03$lnk','')">AGNI PARIKSHAYA</a>"?

Pawel Miech Over a year ago

Just look it up in the console: open developer tools and hover over each movie.

sangharsh Over a year ago

I haven't understood anything from that cpan thing. I face a similar problem. Can you please guide me?
