Scrapy: next button uses javascript

Question

I am trying to scrape this website, http://saintbarnabas.hodesiq.com/joblist.asp?user_id=, and I want to get all the RNs listed on it. I can scrape data from the first page but cannot continue to the next page because the Next button uses JavaScript. I have read other questions about this but did not understand the answers. This is my code:

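# Import paths as used by the Scrapy releases current in 2013 (scrapy.contrib)
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = "commu"
    allowed_domains = ["saintbarnabas.hodesiq.com"]
    start_urls = ["http://saintbarnabas.hodesiq.com/joblist.asp?user_id="]
    rules = (
        # Follows any link whose URL contains a digit, anywhere on the page
        Rule(SgmlLinkExtractor(allow=(r'\d+',), restrict_xpaths=('*',)),
             callback="parse_items", follow=True),
    )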

The Next button is rendered as:

<a href="Javascript: Move('next')">Next</a>

This pagination is killing me...

If you need to scrape JavaScript or AJAX content, you can load it through Selenium WebDriver and Firefox, which opens a full-blown browser to read the pages.
– Mikko Ohtamaa, Commented Sep 15, 2013 at 9:31
How? Can you give me an idea of how to make it move to the next page?
– chano, Commented Sep 15, 2013 at 10:28

R. Max · Accepted Answer · 2013-09-15 17:04:40Z


In short, you need to figure out what Move('next') does and reproduce that in your code.

A quick inspection of the site shows that the function's code is this:

function Move(strIndicator)
{
    document.frm.move_indicator.value = strIndicator;
    document.frm.submit();
}

And the document.frm is the form with name "frm":

<form name="frm" action="joblist.asp" method="post">

So you need to build a request that performs the POST for that form with move_indicator set to 'next'. The FormRequest class (see the Scrapy docs) makes this easy:

return FormRequest.from_response(response, formname="frm", 
                                 formdata={'move_indicator': 'next'})

This technique works in most cases. The difficult part is figuring out what the JavaScript code does; sometimes it is obfuscated and performs overly complex operations just to avoid being scraped.
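For illustration, the request that FormRequest.from_response builds above can be approximated with only the standard library. This is a rough sketch of the idea, not Scrapy's actual implementation: it collects the inputs of the form named "frm", overrides move_indicator with 'next', and produces the URL, method and body of the POST. The sample HTML is an assumption; only the form tag and the move_indicator field come from the site, while the page_no field and the helper names (FormFieldCollector, build_next_page_post) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFieldCollector(HTMLParser):
    """Collect the action, method and input values of one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.in_form = False
        self.action = ""
        self.method = "get"
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self.in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self.in_form and attrs.get("name"):
            # Keep each field's current value so the POST mirrors the page state
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def build_next_page_post(page_url, html, form_name="frm"):
    """Return (url, method, body) for the POST that Move('next') would trigger."""
    collector = FormFieldCollector(form_name)
    collector.feed(html)
    fields = dict(collector.fields)
    fields["move_indicator"] = "next"  # what Move('next') sets before submit()
    url = urljoin(page_url, collector.action)
    return url, collector.method.upper(), urlencode(fields)


# Hypothetical page snippet: the <form> tag and move_indicator come from the
# site; the page_no field is invented purely for illustration.
SAMPLE_HTML = """
<form name="frm" action="joblist.asp" method="post">
  <input type="hidden" name="move_indicator" value="">
  <input type="hidden" name="page_no" value="1">
</form>
"""

url, method, body = build_next_page_post(
    "http://saintbarnabas.hodesiq.com/joblist.asp?user_id=", SAMPLE_HTML
)
print(method, url)
print(body)
```

In a spider you would not need any of this, since FormRequest.from_response does the field collection for you; the sketch only shows why no extra .extract() calls are required to follow the Next button.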


5 Comments

paul trmbrth Over a year ago

I've written a BaseSpider using your answer at c9.io/redapple/so_18810850. You're welcome to amend it.

chano Over a year ago

So where should I insert the something.select.something.extract()?

R. Max Over a year ago

@chano The link just triggers the POST of a form. FormRequest parses the form in the page, loads the form's fields automatically, and constructs a request object. You don't need to .extract() anything else for this request.

chano Over a year ago

I tried running the code and I can see that it parses the pages. But what I want to do is scrape some data.

R. Max Over a year ago

Right, use the pattern allow="JobID=\d+" in your rule extractor and remove the restrict_xpaths="*".
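To see why the narrower pattern helps, here is a small standard-library sketch comparing the two patterns. It assumes the link extractor's allow patterns are applied as a regular-expression search against each URL. The hrefs are hypothetical; only the "JobID=<digits>" shape comes from the comment above.

```python
import re

# Hypothetical hrefs of the kind a listing page might contain.
hrefs = [
    "jobdetail.asp?JobID=12345",    # a job detail link (path is assumed)
    "joblist.asp?user_id=&page=2",  # contains a digit but is not a job page
    "css/style2.css",               # contains a digit but is not a job page
    "about.asp",                    # no digits at all
]


def matching(pattern, urls):
    """Return the URLs in which the pattern is found anywhere."""
    return [u for u in urls if re.search(pattern, u)]


broad = matching(r"\d+", hrefs)          # the pattern from the question
narrow = matching(r"JobID=\d+", hrefs)   # the pattern suggested above

print("broad: ", broad)
print("narrow:", narrow)
```

With the broad pattern, every link containing a digit is sent to the callback; the narrow one keeps only links that look like job detail pages.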
