I am trying to scrape this website, http://saintbarnabas.hodesiq.com/joblist.asp?user_id=, and I want to get all the RNs listed on it. I can scrape data from the first page, but I cannot continue to the next page because the pagination is driven by JavaScript. I have read other questions about this, but I still don't understand how to handle it. This is my code:

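from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = "commu"
    allowed_domains = ["saintbarnabas.hodesiq.com"]
    start_urls = ["http://saintbarnabas.hodesiq.com/joblist.asp?user_id="]

    # Follow every link whose URL contains a digit, anywhere on the page.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\d+',), restrict_xpaths=('*',)),
             callback="parse_items", follow=True),
    )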

The Next button is rendered as:

<a href="Javascript: Move('next')">Next</a>

This pagination is killing me...

  • If you need to scrape JavaScript or AJAX content, you can read it through Selenium WebDriver and Firefox, which opens a full-blown browser to load the pages. Commented Sep 15, 2013 at 9:31
  • How? Can you give me an idea of how to make it move on to the next page? Commented Sep 15, 2013 at 10:28

1 Answer

In short, you need to figure out what Move('next') does and reproduce that in your code.

A quick inspection of the site shows that the function's code is this:

function Move(strIndicator)
{
    document.frm.move_indicator.value = strIndicator;
    document.frm.submit();
}

And the document.frm is the form with name "frm":

<form name="frm" action="joblist.asp" method="post">

So, basically you need to build a request to perform the POST for that form with the move_indicator value as 'next'. This is easily done by using the FormRequest class (see the docs) like:

return FormRequest.from_response(response, formname="frm", 
                                 formdata={'move_indicator': 'next'})
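To make it clearer what that request carries, here is a minimal standard-library sketch of the same idea: read the fields of the form named "frm", override move_indicator with 'next', and build the POST target and body. It is only an approximation of what a form-based request does, not Scrapy's implementation. The sample markup is an assumption: only the form tag and the move_indicator field come from the page, while the page_no field and the helper names (FormFieldParser, build_next_page_post) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFieldParser(HTMLParser):
    """Collect the action and the <input> values of one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.in_form = False
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self.in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self.in_form and attrs.get("name"):
            # Keep the value the page already put in the field.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def build_next_page_post(page_url, html):
    """Return (url, body) for the POST that Move('next') would trigger."""
    parser = FormFieldParser("frm")
    parser.feed(html)
    fields = dict(parser.fields)
    fields["move_indicator"] = "next"  # what Move('next') sets before submit()
    return urljoin(page_url, parser.action), urlencode(fields)


# Hypothetical markup: page_no is an invented field for illustration only.
SAMPLE_HTML = """
<form name="frm" action="joblist.asp" method="post">
  <input type="hidden" name="move_indicator" value="">
  <input type="hidden" name="page_no" value="1">
</form>
"""

url, body = build_next_page_post(
    "http://saintbarnabas.hodesiq.com/joblist.asp?user_id=", SAMPLE_HTML)
print(url)
print(body)
```

The point of copying the existing fields is that the server may rely on hidden inputs to know which page you are on, so only move_indicator is overridden and everything else is sent back unchanged.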

This technique works in most cases. The difficult part is figuring out what the JavaScript code does; sometimes it is obfuscated and performs overly complex steps just to avoid being scraped.

5 Comments

I've written a BaseSpider using your answer at c9.io/redapple/so_18810850 . You're welcome to amend.
So where should I insert the something.select.something.extract()?
@chano The link just triggers the POST of a form. FormRequest parses the form in the page, loads the form's fields automatically, and constructs a request object. You don't need to .extract() anything else for this request.
I tried running the code and I can see that it parses the pages. But what I want to do is scrape some data.
Right, use the pattern allow="JobID=\d+" in your rule extractor and remove the restrict_xpaths="*".
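As a rough illustration of why the narrower pattern helps, the sketch below applies both regular expressions to a few hrefs with re.search. The hrefs are hypothetical, since the thread does not show the site's actual link formats; only the two patterns come from the discussion above.

```python
import re

# Hypothetical hrefs, invented for illustration.
hrefs = [
    "jobdetail.asp?JobID=12345",
    "joblist.asp?user_id=",
    "news/2013/index.asp",
]

broad = re.compile(r"\d+")         # pattern from the question
narrow = re.compile(r"JobID=\d+")  # pattern suggested in the comment

# Keep the hrefs in which each pattern is found anywhere in the string.
broad_hits = [h for h in hrefs if broad.search(h)]
narrow_hits = [h for h in hrefs if narrow.search(h)]

print(broad_hits)
print(narrow_hits)
```

The broad pattern accepts any URL that happens to contain a digit, while the narrow one accepts only URLs carrying a JobID parameter, so the callback would run only on job detail pages.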
