
I've done web scraping before, but it was never this complex. I want to grab course information from a school website. However, all the course information is displayed in a way that is a web scraper's nightmare.

First off, when you click the "Schedule of Classes" URL, it redirects you through several other pages first (I believe to set cookies and run other checks).

Then it finally loads a page with an iframe that apparently only likes to load when it's loaded from within the institution's webpage (i.e., arizona.edu).

From there, the form submissions have to be made via buttons that don't actually reload the page but merely send an AJAX request, which I think just manipulates the iframe.

This request is particularly hard for me to replicate. I've been using PHP and curl to simulate a browser visiting the initial pages and gathering the proper cookies and such. But I think there's a problem with the headers my curl function is sending, because it never lets me execute any sort of query after the initial "search form" loads.
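For reference, the cookie-carrying fetch I'm trying to reproduce with curl looks roughly like this Python sketch (stdlib only; the header values are just guesses at what a browser would send, not something I've confirmed the site requires):

```python
import http.cookiejar
import urllib.request

# The cookie jar persists cookies across the redirect chain --
# the same role as curl's CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Browser-like headers; the exact values the site expects are an assumption.
opener.addheaders = [
    ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/6.0"),
    ("Referer", "http://schedule.arizona.edu/"),
]

# Opening the initial URL walks the intermediate pages and fills `jar`:
# html = opener.open("http://schedule.arizona.edu/").read()
```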

Any help would be awesome...

http://www.arizona.edu/students/registering-classes -> "Schedule of Classes"

Or just here: http://schedule.arizona.edu/

  • Site scraping smells. If they have no API to provide their data, they don't want to be scraped. Commented Sep 20, 2011 at 7:58
  • @Col.Shrapnel Well, I kind of need the data. And I doubt they set up this system to avoid being scraped. It's part of a larger system they implemented to manage the entire academic operation. Advisers use this same system to approve students for classes, etc. I don't think they're trying to prevent advisers from scraping. It's just a pre-built system that they chose to use, which is the clunkiest thing I've ever seen. I plan on developing an application to help students, and if the application ever gets any momentum I'll approach the school directly and say, hey... give me APIs so I can do this the easy way. Commented Sep 20, 2011 at 8:01

3 Answers

6

If you need to scrape a site with heavy JS / AJAX usage, you need something more powerful than PHP ;)

First, it must be a full browser with the capability to execute JS, and second, there must be some API for auto-browsing.

Assuming that you are a kid (who else would need to parse a school?), try Firefox with iMacros. If you are a more seasoned veteran, look towards Selenium.


3 Comments

Ugh, so now I actually have to dedicate a home computer to this operation? Unless I actually get a dedicated server, that is. I was hoping for some solution that could be easily implemented as part of a website cron job.
You can run Firefox/Iceweasel on a server. Check pages 70-72 in this presentation: defcon.org/images/defcon-17/dc-17-presentations/…
This is very helpful, sir, even though, as I suspected, I'll probably need a dedicated server or a virtual server to install iMacros on. This was all very helpful, because this might have to be the way I end up going. Barring someone posting a URL to some PHP code that works like a dream, this is exactly what I was looking for.
4

I used to scrape a lot of pages with JS, iframes, and all kinds of that stuff. I used PhantomJS as a headless browser, which I later wrapped with the PhantomCurl wrapper. The wrapper is a Python script that can be run from the command line or imported as a module.
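A minimal sketch of that kind of setup, assuming PhantomJS is installed and that `render.js` is a hypothetical helper script (not shown) that loads the given URL, lets its JavaScript run, and prints the final HTML:

```python
import os
import shutil
import subprocess

def fetch_rendered(url, script="render.js"):
    """Shell out to PhantomJS to fetch a JS-heavy page after rendering.

    `render.js` is a hypothetical PhantomJS script that prints the
    rendered HTML of `url` to stdout. Returns None when PhantomJS or
    the script is unavailable, so callers can fall back gracefully.
    """
    phantom = shutil.which("phantomjs")
    if phantom is None or not os.path.exists(script):
        return None
    return subprocess.check_output([phantom, script, url]).decode("utf-8")
```

The advantage over raw curl is that the page's own JavaScript (including the AJAX that fills the iframe) runs exactly as it would in a browser before you ever look at the HTML.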

Comments

0

Are you sure you are allowed to scrape the site?

If so, couldn't they just give you a simple REST API?

In the rare case where they would allow you to get to the data but would not provide an API, my advice would be to install some software to record your HTTP interaction with the website (maybe Wireshark, or some HTTP proxy), but it is important that you get all the details of the HTTP requests recorded. After you have that, analyze it, and try to replay it down to the last bit.

Among the possible hurdles: at some point the server may send you generated JavaScript that needs to be executed by the client browser in order to get to the next step. In that case you would need to figure out how to parse the received JavaScript and work out how to move on.
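For example, when the served JavaScript does nothing fancier than assign a cookie, a regex is often enough to recover the value without a real JS engine (the script below is an invented example of such server-generated code):

```python
import re

# Invented example of server-generated JS that sets a cookie client-side.
served_js = 'document.cookie = "session=abc123; path=/";'

# Pull the cookie name and value out of the assignment.
match = re.search(r'document\.cookie\s*=\s*"([^=]+)=([^;"]+)', served_js)
if match:
    cookie_name, cookie_value = match.group(1), match.group(2)
```

Anything more dynamic than simple assignments (obfuscation, computed tokens) usually pushes you back toward a real browser or headless engine.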

It would also be a good idea not to fire all your HTTP requests in burst mode; put some random delays between them so your traffic appears more "human" to the server.

But in the end, you need to figure out whether all this is worth the trouble. Almost any roadblock to scraping can be worked around, but it can get quite involved and time-consuming.

2 Comments

There is a UofA Android app that shows course statuses. Maybe I'll start there. It must have some API it uses to gather its information. Again, I highly doubt the purpose of their system is to avoid being scraped. However, I doubt they'll implement some feature just to accommodate my needs. I might be able to gain access to the information used by the UofA app. Do you know of any Windows application that can track the web queries of an Android app?
Hmm, try downloading the Android SDK and emulator, and run the app under it. Since the emulator acts as an HTTP proxy, it is very likely that it also offers logging.
