I need to do server-side web scraping and navigation, including on sites that rely on JavaScript, and I need a solution that works on a hosting plan, since I don't have my own server. I came across Python with PySide/PyQt4, which would work perfectly and let me navigate sites like a headless browser. However, I don't know whether it can be installed on a remote server or shared host.
1 Answer
If you need a headless browser, you should check out PhantomJS, and in particular PyPhantomJS, the Python implementation. These might work in a shared hosting context, but it really depends on the host. See the build instructions for the different platforms; you'd likely need to ask your hosting provider to install it.
If you can get this running, you might be interested in checking out pjscrape (disclaimer: this is my project). It's a command-line tool using PhantomJS to allow scraping using JavaScript and jQuery in a full browser context.
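To find out whether your host already has PhantomJS available, a quick check from Python might look like the sketch below. It assumes Python 3.7+ and that the binary is named `phantomjs` on the `PATH`; `scrape.js` is a hypothetical script name, and the helper functions are my own illustration, not part of PhantomJS or pjscrape.

```python
import shutil
import subprocess


def find_phantomjs(binary="phantomjs"):
    """Return the full path to the binary, or None if it is not on PATH."""
    return shutil.which(binary)


def build_command(binary_path, script, url):
    """Build the argument list: phantomjs <script> <args...>."""
    return [binary_path, script, url]


def run_script(script, url, binary="phantomjs", timeout=60):
    """Run a PhantomJS script against a URL and return whatever it prints.

    Raises RuntimeError if the host does not have the binary installed.
    """
    path = find_phantomjs(binary)
    if path is None:
        raise RuntimeError(
            f"{binary!r} not found on PATH; ask the hosting provider to install it"
        )
    result = subprocess.run(
        build_command(path, script, url),
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        timeout=timeout,      # don't let a hung page block forever
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    # "scrape.js" is a placeholder for your own PhantomJS script.
    print(find_phantomjs() or "PhantomJS is not installed on this host")
```

If `find_phantomjs()` returns `None`, nothing you upload as plain Python will change that; the binary itself has to be installed on the host.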
4 Comments
James
Do you know of any solutions implemented in Python, Ruby, or PHP? Something I could just upload to my hosting space?
James
Also, I think HtmlUnit would probably do the job well. It's written in Java, though. Do you know of any web hosts with Java support?
James
Also, how does your pjscrape work client-side if the same-origin policy prevents JavaScript on one domain from accessing data on another?
nrabinowitz
1) PyPhantomJS is implemented in Python, as I noted in my answer. It depends on WebKit, though, so I don't know if installation would be as simple as uploading it. 2) pjscrape runs through PhantomJS, so it's not really "client-side": it injects JS into the current page context.