I am trying to build a web service, in either Python or JavaScript, that is exposed through a REST API.

My scenario is as follows:

  1. The user clicks a button on my website.
  2. My website sends an HTTP request to the REST API.
  3. Web scraping happens on the server side (using either Python or Node). The data on the third-party website is loaded dynamically.
  4. The results are sent back to my website as JSON and shown to the user.
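The flow above can be sketched with nothing but the Python standard library. This is only an illustration under assumptions: the `/scrape` path, the `url` query parameter, and the names `build_response`, `make_handler`, `serve`, and `placeholder_scraper` are all hypothetical, and the scraper is a stub standing in for whatever headless-browser code does the real work.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def placeholder_scraper(url):
    """Hypothetical stub; a real version would drive a headless browser."""
    return {"url": url, "items": []}


def build_response(path, scraper):
    """Map a request path to (HTTP status, JSON-serializable payload)."""
    parsed = urlparse(path)
    if parsed.path != "/scrape":
        return 404, {"error": "not found"}
    target = parse_qs(parsed.query).get("url", [None])[0]
    if not target:
        return 400, {"error": "missing 'url' parameter"}
    try:
        return 200, scraper(target)
    except Exception as exc:  # report scraping failures to the caller
        return 502, {"error": str(exc)}


def make_handler(scraper):
    """Build a handler class that answers GET requests with JSON."""

    class ScrapeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, payload = build_response(self.path, scraper)
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return ScrapeHandler


def serve(port=8000, scraper=placeholder_scraper):
    """Run the service until interrupted (blocks the calling thread)."""
    HTTPServer(("127.0.0.1", port), make_handler(scraper)).serve_forever()
```

Calling `serve()` and then requesting `/scrape?url=...` from the website's JavaScript would cover steps 2 to 4. Keeping the routing in a plain function makes it easy to swap the stub for a real scraper later.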

I checked a number of Python hosting services, but I cannot tell whether they support Selenium. The same goes for the JS libraries and Node.js hosting providers.

Basically, I'm confused. What should I use to scrape dynamically loaded data in this project: Python with Selenium, or Node.js with PhantomJS and Cheerio?

1 Answer

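Neither Selenium (alone) nor Cheerio will let you load dynamically generated data from a third-party website. Selenium needs a browser to drive, and Cheerio only parses the HTML it is given without executing any JavaScript.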

The answer you're searching for is PhantomJS. It is a headless browser, so it loads the third-party page, runs its JavaScript, and lets you interact with the page from a script. For example, you can scroll down to request more data and start scraping once the new content has been added.
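As a rough sketch of the "scroll down, then scrape once new content arrives" idea: the helper below keeps scrolling until the page height stops growing. The function name `scroll_until_stable` and its parameters are hypothetical; the only assumption about `driver` is that it has an `execute_script()` method, as Selenium WebDriver instances do.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls that actually loaded new content.
    """
    loaded = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom, which typically triggers the next batch to load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render new items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, so assume everything is loaded
        last_height = new_height
        loaded += 1
    return loaded
```

After it returns, the fully loaded markup can be read from the driver (for Selenium, `driver.page_source`) and parsed. A fixed pause is the simplest option; an explicit wait on a specific element would be more robust on slow pages.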

I worked on a similar project myself. I scraped data from Facebook by interacting with the page through JavaScript until all the data I needed had loaded, then scraped it and saved it to XML files for later import into an OrientDB database. In that project I used Selenium with the PhantomJS driver, from Python. PhantomJS is scripted in JavaScript and is commonly driven from Node.js, but I chose Python because the project was expected to grow and include more data-science work.

In your case, if the scenario is just scraping the data and returning it to a remote host or client, I recommend Node + PhantomJS.

2 Comments

@Ahmad, thanks for your answer. You are right; I should have been more specific. I'm already using PhantomJS (via node-horseman) along with Cheerio in my projects. I also think scalability is my biggest source of confusion. JS should work fine for now, but why do you think a Python solution scales better? (I should think long-term as well.) As another issue, websites can easily detect the browser. With Phantom, I've seen an "unsupported browser" message (in a screenshot). Have you heard of any problems with Phantom being blocked by websites?
It's not that a Python solution scales better; it's that a JavaScript codebase can get messy very quickly. Besides, Node.js is very young and doesn't have libraries like NumPy and pandas, which is what I considered when I was developing a data-science project. And no, I haven't heard of problems with Phantom, but the site may use JavaScript features that some browsers don't support. In that case I recommend chromedriver, because almost all websites support Chrome. You will still have to deal with executing scripts inside the page, though.
