
I'm issuing a cURL GET request to a webpage to download its HTML. The page displays content on scroll, like a Facebook timeline: it initially loads only some content and then loads more incrementally as the user scrolls.

I'm wondering whether I can leverage JS, using something like window.scroll(0, document.height) together with the cURL GET request, to specify a height the page should scroll to. I know the height I need to scroll to in order to get the HTML I need on every page of the site.

My cURL request looks like the following:

curl -X GET 'https://www.mywebsite.com/username/photos' --verbose --user-agent "$USER_AGENT" --cookie "$COOKIES" --cookie-jar "$COOKIES"
  • AFAIK cURL has no rendering engine, so things like "height" make no sense. Commented Apr 4, 2014 at 18:00
  • JavaScript events run on the client. Curl runs on the server. So, uh, no. Commented Apr 4, 2014 at 18:00
  • I don't think you can. cURL only gets the file content and is not a browser; what you need is a headless browser. Commented Apr 4, 2014 at 18:00
  • Yes, that was my gut feeling as well. Any other ideas on any neat tricks which can be played? Commented Apr 4, 2014 at 18:01
  • @adeneo you mean something like PhantomJS? Commented Apr 4, 2014 at 18:02

2 Answers


Not with client-side rendering, no: cURL does not execute JavaScript. But if you can change the landing page, you can add a parameter that makes it render enough content up front, so that the content is available to your cURL request. For example, calling https://www.mywebsite.com/username/photos?curl=1 would prerender the portion of the page that you need to scrape.

If you don't control the landing page, you can replicate the AJAX calls that scrolling triggers and rebuild the HTML structure from their responses, provided there is no session control or similar mechanism that you can't predict and without which no content is returned.
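Replicating the scrolling AJAX calls could look roughly like the loop below. This is only a sketch under assumptions: the endpoint https://www.mywebsite.com/ajax/photos and its offset/limit parameters are made up for illustration, as are the function names and the photos.out file. The real URL and parameters have to be read from the browser's Network tab. $USER_AGENT and $COOKIES are the same variables as in the question.

```shell
#!/bin/sh
# Hypothetical endpoint and page size; replace with what the Network tab shows.
BASE_URL='https://www.mywebsite.com/ajax/photos'
PAGE_SIZE=12

# Print the URL for the batch that starts at the given offset.
build_page_url() {
    printf '%s?offset=%s&limit=%s\n' "$BASE_URL" "$1" "$PAGE_SIZE"
}

# Request the first N batches in order, as scrolling would.
# With DRY_RUN=1 (the default) the URLs are only printed, not fetched.
fetch_batches() {
    batches=$1
    i=0
    while [ "$i" -lt "$batches" ]; do
        url=$(build_page_url $((i * PAGE_SIZE)))
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$url"
        else
            # Append each response to one file for later parsing.
            curl --silent --show-error --user-agent "$USER_AGENT" \
                 --cookie "$COOKIES" --cookie-jar "$COOKIES" "$url" >> photos.out
        fi
        i=$((i + 1))
    done
}

fetch_batches 3
```

Run as is, it only prints the three URLs it would request. Setting DRY_RUN=0 makes it actually call curl, which will only return content if the site accepts the cookies and does not require extra tokens per request.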


6 Comments

I see. I just had a look at the Network tab in my browser and I see some Ajax calls being made as I scroll down. I do not control the landing page.
So it's not your website... Hmm, it's going to be pretty tough to handle it using cURL then. Try to figure out what happens in those calls.
I see, yeah, it's not my website actually! I'm looking at the Ajax actions. I guess if I can figure those out, then perhaps they can be passed with the cURL request?
Yup, you can totally make those calls with cURL. See which ones you need and see if you'll be able to rebuild the HTML.
This worked well. I found the Ajax call that generates exactly what I need (in this case photo URLs). It takes some parameters from the cookie, such as the user ID and some other items. I'm now trying to add these params to the cookie file, and then it should work.

I'm posting my answer in case this may be useful to anyone else.

cURL itself will not execute JavaScript, as pointed out by the folks above. However, if the webpage is making any Ajax requests, then you may be in luck.

If the webpage loads its data via Ajax calls, then a browser's network request logging (say, Chrome's) can be leveraged. The Ajax request (or a PHP request, for that matter) that loads the data can be saved as a cURL command from within Chrome's Network tab.

More information on saving the network logs is available on the Google Developers page.

Chrome's network logger automatically packages the headers, user agent and cookie parameters into the cURL command, and outputs a command that is pretty much ready to run in the shell.
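Once the copied command works, its response can be piped through a small filter to pull out the photo URLs. The sketch below is only an illustration: the function name extract_photo_urls and the sample JSON are made up, and it assumes the response contains plain, unescaped image URLs ending in .jpg, .jpeg or .png. It relies on grep -o, which GNU and BSD grep both support.

```shell
#!/bin/sh
# Print each distinct image URL found on stdin, one per line.
# Assumes URLs are not JSON-escaped (no "\/") and end in .jpg, .jpeg or .png.
extract_photo_urls() {
    grep -oE 'https?://[^"[:space:]<>]+\.(jpg|jpeg|png)' | sort -u
}

# Made-up sample response standing in for the real Ajax output.
printf '%s\n' '{"photos":[{"src":"https://cdn.example.com/a/1.jpg"},{"src":"https://cdn.example.com/a/2.png"},{"src":"https://cdn.example.com/a/1.jpg"}]}' \
    | extract_photo_urls
```

With the real request, the command copied from Chrome would take the place of the printf line. If the response escapes slashes or wraps the URLs differently, the pattern needs adjusting.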
