
I'm downloading a webpage using the request module, which is very straightforward.

My problem is that the page I'm trying to download has some async scripts (script tags with the async attribute), and they're not downloaded with the HTML document returned from the HTTP request.

My question is how I can make an HTTP request, with or without the request module (preferably with), and have the WHOLE page downloaded, including edge cases like the one described above.

  • using a headless browser, maybe Commented Jan 22, 2016 at 21:14
  • @mithril_knight Hi, thanks for the reply, look at my comment for chriskelly post. Still looking for a solution. :) Commented Jan 23, 2016 at 14:13
  • Still struggling. If anyone can help me find a solution, I would be grateful. Commented Feb 6, 2016 at 23:29

2 Answers

Sounds like you are trying to do web scraping using JavaScript.

Using request is a very fundamental approach which may be too low-level and time-consuming for your needs. The topic is pretty broad, but you should look into more purpose-built modules such as cheerio, x-ray and nightmare.

x-ray will let you select elements directly from the page in a jQuery-like way instead of parsing the whole body.

nightmare provides a modern headless browser which makes it possible for you to enter input as though using the browser manually. With this you should be able to better handle the AJAX-type requests which are causing you problems.

HTH and good luck!


2 Comments

You're right, basically I'm scraping the web. I'm using an array of regexes to find possible URIs inside the returned document, because using cheerio/jsdom/x-ray etc. won't be enough: there are URIs that aren't inside a src/href attribute value. Apart from this, a headless browser won't do either, because what I'm trying to achieve is to archive and mirror a website (something like HTTrack). I have most of the code done, and I've chosen to use request to handle HTTP requests, but the problem is that, unlike opening a website in a browser, the document returned by the request module doesn't include any async scripts.
@Jorayen That very case happened to me and I had to switch to phantomjs. I used cheerio before that but, same as you, it didn't load the async scripts' contents.

Using only request you could try the following approach to pull the async scripts.

Note: I have tested this with a very basic setup and there is work to be done to make it robust. However, it worked for me:

Test setup

To set up the test I created an HTML file which includes a script in the body like this: <script src="abc.js" async></script>

Then create a temporary server to serve it (httpster)

Scraper

"use strict";

const request = require('request');

const options1 = { url: 'http://localhost:3333/' }

// hard coded script name for test purposes
const options2 = { url: 'http://localhost:3333/abc.js' }

let htmlData = ''  // store html page here (start empty so += does not prepend "undefined")

request.get(options1)
    .on('response', resp => resp.on('data', d => htmlData += d))
    .on('end', () => {
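        let scripts = ''; // store scripts here (start empty so += does not prepend "undefined")

        // htmlData contains the webpage
        // Use an HTML parser to find all script tags with the async attribute
        // and resolve their src values against the page's base url
        // NOT DONE FOR THIS EXAMPLE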

        request.get(options2)
            .on('response', resp => resp.on('data', d => scripts += d))
            .on('end', () => {
                let allData = htmlData.toString() + scripts.toString();
                console.log(allData);
            })
            .on('error', err => console.log(err))
    })
    .on('error', err => console.log(err))

This basic example works. You will need to find all the script tags on the page and extract the URL from each one, which I have not done here.
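The missing extraction step could look roughly like the sketch below. It is not part of the tested example above: `extractScriptUrls` is a hypothetical helper name, it uses regular expressions rather than a real HTML parser (so unusual markup may be missed), and it only finds script tags that are present in the downloaded HTML. Scripts injected at runtime by other scripts will not show up. Relative paths are resolved against the page URL with Node's built-in `URL` class.

```javascript
"use strict";

// Hypothetical helper: return the absolute URLs of <script src="..."> tags
// found in an HTML string. With onlyAsync = true, only tags carrying the
// async attribute are returned.
function extractScriptUrls(html, baseUrl, onlyAsync = false) {
    const tagPattern = /<script\b([^>]*)>/gi;
    const srcPattern = /(?:^|\s)src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i;
    const asyncPattern = /(?:^|\s)async(?=[\s=\/]|$)/i;
    const urls = [];
    let match;

    while ((match = tagPattern.exec(html)) !== null) {
        const attrs = match[1];
        const src = srcPattern.exec(attrs);
        if (!src) continue;                                   // inline script, nothing to fetch
        if (onlyAsync && !asyncPattern.test(attrs)) continue; // not an async script

        const raw = src[1] || src[2] || src[3];
        if (!raw) continue;                                   // empty src value
        try {
            urls.push(new URL(raw, baseUrl).href);            // resolve relative paths
        } catch (err) {
            // skip values that cannot be parsed as a URL
        }
    }
    return urls;
}

// Usage with the test page from above (the second tag is made up for illustration)
const page = '<body><script src="abc.js" async></script><script src="/lib/x.js"></script></body>';
console.log(extractScriptUrls(page, 'http://localhost:3333/', true));
```

Each returned URL could then replace the hard coded options2 in the scraper, with one request per script.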

1 Comment

The problem is that after the first request is done, in the 'end' event, htmlData doesn't contain any async scripts, so I can't really find those async script tags. That is my problem.
