
So I'm making a little scraper for learning purposes; in the end, I should get a tree-like structure of the pages on the website.

I've been banging my head trying to get the requests right. This is more or less what I have:

var request = require('request');


function scanPage(url) {

  // request the page at given url:


  request.get(url, function(err, res, body) {

    var pageObject = {};

  /* [... jQuery mumbo-jumbo to

        1. Fill the page object with information and
        2. Get the links on that page and store them into arrayOfLinks 

    */

    var arrayOfLinks = ['url1', 'url2', 'url3'];

    for (var i = 0; i < arrayOfLinks.length; i++) {

      pageObject[arrayOfLinks[i]] = scanPage(arrayOfLinks[i]);

    }
  });

  return pageObject;
}

I know this code is wrong on many levels, but it should give you an idea of what I'm trying to do.

How should I modify it to make it work? (without the use of promises if possible)

(You can assume that the website has a tree-like structure, so every page only has links to pages further down the tree, hence the recursive approach.)

  • You would probably need an html parser. Try googling something like "javascript html parser"... Commented May 31, 2016 at 13:09
  • Thank you, but it has nothing to do with my question. I parse the html with cheerio (node.js jquery implementation), my problem is how to handle recursively building my object. Commented May 31, 2016 at 13:18
  • The biggest challenge here is to achieve recursive behavior due to async nature for javascript. Commented May 31, 2016 at 13:23
  • I wanted to achieve something similar a while back; with the little time I had, I decided to go with npmjs.com/package/sync-request Commented May 31, 2016 at 13:26
  • AJS: Hmm, I'll try that until a better solution arises Commented May 31, 2016 at 14:04

1 Answer


I know that you'd rather not use promises for whatever reason (and I can't ask why in the comments because I'm new), but I believe that promises are the best way to achieve this.

Here's a solution using promises that answers your question, but might not be exactly what you need:

var request = require('request');
var Promise = require('bluebird');
var get = Promise.promisify(request.get);

var maxConnections = 1; // maximum number of concurrent connections

function scanPage(url) {

    // request the page at given url:

    return get(url).then((res) => {

        var body = res.body;

        /* [... jQuery mumbo-jumbo to

        1. Fill the page object with information and
        2. Get the links on that page and store them into arrayOfLinks

        */

        var arrayOfLinks = ['url1', 'url2', 'url3'];

        return Promise.map(arrayOfLinks, scanPage, { concurrency: maxConnections })
                            .then(results => {
                                var pageObject = {};
                                for (var i = 0; i < results.length; i++)
                                    pageObject[arrayOfLinks[i]] = results[i];
                                return pageObject;
                            });

    });

}

scanPage("http://example.com/").then((res) => {
    // do whatever with res
});

Edit: Thanks to Bergi's comment, rewrote the code to avoid the Promise constructor antipattern.

Edit: Rewrote in a much better way. By using Bluebird's concurrency option, you can easily limit the number of simultaneous connections.
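
Since the question asked for a solution without promises, here is a callback-only sketch of the same recursion. It uses a pending counter to know when all child pages have finished. Note that `fetchLinks(url, cb)` is a hypothetical stand-in for the request-plus-cheerio step that produces a page's outgoing links; it is not part of any library mentioned in this thread.

```javascript
// Callback-only sketch of the recursive scan. fetchLinks(url, cb) is a
// hypothetical helper that calls cb(err, links) with the page's outgoing links.
function scanPage(url, fetchLinks, done) {
  fetchLinks(url, function (err, links) {
    if (err) return done(err);
    var pageObject = {};
    var pending = links.length;
    if (pending === 0) return done(null, pageObject); // leaf page
    links.forEach(function (link) {
      scanPage(link, fetchLinks, function (err, subtree) {
        if (err) return done(err);
        pageObject[link] = subtree;
        if (--pending === 0) done(null, pageObject); // all children finished
      });
    });
  });
}

// Usage with a mock site map (a tree, so no cycle handling is needed):
var site = { root: ['a', 'b'], a: [], b: ['c'], c: [] };
function mockFetch(url, cb) { cb(null, site[url] || []); }

var tree;
scanPage('root', mockFetch, function (err, result) { tree = result; });
// tree is now { a: {}, b: { c: {} } }
```

This works because the site is assumed to be a tree; on a graph with cycles you would also need a visited set, and a real version should guard against `done` being called more than once on errors.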


4 Comments

Avoid the Promise constructor antipattern! You should only promisify request.get using it, and then chain the rest of the code to it using .then(…).
Don't run this on something like wikipedia... you may just hog all the bandwidth on your local network, heat up your CPU and possibly be suspected of DDoSing the website or something. Also try to prevent cyclical links from doing something like url1 -> url2 -> url1 -> ....
I had come to a similar solution, the problem is that all requests fire at the same time and the server is not happy (cf. what Patrick Roberts says). I tried doing it sequentially with reduce() but it's a bit too advanced for me, so that's why I was asking for a "classical" solution.
var promises = arrayOfLinks.map(scanPage);
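
For reference, the sequential reduce() pattern mentioned in that comment can be sketched like this; `scanPage` here is assumed to return a promise, as in the answer above:

```javascript
// Chains one scanPage call after another with reduce(), so only one
// request is in flight at a time; results accumulate in an object.
function scanAllSequentially(links, scanPage) {
  return links.reduce(function (chain, link) {
    return chain.then(function (acc) {
      return scanPage(link).then(function (subtree) {
        acc[link] = subtree; // store this page's subtree under its url
        return acc;
      });
    });
  }, Promise.resolve({}));
}
```

Each iteration waits for the previous promise to settle before firing the next request, which has the same effect as Bluebird's `{ concurrency: 1 }`.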
