
I am using node.js to open a list of web pages and parse the HTML contents.

I supply the URLs inside the script as an array, then call request to retrieve the HTML, which I then parse with Cheerio.

The problem I have is that some webpages do not list the URL inside the HTML content.

So I want to determine the URL of the page that I am parsing from within my request callback.

Since request is asynchronous, I cannot rely on the outer loop (which iterates over the array of URL strings) to get the URL.

Any ideas?

var requestList = ['https://blahblah.com', 'https://blah2.com'];
for (var i = 0; i < requestList.length; i++) {
  request(requestList[i], function (error, response, html) {
    if (!error && response.statusCode == 200) {
      var $ = cheerio.load(html);
      ...
      // How can I determine the URL of this HTML body?
    }
  });
}

Thanks for any suggestions!

  • Try some console logs to the parsing variables Commented Jan 2, 2018 at 7:54

1 Answer

You can use Array.prototype.forEach instead and let a closure capture the URL:

requestList.forEach((url) => {
    request(url, (err, res, html) => {
        console.log(url);
        // rest of code here...
    });
});

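Why does it work?

A closure is a function that keeps access to the variables in the scopes where it was defined, even after the outer function has returned (it has its own memory, in a sense). In the original for loop, every callback closes over the same var i, and by the time any response arrives the loop has already finished and i equals requestList.length, so requestList[i] is undefined. With forEach, each iteration is a separate function call with its own url, so each request callback sees the URL it was started with.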
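To make the difference concrete, here is a minimal sketch that needs no network access. `fakeRequest` and `flush` are hypothetical stand-ins for an asynchronous API (they are not part of the request library): callbacks are queued and only run after the loop has finished, which is roughly what happens with real requests.

```javascript
// Hypothetical stand-in for an async API: callbacks are stored and run later.
const pending = [];
function fakeRequest(url, callback) {
  pending.push(callback);
}
// Runs every queued callback, simulating responses arriving after the loop.
function flush() {
  while (pending.length) pending.shift()();
}

const urls = ['https://blahblah.com', 'https://blah2.com'];

// Version 1: a var loop. All callbacks share the single variable i.
const seenWithVar = [];
for (var i = 0; i < urls.length; i++) {
  fakeRequest(urls[i], function () {
    seenWithVar.push(urls[i]); // i is already urls.length when this runs
  });
}
flush();

// Version 2: forEach. Each callback closes over its own url parameter.
const seenWithForEach = [];
urls.forEach((url) => {
  fakeRequest(url, () => seenWithForEach.push(url));
});
flush();

console.log(seenWithVar);     // [ undefined, undefined ]
console.log(seenWithForEach); // [ 'https://blahblah.com', 'https://blah2.com' ]
```

The var version reads `urls[2]` twice, which is `undefined`, while the forEach version records each URL in order.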

You can also do this with a loop, as long as each request is started inside its own function call:

for (var i = 0; i < (requestList.length); i++) {
    handleRequest(requestList[i]);
}

function handleRequest(url) {
    // scope a
    request(url, function (error, response, html) {
        // scope b, (closure)
        console.log(url);
        // rest of the code
    })
}

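Since the callback in scope b can reach the variables of scope a, it keeps a reference to url. Every call to handleRequest creates a fresh url, so each callback remembers its own value.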
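If you are on ES2015 or later, another option (not part of the original answer) is to declare the loop counter with `let`, which gives every iteration its own binding of `i`. The sketch below uses a hypothetical `fakeRequestLater` helper in place of the real request call so that it runs without network access.

```javascript
// Hypothetical stand-in for an async call: the callback is queued, not run.
const queued = [];
function fakeRequestLater(url, callback) {
  queued.push(callback);
}

const requestList = ['https://blahblah.com', 'https://blah2.com'];
const loggedUrls = [];

// let creates a new i for each iteration, so each callback keeps its own index.
for (let i = 0; i < requestList.length; i++) {
  fakeRequestLater(requestList[i], () => {
    loggedUrls.push(requestList[i]);
  });
}

// Simulate the responses arriving after the loop has completed.
queued.forEach((callback) => callback());

console.log(loggedUrls); // [ 'https://blahblah.com', 'https://blah2.com' ]
```

This keeps the original loop shape, at the cost of relying on the per-iteration scoping of `let` rather than an explicit function boundary.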

Closures can occasionally cause memory problems: everything a closure captures stays alive for as long as the closure itself is reachable, so a long-lived callback that references a large object keeps that object from being garbage collected.


4 Comments

Right answer, but you should explain why this works, whereas a for loop does not.
@Brad you are right, but since the OP wrote that he understands the request is async and the loop is sync (which is usually the part most people miss), I skipped that. I'll update my answer.
@DanielKrom thank you this worked beautifully and makes so much sense!
@Brad thanks for suggesting the additional explanation too.
