I am attempting to create a web scraper (in node.js) that will pull down information from a site, and write it to a file. I have it built to correctly work for one page, but when I try to use the function in a for loop, to iterate through multiple games, I get bad data in all of the games.

I understand that this is related to JavaScript's asynchronous nature, and I have read about callback functions, but I'm not sure how to apply them to my code. Any help would be GREATLY appreciated:

for(x = 4648; x < 4650; x++){  //iterate over a few gameIDs, used in URL for request
    scrapeGame(x);
}

function scrapeGame(gameId){
    //request from URL, scrape HTML to arrays as necessary
    //write final array to file
}

Essentially, what I am looking to do, is within the for loop, tell it to WAIT to finish the scrapeGame(x) function before incrementing x and running it for the next game -- otherwise, the arrays start to overwrite each other and the data becomes a huge mess.

EDIT: I've now included the full code which I am attempting to run! I'm getting errors when looking in the files after they are written. For example, the first file is 8kb, second is ~16, 3rd is ~32, etc. It seems things aren't getting cleared before running the next game.

Idea of the program is to pull Jeopardy questions/answers from the archive site in order to eventually build a quiz app for myself.

//Iterate over arbitrary number of games, scrape each

for(x = 4648; x < 4650; x++){
    scrapeGame(x, function(scrapeResult) {
        if(scrapeResult){
            console.log('Scrape Successful');
        } else {
            console.log('Scrape ERROR');
        }
    });
}

function scrapeGame(gameId, callback){
    var request = require('request');
        cheerio = require('cheerio');
        fs = require('fs');
        categories = [];
        categorylist = [];
        ids = [];
        clues = [];
        values = ['0','$200','$400','$600','$800','$1000','$400','$800','$1200','$1600','$2000'];
        valuelist = [];
        answers = [];
        array = [];
        file = [];
        status = false;

    var showGameURL = 'http://www.j-archive.com/showgame.php?game_id=' + gameId;
    var showAnswerURL = 'http://www.j-archive.com/showgameresponses.php?game_id=' + gameId;

    request(showGameURL, function(err, resp, body){ 
    if(!err && resp.statusCode === 200){
        var $ = cheerio.load(body);
        //add a row to categories to avoid starting at 0
        categories.push('Category List');
        //pull all categories to use for later
        $('td.category_name').each(function(){
            var category = $(this).text();
            categories.push(category);
        });
        //pull all clue IDs (coordinates), store to 1d array
        //pull any id that has "stuck" in the string, to prevent duplicates
        $("[id*='stuck']").each(function(){
            var id = $(this).attr('id');
            id = id.toString();
            id = id.substring(0, id.length - 6);
            ids.push(id);
            //if single J, pick category 1-6
            if (id.indexOf("_J_") !== -1){
                var catid = id.charAt(7);
                categorylist.push(categories[catid]);
                var valId = id.charAt(9);
                valuelist.push(values[valId]);
            }
            //if double J, pick category 7-12
            else if (id.indexOf("_DJ_") !== -1){
                var catid = parseInt(id.charAt(8)) + 6;
                categorylist.push(categories[catid]);
                var valId = parseInt(id.charAt(10)) + 5;
                valuelist.push(values[valId]);                
            }
            //if final J, pick category 13
            else {
                categorylist.push(categories[13]);
            }
        });
        //pull all clue texts, store to 1d array
        $('td.clue_text').each(function(){
            var clue = $(this).text();
            clues.push(clue);
        });
        //push pulled values to big array
        array.push(ids);
        array.push(categorylist);
        array.push(valuelist);
        array.push(clues);

        //new request to different URL to pull responses
        request(showAnswerURL, function(err, resp, body){ 
            if(!err && resp.statusCode === 200){
                var $ = cheerio.load(body);

                $('.correct_response').each(function(){
                    var answer = $(this).text();
                    answers.push(answer);
                });
                //push answers to big array
                array.push(answers);
                //combine arrays into 1-d array to prep for writing to file
                for(var i = 0; i < array[0].length; i++){
                    var print = array[0][i] + "|" + array[1][i] + "|" + array[2][i] + "|" + array[3][i] + "|" + array[4][i];
                    var stringPrint = print.toString();
                    file.push(stringPrint);
                }
                //update string, add newlines, etc.
                var stringFile = JSON.stringify(file);
                stringFile = stringFile.split('\\').join('');
                stringFile = stringFile.split('","').join('\n');
                //write to file, eventually will append to end of one big file
                fs.writeFile('J_GAME_' + gameId +'.txt', stringFile, function(err) {
                    if(err) {
                        console.log(err);
                    } else {
                        console.log("Game #" + gameId + " has been scraped.");
                        status = true;
                    }
                });
            }
        });
    }
});
        //clear arrays used
        valuelist = [];
        answers = [];
        categories = [];
        categorylist = [];
        ids = [];
        clues = [];
        array = [];
        file = [];
        //feed callback status
        callback(status);
}

3 Answers

// Iterate over a few gameIDs, used in URL for request.
// A plain for loop does not wait for asynchronous work, so each
// scrape is started from inside the previous scrape's callback.
var firstGameId = 4648;
var lastGameId = 4649;

function scrapeNext(gameId) {
  if (gameId > lastGameId) {
    // All games have been processed.
    return;
  }
  // Pass in the id and the callback as an anonymous function.
  // The callback receives the status that is passed in,
  // in this case, a boolean true/false and an error if any.
  scrapeGame(gameId, function(scrapeResult, err) {
    // This will *NOT* execute *UNTIL* you call it in the function below,
    // i.e. after the file for this game has been written.
    if (scrapeResult) {
      // Scrape was true, move on to the next game ID.
      console.log('Scrape Successful');
      scrapeNext(gameId + 1);
    } else {
      // Scrape was false, output error to console.log and stop.
      // `break` is not legal inside a callback function; not calling
      // scrapeNext is what ends the run. Call scrapeNext(gameId + 1)
      // here instead if you want to just move onto the next game ID.
      console.log('Scrape ERROR :: ' + err);
    }
  });
}

scrapeNext(firstGameId);

// This function now accepts two arguments.
function scrapeGame(gameId, callback) {

  // ************************************************
  // ** Do Your Work Here **
  // Request from URL, scrape HTML to arrays as necessary.
  // Write final array to file.
  // After file creation, execute the callback and pass bool
  // status (true/false).
  // ************************************************

  var request = require('request'),
      cheerio = require('cheerio'),
      fs = require('fs'),
      categories = [],
      categorylist = [],
      ids = [],
      clues = [],
      values = [
          '0',
          '$200',
          '$400',
          '$600',
          '$800',
          '$1000',
          '$400',
          '$800',
          '$1200',
          '$1600',
          '$2000'
      ],
      valuelist = [],
      answers = [],
      array = [],
      file = [],
      showGameURL = 'http://www.j-archive.com/showgame.php?game_id=' + gameId,
      showAnswerURL = 'http://www.j-archive.com/showgameresponses.php?game_id=' + gameId;

  request(showGameURL, function(err, resp, body) {
    if (!err && resp.statusCode === 200) {
      var $ = cheerio.load(body);
      //add a row to categories to avoid starting at 0
      categories.push('Category List');
      //pull all categories to use for later
      $('td.category_name').each(function() {
        var category = $(this).text();
        categories.push(category);
      });
      //pull all clue IDs (coordinates), store to 1d array
      //pull any id that has "stuck" in the string, to prevent duplicates
      $("[id*='stuck']").each(function() {
        var id = $(this).attr('id');
        id = id.toString();
        id = id.substring(0, id.length - 6);
        ids.push(id);
        //if single J, pick category 1-6
        if (id.indexOf("_J_") !== -1) {
          var catid = id.charAt(7);
          categorylist.push(categories[catid]);
          var valId = id.charAt(9);
          valuelist.push(values[valId]);
        }
        //if double J, pick category 7-12
        else if (id.indexOf("_DJ_") !== -1) {
          var catid = parseInt(id.charAt(8)) + 6;
          categorylist.push(categories[catid]);
          var valId = parseInt(id.charAt(10)) + 5;
          valuelist.push(values[valId]);
        }
        //if final J, pick category 13
        else {
          categorylist.push(categories[13]);
        }
      });
      //pull all clue texts, store to 1d array
      $('td.clue_text').each(function() {
        var clue = $(this).text();
        clues.push(clue);
      });
      //push pulled values to big array
      array.push(ids);
      array.push(categorylist);
      array.push(valuelist);
      array.push(clues);

      //new request to different URL to pull responses
      request(showAnswerURL, function(err, resp, body) {
        if (!err && resp.statusCode === 200) {
          var $ = cheerio.load(body);

          $('.correct_response').each(function() {
            var answer = $(this).text();
            answers.push(answer);
          });
          //push answers to big array
          array.push(answers);
          //combine arrays into 1-d array to prep for writing to file
          for (var i = 0; i < array[0].length; i++) {
            var print = array[0][i] + "|" + array[1][i] + "|" + array[2][i] + "|" + array[3][i] + "|" + array[4][i];
            var stringPrint = print.toString();
            file.push(stringPrint);
          }
          //update string, add newlines, etc.
          var stringFile = JSON.stringify(file);
          stringFile = stringFile.split('\\').join('');
          stringFile = stringFile.split('","').join('\n');
          //write to file, eventually will append to end of one big file
          fs.writeFile('J_GAME_' + gameId + '.txt', stringFile, function(err) {

            //clear arrays used
            valuelist = [];
            answers = [];
            categories = [];
            categorylist = [];
            ids = [];
            clues = [];
            array = [];
            file = [];

            if (err) {
              // ******************************************
              // Callback false with error.
              callback(false, err);
              // ******************************************
            } else {
              console.log("Game #" + gameId + " has been scraped.");
              // ******************************************
              // Callback true with no error. 
              callback(true);
              // ******************************************
            }
          });
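        } else {
          // Second request failed: report it so the caller is not left waiting.
          callback(false, err || 'HTTP status ' + resp.statusCode);
        }
      });
    } else {
      // First request failed: report it so the caller is not left waiting.
      callback(false, err || 'HTTP status ' + resp.statusCode);
    }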
  });
}
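
One detail in the listing above is worth spelling out. The `var` declarations are separated by commas, so every array is local to a single `scrapeGame` call. In the question's version the first declaration ends with a semicolon, which turns the assignments after it into implicit globals shared by every call. That would explain files picking up data from other games. The sketch below only illustrates the effect: `fakeRequest`, `scrapeShared` and `scrapeLocal` are hypothetical stand-ins, and a module-level variable plays the role of the implicit global.

```javascript
// Hypothetical stand-in for an asynchronous request: callbacks are queued
// and only run later, after the for loop has already finished.
var pending = [];
function fakeRequest(cb) {
  pending.push(cb);
}

// Plays the role of an implicit global such as `clues` in the question.
var sharedRows;

function scrapeShared(gameId, done) {
  sharedRows = [];                 // the reset runs immediately...
  fakeRequest(function () {
    sharedRows.push('clue from game ' + gameId); // ...the push runs later
    done(sharedRows.slice());
  });
}

function scrapeLocal(gameId, done) {
  var rows = [];                   // a fresh array for every call
  fakeRequest(function () {
    rows.push('clue from game ' + gameId);
    done(rows.slice());
  });
}

// Start every scrape in a for loop, then let the queued callbacks run.
function run(scrape) {
  var rowCounts = [];
  for (var id = 4648; id < 4650; id++) {
    scrape(id, function (rows) {
      rowCounts.push(rows.length);
    });
  }
  while (pending.length > 0) {
    pending.shift()();
  }
  return rowCounts;
}

var sharedCounts = run(scrapeShared); // second game also holds the first game's row
var localCounts = run(scrapeLocal);   // each game holds only its own row
console.log('shared:', sharedCounts, 'local:', localCounts);
```

With shared state the second game's output contains rows from the first, which is the kind of growth described in the question. With local state each game stays separate, no matter how the callbacks interleave.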

10 Comments

In my case, the data is the string that I want to write to the file in question -- so inside of the callback, would I then do the actual writing to the file? Or do I write to the file outside of the callback and it doesn't matter what I pass back to the callback?
No, it doesn't matter. I edited my answer; let me know if it pertains more to your case.
This seems to pass back the status to the callback, but I'm not doing anything with it -- wouldn't I need to say something like "if false, don't do the next one yet?" I guess I don't understand how the callback is preventing the next call of the function from occurring before it has finished the first time.
By adding the callback you are "pausing" the execution of the for loop. It has to wait for the callback code passed in to scrapeGame to execute before it will move on to the next iteration of the loop, and we do not do that until the end of the scrapeGame function.
I am receiving Scrape ERROR -- I have the callback(status); as the very last line of the scrapeGame function. It would appear that regardless of what comes back from callback(status), it is still processing every iteration of scrapeGame I ask for in the for loop.
1

My assumption is that you want them to be scraped one after another, not in parallel, so a for loop won't help. The following approach should do the trick:

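    var x = 4648;
    //start the first scrape; each later one starts from the previous callback
    scrapeGame(x, function cb(){
        //stop once game 4650 has been scraped
        if(x >= 4650){
           return;
        }
        x++;
        return scrapeGame(x, cb);
    });

    function scrapeGame(gameId, callback){
        //request from URL, scrape HTML to arrays as necessary
        //write final array to file
        //call callback() only after the file has been written
    }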

For nested async functions that should be executed serially, forget about the for loop and start each call from the previous call's callback.
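
As a side note, on Node versions that support async/await (an assumption, since nothing in this thread says which version is in use), a for loop can be made to wait if the scraper returns a Promise. This is only a sketch: the `scrapeGame` below is a hypothetical stand-in that resolves after a timer instead of doing real requests, and `scrapeSeries` is a made-up helper name.

```javascript
// Hypothetical stand-in for the real scraper: resolves with the game ID
// after a short delay. Later IDs finish faster, so any out-of-order
// execution would show up in the result.
function scrapeGame(gameId) {
  return new Promise(function (resolve) {
    var delay = (4650 - gameId) * 10;
    setTimeout(function () {
      resolve(gameId);
    }, delay);
  });
}

// Scrape the games one at a time, from firstId to lastId inclusive.
async function scrapeSeries(firstId, lastId) {
  var finished = [];
  for (var id = firstId; id <= lastId; id++) {
    // await suspends this function until the current game is done,
    // so the next iteration cannot start early.
    finished.push(await scrapeGame(id));
  }
  return finished;
}

var seriesDone = scrapeSeries(4648, 4649);
seriesDone.then(function (finished) {
  console.log('Finished in order: ' + finished.join(', '));
});
```

The loop body only moves on once the awaited Promise settles, which gives the same serial behaviour as the callback chaining above.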

An example of correct request handling with http client:

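var http = require('http');

function scrapeGame(gameId, cb){

    //your code and set options
    var options = {
        host: 'www.j-archive.com',
        path: '/showgame.php?game_id=' + gameId
    };

    var req = http.request(options, function(response){
        var result = "";
        //data arrives in chunks; collect all of them
        response.on('data', function (chunk) {
            result += chunk;
        });
        response.on('end', function(){
            //write data here;

            //do the callback
            cb();
        });
    });
    //report network errors instead of failing silently
    req.on('error', function(err){
        cb(err);
    });
    //the request is not sent until end() is called
    req.end();
}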

4 Comments

This does execute them in a serial manner (i.e. in the correct order), but since the scrapeGame function has an HTTP request, it seems like they don't let the new requests fully load before it continues. This results in my data not correctly getting cleared before writing the next game to the file -- the first game looks perfect. After that, it's a jumbled mess.
It seems like you are not processing the response properly. Sharing the code of that function would be helpful. In the meantime, I have added a snippet showing how to handle the response data: it arrives in chunks, so you have to gather them until all chunks have come in, and only then make the callback for the next request.
Sharing code of that function would be helpful. +1 :D
I will update my code up above after I do some cleanup -- thank you both for the help thus far!
0

I solved the ROOT cause of the issue that I was seeing, though I do believe without the callback assistance from red above, I would have been just as lost.

It turned out that the data was being processed correctly, but the file writes were getting scrambled. There is a synchronous method to call instead of writeFile or appendFile:

fs.appendFileSync();

Calling the synchronous version performs the writes IN THE ORDER the calls are made, because each call blocks until its data has been written. This, in addition to the callback help above, solved the issue.
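
For anyone who wants to see it in isolation, here is a minimal sketch of the synchronous append. The file name and the sample rows are made up for the example, and the file goes into the system temp directory so nothing real is overwritten.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical output file in the system temp directory.
var outFile = path.join(os.tmpdir(), 'j_games_demo_' + process.pid + '.txt');

// Start from an empty file so repeated runs do not accumulate data.
fs.writeFileSync(outFile, '');

// Made-up rows standing in for the scraped lines of two games.
['4648|row', '4649|row'].forEach(function (line) {
  // Each call blocks until the data is written, so the lines land
  // in the file in exactly the order the calls are made.
  fs.appendFileSync(outFile, line + '\n');
});

var written = fs.readFileSync(outFile, 'utf8');
console.log(written);

// Clean up the demo file.
fs.unlinkSync(outFile);
```

The trade-off is that the synchronous call blocks the event loop while it writes. That is usually fine for a small one-off script like this scraper.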

Thanks to everyone for the assistance!

1 Comment

Glad you got it sorted out!
