1

I have a .list file containing information on movies. The file is formatted as follows:

New  Distribution  Votes  Rank  Title
      0000000125  1176527   9.2  The Shawshank Redemption (1994)
      0000000125  817264   9.2  The Godfather (1972)
      0000000124  538216   9.0  The Godfather: Part II (1974)
      0000000124  1142277   8.9  The Dark Knight (2008)
      0000000124  906356   8.9  Pulp Fiction (1994)

The code I have so far is as follows:

// modules I'll be using
var fs = require('fs');
var csv = require('csv');

csv().from.path('files/info.txt', { delimiter: '  '})
.to.array(function(data){
    console.log(data);
});

But the values are separated by single spaces, double spaces and tabs, so there is no single delimiter to use. How can I extract this information into an array?

2
  • Is this list file auto-generated, or did you create it manually? Commented Mar 20, 2014 at 13:02
  • Auto-generated; it's the IMDb one found at ftp.fu-berlin.de/pub/misc/movies/database Commented Mar 20, 2014 at 13:02

2 Answers

3

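You can collapse each run of two or more spaces or tabs into a single tab, and then parse the result as a tab-delimited string, like this:

var fs = require('fs');
var csv = require('csv');

fs.readFile('files/info.txt', 'utf8', function (err, csvdata) {
  if (err) {
    return console.log(err);
  }
  // Match only spaces and tabs (not \s) so that line breaks survive;
  // require two or more so that single spaces inside titles are kept.
  var movies = csvdata.replace(/[ \t]{2,}/g, '\t');

  csv().from.string(movies, { delimiter: '\t' })
    .to.array(function (data) {
      console.log(data);
    });
});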
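
The same splitting idea can be sketched without the csv module, using only string methods. This is a minimal illustration, not a definitive implementation: `splitRows` is a hypothetical helper name, the sample rows are copied from the question, and it assumes that columns are always separated by at least two spaces or by a tab while titles contain only single spaces.

```javascript
// Sample rows copied from the question; a real run would read the file instead.
var sample = [
  'New  Distribution  Votes  Rank  Title',
  '      0000000125  1176527   9.2  The Shawshank Redemption (1994)',
  '      0000000124  538216   9.0  The Godfather: Part II (1974)'
].join('\n');

// splitRows is a hypothetical helper, not part of any library.
function splitRows(text) {
  return text
    .split(/\r?\n/)
    .slice(1) // skip the header row
    .filter(function (line) { return line.trim() !== ''; })
    .map(function (line) {
      // Trim first so the leading indentation does not produce an empty field,
      // then split on a run of two or more spaces/tabs, or on a single tab.
      return line.trim().split(/[ \t]{2,}|\t/);
    });
}

var rows = splitRows(sample);
console.log(rows);
```

Because each line is trimmed before splitting, the empty "New" column is dropped and every row has four fields: distribution, votes, rank and title.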

3 Comments

I think multiple spaces to one tab will be better, otherwise "The Shawshank Redemption (1994)" will be parsed as four fields.
I decided to separate with commas on two or more spaces: data.replace(/\s{2,}/g, ",") - thanks for the response :)
Good to hear that, an upvote would be appreciated :)
0

It looks easy to parse with regex:

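var fs = require('fs');

var file = 'files/info.txt'; // path taken from the question

function parse(row) {
  var match = row.match(/\s{6}(\d*)\s{2}(\d*)\s{3}(\d*\.\d)/);
  if (!match) {
    return null; // blank line or a row that does not fit the pattern
  }
  return {
    distribution: match[1],
    votes: match[2],
    rank: match[3]
  };
}

var movies = fs.readFileSync(file, 'utf8') // pass an encoding to get a string, not a Buffer
  .split('\n')
  .slice(1) // since we don't care about the first row
  .map(parse)
  .filter(Boolean); // drop the rows that did not match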

I will leave you to build the rest of the regex. I use two tools to do so: rubular.com and the Node.js REPL.

This \s{6}(\d*)\s{2}(\d*) means: match six whitespace characters, then capture a run of digits, then match two whitespace characters, then capture another run of digits, and so on.
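
One possible way to finish the pattern is sketched here. It is only an illustration under stated assumptions: `parseRow` and `ROW` are hypothetical names, the distribution column is assumed to be all digits as in the sample, and `\s+` is used instead of fixed counts so that the exact number of spaces between columns does not matter.

```javascript
// Leading whitespace, distribution, votes, rank, then the rest of the line as the title.
var ROW = /^\s+(\d+)\s+(\d+)\s+(\d+\.\d)\s+(.+?)\s*$/;

// parseRow is a hypothetical helper; it returns null for rows that do not fit the pattern.
function parseRow(row) {
  var match = row.match(ROW);
  if (!match) {
    return null;
  }
  return {
    distribution: match[1],   // kept as a string to preserve leading zeros
    votes: Number(match[2]),
    rank: Number(match[3]),
    title: match[4]
  };
}

var parsed = parseRow('      0000000125  1176527   9.2  The Shawshank Redemption (1994)');
var header = parseRow('New  Distribution  Votes  Rank  Title');
console.log(parsed, header);
```

The header row does not start with whitespace, so it fails to match and comes back as null, which makes it easy to filter out.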
