Parsing huge binary files in Node.js

Question

I want to create a Node.js module that can parse huge binary files (some larger than 200GB). Each file is divided into chunks, and a single chunk can be larger than 10GB. I tried both flowing and non-flowing (paused) mode to read the file, but the problem is that the end of the buffer that was read is reached in the middle of parsing a chunk, so parsing of that chunk has to stop before the next 'data' event occurs. This is what I've tried:

var s = getStream();

s.on('data', function(a){
    parseChunk(a);
});

function parseChunk(a){
    /*
        There is a lot of code and many functions here.
        One chunk is larger than the buffer passed to this function,
        so when the end of the buffer is reached, parseChunk
        has to return before the parsing process is finished.
        Also, when the next buffer is passed in, it is not the start of
        a new chunk, because the previous chunk has not been parsed to the end.
    */
}

Loading a whole chunk into process memory isn't possible because I have only 8GB of RAM. How can I read data from the stream synchronously, or how can I pause the parseChunk function when the end of the buffer is reached and wait until new data is available?

When you use streams, you hand the reading, writing, and buffering over to the stream. But you seem to want precise control over exactly what is read and when it is read. Why not read the exact number of bytes you want directly from disk yourself, instead of going through a stream that you don't entirely control?
– jfriend00, Commented Jul 31, 2016 at 7:13
@jfriend00 Because these files don't have to be on my hard disk. The stream can come from a file on a server, from part of another process's memory, or from some buffer.
– user6659331, Commented Jul 31, 2016 at 7:19

Willem van Gerven · Accepted Answer · 2016-11-19 19:25:13Z

Maybe I'm missing something, but I don't see why this couldn't be implemented with streams using a different syntax: paused mode and the 'readable' event. I'd use

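let chunk;
const Nbytes = 64 * 1024; // # of bytes to read into a chunk (example value)
stream.on('readable', () => {
  // The parentheses matter: without them, chunk is assigned the boolean
  // result of the !== comparison instead of the buffer.
  while ((chunk = stream.read(Nbytes)) !== null) {
    // call whatever you like on the chunk of data of size Nbytes
  }
});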

Note that if you specify the chunk size yourself, as done here, read() returns null whenever fewer than Nbytes bytes are currently buffered. That doesn't mean the stream has no more data; the 'readable' event fires again once more data has arrived. The exception is the end of the stream: there, read() returns whatever is left, so expect a 'trimmed' buffer of size < Nbytes at the end of the file.
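To get the "pause parsing until more data is available" behaviour the question asks for, the same read(Nbytes) idea can be wrapped in a promise so the parser can await each piece. The following is only a minimal sketch under assumptions: ByteReader and parseChunks are hypothetical names, and the chunk layout (a 4-byte big-endian payload length followed by the payload) is invented for illustration, since the real file format isn't given. The Node.js documentation caps the size argument of read() at 1 GiB, so a 10GB chunk has to be consumed in smaller pieces either way.

```javascript
// Hypothetical helper: wraps a paused-mode Readable so a parser can `await`
// an exact number of bytes. The listeners are registered once and kept.
class ByteReader {
  constructor(stream) {
    this.stream = stream;
    this.wake = null;   // resolver of the read() call that is currently waiting
    this.error = null;
    this.ended = false;
    stream.on('readable', () => this.notify());
    stream.on('end', () => { this.ended = true; this.notify(); });
    stream.on('close', () => { this.ended = true; this.notify(); });
    stream.on('error', (err) => { this.error = err; this.notify(); });
  }

  notify() {
    const wake = this.wake;
    this.wake = null;
    if (wake) wake();
  }

  // Resolves with n bytes, with fewer bytes if the stream ends first,
  // or with null once the stream has ended and nothing is left.
  async read(n) {
    for (;;) {
      if (this.error) throw this.error;
      const buf = this.stream.read(n);
      if (buf !== null) return buf;
      if (this.ended) return null;
      await new Promise((resolve) => { this.wake = resolve; });
    }
  }
}

// Assumed layout (not from the question): 4-byte big-endian payload length,
// then the payload. A real format with >4GB chunks needs a wider length field.
async function parseChunks(stream, onPiece, pieceSize = 64 * 1024) {
  const reader = new ByteReader(stream);
  let chunks = 0;
  for (;;) {
    const header = await reader.read(4);
    if (header === null) return chunks; // clean end of stream
    if (header.length < 4) throw new Error('truncated chunk header');
    let remaining = header.readUInt32BE(0);
    while (remaining > 0) {
      // Never hold more than pieceSize bytes of a chunk in memory at once.
      const piece = await reader.read(Math.min(pieceSize, remaining));
      if (piece === null) throw new Error('truncated chunk payload');
      onPiece(piece, chunks);
      remaining -= piece.length;
    }
    chunks += 1;
  }
}
```

With something like this, a call such as parseChunks(getStream(), handlePiece) would suspend at every await until the stream has buffered enough bytes, and parsing state lives in ordinary local variables instead of being lost between 'data' events. The 'readable' listener is deliberately registered once rather than added and removed around each wait.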

