
I'm trying to dump approximately 2.2 million objects into MongoDB (using Mongoose). The problem is that when I save all the objects one by one, it gets stuck. I've kept a sample of the code below. If I run this code for 50,000 objects it works great, but if I increase the data size to around 500,000 it gets stuck. I want to know what is wrong with this approach, and I want to find a better way to do it. I'm quite new to Node.js. I've tried loops and everything else with no luck; finally I found this kind of solution. It works fine for 50k objects but gets stuck for 2.2 million, and after some time I get this:

FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory Aborted (core dumped)

var connection = mongoose.createConnection("mongodb://localhost/entity");

var entitySchema = new mongoose.Schema({
    name: String
  , date: Date
  , close: Number
  , volume: Number
  , adjClose: Number
});

var Entity = connection.model('entity', entitySchema);
var mongoobjs = ["2.2 million objects here, populated in code"]; // works completely fine till here

async.map(mongoobjs, function(object, next){
    var Obj = new Entity({
        name: object.name
      , date: object.date
      , close: object.close
      , volume: object.volume
      , adjClose: object.adjClose
    });
    Obj.save(next);
}, function(){ console.log("Saved"); });
  • Ok, so I achieved it successfully using async.eachSeries, but it is way too slow. Is there anything much faster than this? Commented Nov 10, 2015 at 11:10

3 Answers


Thanks cdbajorin

This seems to be a much better and somewhat faster batch approach for doing this. What I learned was that in my earlier approach, "new Entity(....)" was taking time and causing the memory overflow. I'm still not sure why.

So, what I did was, rather than using this line:

var Obj = new Entity({
    name: object.name
  , date: object.date
  , close: object.close
  , volume: object.volume
  , adjClose: object.adjClose
});

I just created plain JSON objects and stored them in an array:

var stockObj = {
    name: object.name
  , date: object.date
  , close: object.close
  , volume: object.volume
  , adjClose: object.adjClose
};
mongoobjs.push(stockObj); // array of objs

and used this command... and voilà, it worked!

Entity.collection.insert(mongoobjs, function(){ console.log("Saved successfully"); });
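One caveat worth adding (my note, not from the original answer): handing all 2.2 million objects to a single insert call keeps the whole array in memory and can run into the driver's batch limits, so splitting the array into chunks and inserting them sequentially may be safer. A minimal sketch, where `chunk` is a hypothetical helper and `Entity`, `mongoobjs`, and the `async` library are assumed from the code above:

```javascript
// Split a large array into fixed-size chunks so each insert call
// stays small and earlier batches can be garbage collected.
function chunk(arr, size) {
  var chunks = [];
  for (var i = 0; i < arr.length; i += size) {
    chunks.push(arr.slice(i, i + size));
  }
  return chunks;
}

// Sketch: insert each chunk one after another (uncomment to run
// against a real connection; assumes `Entity`, `mongoobjs`, `async`).
// async.eachSeries(chunk(mongoobjs, 1000), function (batch, next) {
//   Entity.collection.insert(batch, next);
// }, function (err) {
//   if (err) return console.error(err);
//   console.log("Saved successfully");
// });
```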



Node.js uses V8, which has the unfortunate property, from the perspective of developers coming from other interpreted languages, of severely restricting the amount of memory you can use to something like 1.7 GB, regardless of available system memory.

There is really only one way, afaik, to get around this - use streams. Precisely how you do this is up to you. For example, you can simply stream data in continuously, process it as it's coming in, and let the processed objects get garbage collected. This has the downside of being difficult to balance input to output.

The approach we've been favoring lately is to have an input stream bring in work and save it to a queue (e.g. an array). In parallel, you write a function that is always trying to pull work off the queue. This makes it easy to separate the logic and to throttle the input stream in case work is coming in (or going out) too quickly.

Say for example, to avoid memory issues, you want to stay below 50k objects in the queue. Then your stream-in function could pause the stream or skip the get() call if the output queue has > 50k entries. Similarly, you might want to batch writes to improve server efficiency. So your output processor could avoid writing unless there are at least 500 objects in the queue or if it's been over 1 second since the last write.

This works because javascript uses an event loop which means that it will switch between asynchronous tasks automatically. Node will stream data in for some period of time then switch to another task. You can use setTimeout() or setInterval() to ensure that there is some delay between function calls, thereby allowing another asynchronous task to resume.
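The bounded-queue idea described above can be sketched roughly as follows. The thresholds and function names here are illustrative only, not from any library; `writeBatch` stands in for the real database insert:

```javascript
// Illustrative thresholds: pause input above 50k queued items,
// flush output in batches of 500.
var MAX_QUEUE = 50000;
var BATCH_SIZE = 500;
var queue = [];

// Producer side: push incoming work; return true when the
// input stream should pause because the queue is too full.
function enqueue(item) {
  queue.push(item);
  return queue.length > MAX_QUEUE;
}

// Consumer side: once enough work has accumulated, hand one
// batch to `writeBatch` (which would be your DB insert).
function drain(writeBatch) {
  if (queue.length >= BATCH_SIZE) {
    writeBatch(queue.splice(0, BATCH_SIZE));
  }
}

// Run the consumer periodically so the event loop can interleave
// it with the incoming stream, e.g.:
// setInterval(function () {
//   drain(function (batch) {
//     Entity.collection.insert(batch, function () {});
//   });
// }, 100);
```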

Specifically addressing your problem: it looks like you are saving each object individually, which will take a long time for 2.2 million objects. Instead, find a way to batch the writes.

1 Comment

Mongoose can't batch writes, but you can bypass Mongoose and use the Node driver by calling Model.collection.insert(objects, callback).

As an addition to answers provided in this thread, I was successful with

  • Bulk insert (batch insertion) of 20,000+ documents (objects)
  • Using the low memory (250 MB) available within cheap Heroku offerings
  • Using one instance, without any parallel processing

The Bulk operation as specified by the MongoDB native driver was used, and the following is the (code-ish) snippet that worked for me:

var counter = 0;
var entity = {}, entities = []; // initialize entities from a source such as a file, external database, etc.
var bulk = Entity.collection.initializeOrderedBulkOp();
var size = MAX_ENTITIES; // or `entities.length`; defined in config, mine was 20,000
// the while(size--) construct is considered faster than other JavaScript loops
while (size--) {
    entity = entities[size];
    if (entity && entity.id) {
        // add `.upsert()` before `.update()` to create the document if it doesn't exist
        bulk.find({ id: entity.id }).update({ $set: { value: entity.value } });
    }
    console.log('processing --- ', entity, size);
}
bulk.execute(function (error) {
    if (error) return next(error);
    return next(null, { message: 'Synced vector data' });
});

Entity is a Mongoose model. Note that old versions of MongoDB may not support bulk operations, as they were made available from version 3+.

I hope this answer helps someone.

Thanks.

