
I am pulling JSON data from Salesforce. I can have roughly 10,000 records, but never more. To avoid API limits and having to hit Salesforce for every request, I thought I could query the data every hour and then store it in memory. Obviously this would be much quicker, and much less error prone.

A JSON object would have about 10 properties and maybe one other nested JSON object with two or three properties.

I am using methods similar to the one below to query the records.

getUniqueProperty: function (data, property) {
    return _.chain(data)
        .sortBy(function(item) { return item[property]; })
        .pluck(property)
        .uniq()
        .value();
}

My questions are

  • What would the ramifications be of storing the data in memory and working with it there? I obviously don't want to block the server by running heavy filtering on the data.

  • I have never used Redis before, but would something like a caching DB help?

  • Would it be best to query the data every hour and store the JSON response in something like Mongo? I would then do all my querying against Mongo as opposed to in memory. Each time I query Salesforce, I would just flush the database and reinsert the data.
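For the in-memory variant, the hourly refresh can be reduced to a single reference swap so in-flight reads never see partially written data. This is only a sketch; `fetchFromSalesforce` is a hypothetical stand-in for the real API call:

```javascript
// Hypothetical sketch: refresh an in-memory cache periodically, swapping the
// reference in one assignment so readers see either the old snapshot or the
// new one, never a mix.
let cache = [];

function refreshCache(fetchFromSalesforce) {
    return fetchFromSalesforce().then(function (records) {
        cache = records; // atomic swap of the reference
        return cache;
    });
}

function getCache() {
    return cache;
}

// Scheduling, e.g.:
// setInterval(function () { refreshCache(realFetch); }, 60 * 60 * 1000);
```

The same swap idea applies if the snapshot lives in an external store: build the new data first, then switch over.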

  • Assuming that your Salesforce data is being updated during that hour, all your requests will be out of date until the next update. Commented Feb 8, 2014 at 16:55
  • Not at all worried about the data being out of date. It can be out of date for that timeframe. It's probably only going to be updated and need to be pulled through every couple of hours anyway. Commented Feb 8, 2014 at 17:19

1 Answer


Storing your data in memory has a couple of disadvantages:

  • non-scalable: when you decide to use more processes, each process will need to make the same API request;
  • fragile: if your process crashes, you will lose the data.

Also, working with a large amount of data can block the process for longer than you would like.

Solution: use external storage!

  • It can be Redis, MongoDB, or an RDBMS;
  • update the data in a separate process, triggered by cron;
  • don't drop the whole database: there is a chance that someone will make a request right after that (if your storage doesn't support transactions, of course). Update records instead.
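The "update records instead of dropping everything" point can be sketched against a plain `Map` to keep it self-contained; with a real store you would issue upserts instead (for example MongoDB's `bulkWrite` with `upsert: true`). The `keyField` argument and the sample records are assumptions for illustration:

```javascript
// Sketch: merge a fresh Salesforce snapshot into the existing store by key
// instead of flushing it first, so readers never see an empty store.
function mergeSnapshot(store, freshRecords, keyField) {
    const seen = new Set();
    freshRecords.forEach(function (record) {
        seen.add(record[keyField]);
        store.set(record[keyField], record); // insert or update in place
    });
    // Remove records that no longer exist upstream.
    Array.from(store.keys()).forEach(function (key) {
        if (!seen.has(key)) store.delete(key);
    });
}

const store = new Map([['a', { id: 'a', vehicleMake: 'Ford' }]]);
mergeSnapshot(store, [
    { id: 'a', vehicleMake: 'Toyota' },
    { id: 'b', vehicleMake: 'Honda' }
], 'id');
// store now holds the updated 'a' and the new 'b'; stale keys are gone.
```

At no point during the merge is the store empty, which is the property the flush-and-reinsert approach lacks.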


2 Comments

I have briefly looked at Redis. Isn't it impossible to do rich querying on the data, since it is a key-value store? So, for example, I won't be able to query the JSON data where, let's say, vehicleMake is Toyota? I thought about updating records, but that's where things get quite complex. I just need the data relevant to the application, since all the data is stored in Salesforce anyway. If I lose the data, I just query Salesforce to get the relevant data again and work with that. Can I not spawn a child process for complex querying?
@TyroneMichael If you need complex queries, MongoDB or an RDBMS is a good choice. If you spawn a child for each query, you'll have to deal with the overhead of passing your data via IPC every time, or requesting it from Salesforce. If you have a daemon query process, it will basically be reinventing a DBMS.
