MongoDB data storage performance - one doc with items in array vs multiple docs per item

Question

I have statistical data in a MongoDB collection, saved for each record per day. For example, my collection looks roughly like:

{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }

Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.

As far as read query performance goes, is this strategy superior to storing one document per record_id containing an array of statistical data, like the dict above:

{ _id: record_id, stats: [
{ date: Date(2011,12,11), stat_value_1:39884, stat_value_2:98765 },
{ date: Date(2011,12,12), stat_value_1:38555, stat_value_2:4665 },
{ date: Date(2011,12,13), stat_value_1:12345, stat_value_2:265 },
]}

On the pro side, I will need only one query to get the entire stat history of a record without resorting to the slower map-reduce method. On the con side, I'll have to sum up the stats for a given date range in my application code, and if a record outgrows its current padding size-wise, there's some disk reallocation that will go on.
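To make the application-side summing concrete, here is a minimal sketch in Python (used only for illustration; the actual client is JavaScript). The `sum_stats` helper and the in-memory `record` are hypothetical and simply mirror the embedded-array document shown above.

```python
from datetime import date

# Hypothetical in-memory copy of one embedded-array document,
# shaped like the example above (one entry per day).
record = {
    "_id": 12345,
    "stats": [
        {"date": date(2011, 12, 11), "stat_value_1": 39884, "stat_value_2": 98765},
        {"date": date(2011, 12, 12), "stat_value_1": 38555, "stat_value_2": 4665},
        {"date": date(2011, 12, 13), "stat_value_1": 12345, "stat_value_2": 265},
    ],
}


def sum_stats(doc, start, end):
    """Sum every non-date field over entries with start <= date <= end."""
    totals = {}
    for entry in doc["stats"]:
        if start <= entry["date"] <= end:
            for key, value in entry.items():
                if key != "date":
                    totals[key] = totals.get(key, 0) + value
    return totals


totals = sum_stats(record, date(2011, 12, 12), date(2011, 12, 13))
```

With this layout the whole array is fetched and filtered in the application, so the cost grows with the full history of the record rather than with the requested date range.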

Is there an upper cap on the number of items for each record_id?
– DhruvPathak, Commented Dec 13, 2011 at 12:44
No upper cap, but realistically there will be no more than one or two years' worth of stats per record (600-700 stats at the most).
– Harel, Commented Dec 13, 2011 at 12:45
Some records will have very few as well. 600-700 is an upper limit (in realistic terms, not enforced).
– Harel, Commented Dec 13, 2011 at 12:49

mnemosyn · Accepted Answer · 2011-12-13 13:43:29Z

2

I think this depends on the usage scenario. If the data set for a single aggregation is small like those 700 records and you want to do this in real-time, I think it's best to choose yet another option and query all individual records and aggregate them client-side. This avoids the Map/Reduce overhead, it's easier to maintain and it does not suffer from reallocation or size limits. Index use should be efficient and connection-wise, I doubt there's much of a difference: most drivers batch transfers anyway.
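As a rough sketch of that option, the per-day documents returned by a plain range query can be grouped and summed on the client. The example below is Python for illustration only, and the `aggregate_by_record` helper and the sample `docs` list are hypothetical. In a real application `docs` would be the cursor from a query filtered on record_id and date, ideally backed by a compound index on those two fields (an assumption, not something stated in the question).

```python
from collections import defaultdict
from datetime import date


def aggregate_by_record(docs):
    """Sum stat_value_1 and stat_value_2 per record_id over per-day documents."""
    totals = defaultdict(lambda: {"stat_value_1": 0, "stat_value_2": 0})
    for doc in docs:
        bucket = totals[doc["record_id"]]
        bucket["stat_value_1"] += doc["stat_value_1"]
        bucket["stat_value_2"] += doc["stat_value_2"]
    return dict(totals)


# Stand-in for the documents a date-range query would return.
docs = [
    {"record_id": 12345, "date": date(2011, 12, 12), "stat_value_1": 38555, "stat_value_2": 4665},
    {"record_id": 12345, "date": date(2011, 12, 13), "stat_value_1": 12345, "stat_value_2": 265},
    {"record_id": 67890, "date": date(2011, 12, 13), "stat_value_1": 10, "stat_value_2": 20},
]

result = aggregate_by_record(docs)
```

Because the function only iterates once over whatever it is given, it works the same on a list or on a driver cursor that streams batches.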

The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, your maximum number of dates per record would go down in the subdocument approach, because each document has a size limit. It's also generally easier to work with top-level documents than with subdocuments.

Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.


4 Comments

Harel Over a year ago

The problem is that in one request it's common for me to fetch about 1000 records, each with aggregated stats for a specific date range. Doing it on the client is a good option and I'll look into it as far as client performance goes (it's a JavaScript client).

mnemosyn Over a year ago

Hm, I see. If client performance is low and the stats only change on a daily basis, how about keeping larger aggregates (on a monthly or weekly basis)? A background worker could update these regularly, and the client only needs to aggregate e.g. 1 year doc, 1 month doc, 2 week docs and 3 day docs instead of more than 400 docs.
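A simplified sketch of that idea, in Python for illustration: split the requested range into whole calendar months (served by pre-aggregated month documents) plus the leftover days at either end (served by the per-day documents). The `split_range` helper is hypothetical, and it only handles month and day buckets; the year and week levels mentioned above are left out to keep the example short.

```python
import calendar
from datetime import date, timedelta


def split_range(start, end):
    """Split the inclusive range [start, end] into whole months and leftover days."""
    months, days = [], []
    current = start
    while current <= end:
        last_day = calendar.monthrange(current.year, current.month)[1]
        month_end = date(current.year, current.month, last_day)
        if current.day == 1 and month_end <= end:
            # The whole calendar month lies inside the range: use its rollup.
            months.append((current.year, current.month))
            current = month_end + timedelta(days=1)
        else:
            # Partial month: fall back to the per-day document.
            days.append(current)
            current += timedelta(days=1)
    return months, days


months, days = split_range(date(2011, 1, 30), date(2011, 12, 13))
```

For this range the client would sum 10 month rollups (February to November) and 15 day documents instead of one document per day.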

Harel Over a year ago

Yes, we use such a cache at the moment, but it's only viable within a 4 hour window as the stats are updated every 4 hours. We also cache custom date range filters, which ultimately amounts to quite a lot of cached data. A test I've done replacing map-reduce with a regular query and summing up in app code also came up slow for a 1000 record set... I'm going to test the 2nd approach of 1 document per record and see how that compares. No avoiding a good old A/B comparison, it seems...

mnemosyn Over a year ago

Yeah, real test w/ real data is the only thing in performance you can bet on :)

jianpx · Accepted Answer · 2013-01-26 04:02:40Z

0

I think you can refer to here, and also see how foursquare solves this kind of problem here. Both are valuable.

