I have statistical data in a MongoDB collection, saved as one document per record per day. For example, my collection looks roughly like this:

{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }

Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.

As far as read query performance goes, is this strategy superior to storing one document per record_id that contains an array of statistical data shaped like the document above?

{ _id: record_id, stats: [
{ date: Date(2011,12,11), stat_value_1:39884, stat_value_2:98765 },
{ date: Date(2011,12,12), stat_value_1:38555, stat_value_2:4665 },
{ date: Date(2011,12,13), stat_value_1:12345, stat_value_2:265 },
]}

On the pro side, I will need only one query to get the entire stat history of a record, without resorting to the slower map-reduce method. On the con side, I'll have to sum up the stats for a given date range in my application code, and if a record outgrows its current padding size-wise, some disk reallocation will take place.
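To make the "sum in application code" cost concrete, here is a minimal sketch of what that step could look like for the one-document-per-record layout. The helper name `sumStatsInRange` is hypothetical and the sample values are the ones from the question; it assumes the `stats` array holds JavaScript `Date` objects.

```javascript
// Hypothetical helper: sum the stats of one embedded-array document
// over an inclusive date range. Field names follow the question.
function sumStatsInRange(doc, from, to) {
  const totals = { stat_value_1: 0, stat_value_2: 0 };
  for (const entry of doc.stats) {
    // Date objects compare by their millisecond timestamps.
    if (entry.date >= from && entry.date <= to) {
      totals.stat_value_1 += entry.stat_value_1;
      totals.stat_value_2 += entry.stat_value_2;
    }
  }
  return totals;
}

// Sample document mirroring the question.
// JavaScript months are 0-based, so 11 means December.
const record = {
  _id: 12345,
  stats: [
    { date: new Date(Date.UTC(2011, 11, 11)), stat_value_1: 39884, stat_value_2: 98765 },
    { date: new Date(Date.UTC(2011, 11, 12)), stat_value_1: 38555, stat_value_2: 4665 },
    { date: new Date(Date.UTC(2011, 11, 13)), stat_value_1: 12345, stat_value_2: 265 },
  ],
};

const rangeTotals = sumStatsInRange(
  record,
  new Date(Date.UTC(2011, 11, 12)),
  new Date(Date.UTC(2011, 11, 13))
);
```

The loop is linear in the number of stored days, so with the 600-700 entries mentioned in the comments it touches at most a few hundred array elements per record.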

  • Is there an upper cap on the number of items for each record_id? Commented Dec 13, 2011 at 12:44
  • No upper cap, but realistically there will be no more than one or two years' worth of stats per record (600-700 stats at the most). Commented Dec 13, 2011 at 12:45
  • Some records will have very few as well. 600-700 is an upper limit (in realistic terms, not enforced). Commented Dec 13, 2011 at 12:49

2 Answers

I think this depends on the usage scenario. If the data set for a single aggregation is small, like those 700 records, and you want to do this in real time, I think it's best to choose yet another option: query all the individual records and aggregate them client-side. This avoids the Map/Reduce overhead, is easier to maintain, and does not suffer from reallocation or size limits. Index use should be efficient, and connection-wise I doubt there's much of a difference: most drivers batch transfers anyway.
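As a rough illustration of the client-side option, the sketch below groups per-day documents by `record_id` and sums their stats. The helper name `aggregateByRecord`, the collection name `stats` in the comment, and the sample values are all assumptions made for the example; the input array stands in for whatever the driver returns from the range query.

```javascript
// Hypothetical helper: group per-day documents by record_id and sum their stats.
// `docs` stands in for the result of a range query such as
//   db.stats.find({ record_id: { $in: ids }, date: { $gte: from, $lte: to } })
// where the collection name `stats` is an assumption.
function aggregateByRecord(docs) {
  const totals = new Map();
  for (const doc of docs) {
    let sum = totals.get(doc.record_id);
    if (!sum) {
      sum = { stat_value_1: 0, stat_value_2: 0 };
      totals.set(doc.record_id, sum);
    }
    sum.stat_value_1 += doc.stat_value_1;
    sum.stat_value_2 += doc.stat_value_2;
  }
  return totals;
}

// Made-up sample data: two days for one record, one day for another.
const dailyDocs = [
  { record_id: 12345, date: new Date(Date.UTC(2011, 11, 12)), stat_value_1: 10, stat_value_2: 1 },
  { record_id: 12345, date: new Date(Date.UTC(2011, 11, 13)), stat_value_1: 20, stat_value_2: 2 },
  { record_id: 67890, date: new Date(Date.UTC(2011, 11, 13)), stat_value_1: 5, stat_value_2: 7 },
];

const perRecord = aggregateByRecord(dailyDocs);
```

Since the date filter is applied by the query itself, the client only has to do the grouping and the additions; a compound index on `record_id` and `date` would be the natural companion to such a query, though that is an assumption about the schema rather than something stated in the question.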

The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, your maximum number of dates per record would go down in the subdocument approach. It's also generally easier to work with top-level documents than with subdocuments.

Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.

4 Comments

The problem is that in one request it's common for me to fetch about 1000 records, each with aggregated stats for a specific date range. Doing it on the client is a good option, and I'll look into it as far as client performance goes (it's a JavaScript client).
Hm, I see. If client performance is low and the stats only change on a daily basis, how about keeping larger aggregates (on a month or week basis)? A background worker could update these regularly, and the client would only need to aggregate e.g. 1 year doc, 1 month doc, 2 week docs and 3 day docs instead of more than 400 docs.
Yes, we use such a cache at the moment, but it's only viable within a 4-hour window, as the stats are updated every 4 hours. We also cache custom date range filters, which ultimately amounts to quite a lot of cached data. A test I've done, replacing map-reduce with a regular query and summing up in app code, also came out slow for a 1000-record set... I'm going to test the second approach of one document per record and see how that compares. No avoiding a good old A/B comparison, it seems...
Yeah, real test w/ real data is the only thing in performance you can bet on :)
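
The pre-aggregation idea raised in the comments above could look roughly like the following sketch for the monthly level. The function name `rollupByMonth`, the `"YYYY-MM"` key format, the `days` counter, and the sample values are assumptions for illustration; a background worker would run something like this and store the result as separate rollup documents.

```javascript
// Hypothetical background-worker step: collapse the per-day documents of one
// record into per-month totals keyed by "YYYY-MM" (in UTC).
function rollupByMonth(docs) {
  const months = {};
  for (const doc of docs) {
    const key = doc.date.toISOString().slice(0, 7); // e.g. "2011-12"
    if (!months[key]) {
      months[key] = { stat_value_1: 0, stat_value_2: 0, days: 0 };
    }
    months[key].stat_value_1 += doc.stat_value_1;
    months[key].stat_value_2 += doc.stat_value_2;
    months[key].days += 1; // how many daily docs went into this rollup
  }
  return months;
}

// Made-up sample data spanning a month boundary (months are 0-based).
const recordDays = [
  { record_id: 12345, date: new Date(Date.UTC(2011, 10, 30)), stat_value_1: 100, stat_value_2: 1 },
  { record_id: 12345, date: new Date(Date.UTC(2011, 11, 1)), stat_value_1: 20, stat_value_2: 2 },
  { record_id: 12345, date: new Date(Date.UTC(2011, 11, 2)), stat_value_1: 30, stat_value_2: 3 },
];

const monthly = rollupByMonth(recordDays);
```

A range query would then combine whole-month rollups with the leftover daily documents at the edges of the range, which is what keeps the number of documents summed per record small.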
I think you can refer to this here, and also see how foursquare solved this kind of problem here. Both are valuable.