I am trying to get millisecond (sub-second) response times for an aggregation on a collection. However, it currently takes 3-5 seconds even for a small data size (~200 MB). My expected production data for this collection is around 100 GB per shard.
I have already checked the following:
- The response time is the same when I run the query individually on each shard.
- The profiler output shows nothing unusual apart from high timeLockedMicros and numYield.
- I couldn't find anything unusual in the MMS charts either.
I feel like I'm overlooking something silly here. Any help analyzing this further is highly appreciated. My cluster and collection details are below.
Cluster - 6 nodes of 34 GB, 4-core machines (AWS m2.2xlarge)
Data size - 1,285 MB (213 MB per shard)
No. of records - 5.5 million (~1M per shard)
Records sample
{
"_id" : {
"ItemID" : 105182,
"DeviceType" : 16,
"ItemVersionID" : 117971,
"Timestamp" : ISODate("2014-11-14T00:00:00Z"),
"RecordType" : 1
},
"Dim1ID" : 102260,
"Dim2ID" : 313,
"Dim3ID" : 1,
"actionType" : {
"1" : 66,
"47" : 66,
"42" : 72,
"46" : 130
}
}
Query
db.AggregateCollection.aggregate({
    "$group" : {
        "_id" : { },
        "type1" : { "$sum" : "$actionType.1" },
        "type2" : { "$sum" : "$actionType.2" }
    }
})
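For reference, the $group stage above is equivalent to the following plain-JavaScript reduction over the documents (a sketch to illustrate what the pipeline computes, not how mongod executes it; the sample docs are made up to match the record shape above):

```javascript
// Plain-JS equivalent of the $group stage: sum actionType["1"] and
// actionType["2"] across all documents into a single result.
// mongod streams documents instead of materializing an array, but the
// arithmetic is the same.
function aggregateTotals(docs) {
  return docs.reduce(
    function (acc, doc) {
      var t = doc.actionType || {};
      acc.type1 += t["1"] || 0; // a missing field contributes 0, like $sum
      acc.type2 += t["2"] || 0;
      return acc;
    },
    { _id: {}, type1: 0, type2: 0 }
  );
}

// Two documents shaped like the sample record above (values invented):
var docs = [
  { actionType: { "1": 66, "47": 66, "42": 72, "46": 130 } },
  { actionType: { "1": 10, "2": 5 } }
];
console.log(aggregateTotals(docs)); // { _id: {}, type1: 76, type2: 5 }
```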
Profile stats (from one shard)
"keyUpdates" : 0,
"numYield" : 79,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(2981456),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(3756),
"w" : NumberLong(4)
}
},
"responseLength" : 157,
"millis" : 3268,
"execStats" : {
},
UPDATE Thank you, Will, for the prompt response. Appreciate it. I like your new data model and the indexing. However, I'm afraid it isn't suitable for my current data, because:
- 99% of records will have some value for actionType.1, and
- 99% of our queries will select actionType.1.
So an index on actionType.K wouldn't help much, I guess.
As you suggested in #2 and #3, we are already doing pre-aggregation using a Spark cluster, which updates MongoDB.
A little bit more about my query: the query I shared earlier is just a sample, used only to benchmark performance. My actual query will have a $match on Timestamp and a $group on one or more fields. A typical production query will cover 30 days of data; currently my collection has only 15 days of data. My objective is to get sub-second response times for 30 days of data.
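For concreteness, a production-shaped pipeline of that kind might look like the following. The date window and the Dim1ID grouping key are illustrative assumptions, not my actual query:

```javascript
// Hypothetical production-shaped pipeline: $match on the Timestamp field
// inside the compound _id, then $group on one dimension field.
// The 30-day window and the grouping key are assumptions for illustration.
var from = new Date("2014-11-01T00:00:00Z"); // ISODate(...) in the mongo shell
var to   = new Date("2014-12-01T00:00:00Z");

var pipeline = [
  { $match: { "_id.Timestamp": { $gte: from, $lt: to } } },
  { $group: {
      _id: "$Dim1ID",                     // group on one dimension field
      type1: { $sum: "$actionType.1" }
  } }
];

// In the mongo shell this would run as:
// db.AggregateCollection.aggregate(pipeline)
```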
BTW, I did some more analysis today. I took a dump of the shard and restored it into a local mongod installed on my MacBook. The same query took only 2 seconds (vs. 4 s on the AWS instance), which didn't make sense, because the AWS instance is at least 4 times more powerful than the MacBook (both CPU and memory):
MacBook Air - http://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i5-4250U+%40+1.30GHz
AWS m2.2xlarge instance - http://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+E5-2665+%40+2.40GHz
I suspected fragmentation, because the data in the AWS mongo instance was populated over the last 15 days through an application. So I re-imported the dump on the AWS mongo as a separate collection. The query on this new collection took 2 s, which is comparable to the MacBook's speed. So fragmentation is one reason for sure; I am planning to do more research on fragmentation later. Though defragmenting improved performance, the fact that it took the same time as on my MacBook still didn't make sense, given that the AWS instance is 4 times more powerful. We then looked at CPU utilization and found that the mongod instance uses only one CPU (out of 4) for query execution. We are now planning to install 4 shards on each machine to get around this. Please let me know if you see a better approach.
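The 4-shards-per-machine idea would look roughly like the sketch below (a deployment fragment, not a complete setup: ports and data paths are assumptions, and a real cluster also needs config servers and a mongos in front):

```shell
# Hypothetical layout: four shard mongod processes on one host, each
# pinned to its own port and data directory, so queries routed to
# different shards can use different CPU cores.
for port in 27018 27019 27020 27021; do
  mkdir -p /data/shard-$port
  mongod --shardsvr --port $port --dbpath /data/shard-$port \
         --logpath /data/shard-$port/mongod.log --fork
done
# Each process is then registered through mongos, e.g.:
#   sh.addShard("hostname:27018")
```

The trade-off is that all four mongods compete for the same RAM and disk, so this helps CPU-bound aggregations but not I/O-bound ones.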
One more thing: I know that my query has to scan the entire collection, but 2 seconds to scan ~200 MB of data seems very high to me. Is that expected, or am I missing something?