I am trying to get millisecond (sub-second) response times for an aggregation on a collection. However, it currently takes 3-5 seconds even for a small data size (~200 MB). My expected production data for this collection is around 100 GB per shard.
I have already checked the following:
- The response time is the same when I run the query individually on each shard.
- The profiler output shows nothing unusual apart from high timeLockedMicros and numYield.
- I couldn't find anything unusual in the MMS charts either.
I feel like I'm overlooking something silly here. Any help analyzing this further is highly appreciated. My cluster and collection details are below.
Cluster - 6 nodes of 34 GB, 4-core machines (AWS m2.2xlarge)
Data size - 1,285 MB (213 MB per shard)
No. of records - 5.5 million (~1M per shard)
Records sample
{
"_id" : {
"ItemID" : 105182,
"DeviceType" : 16,
"ItemVersionID" : 117971,
"Timestamp" : ISODate("2014-11-14T00:00:00Z"),
"RecordType" : 1
},
"Dim1ID" : 102260,
"Dim2ID" : 313,
"Dim3ID" : 1,
"actionType" : {
"1" : 66,
"47" : 66,
"42" : 72,
"46" : 130
}
}
Query
db.AggregateCollection.aggregate({
    "$group" : {
        "_id" : { },
        "type1" : { "$sum" : "$actionType.1" },
        "type2" : { "$sum" : "$actionType.2" }
    }
})
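For reference, the $group stage above is equivalent to the following plain-JavaScript reduction over the documents (a sketch to illustrate what the pipeline computes, not how mongod executes it; the sample docs are made up to match the record shape above):

```javascript
// Plain-JS equivalent of the $group stage: sum actionType["1"] and
// actionType["2"] across all documents into a single result.
// mongod streams documents instead of materializing an array, but the
// arithmetic is the same.
function aggregateTotals(docs) {
  return docs.reduce(
    function (acc, doc) {
      var t = doc.actionType || {};
      acc.type1 += t["1"] || 0; // a missing field contributes 0, like $sum
      acc.type2 += t["2"] || 0;
      return acc;
    },
    { _id: {}, type1: 0, type2: 0 }
  );
}

// Two documents shaped like the sample record above (values invented):
var docs = [
  { actionType: { "1": 66, "47": 66, "42": 72, "46": 130 } },
  { actionType: { "1": 10, "2": 5 } }
];
console.log(aggregateTotals(docs)); // { _id: {}, type1: 76, type2: 5 }
```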
Profile stats (from one shard)
"keyUpdates" : 0,
"numYield" : 79,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(2981456),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(3756),
"w" : NumberLong(4)
}
},
"responseLength" : 157,
"millis" : 3268,
"execStats" : {
},
UPDATE Thank you, Will, for the prompt response. Appreciate it. I like your new data model and the indexing. However, I'm afraid it isn't suitable for my current data, because:
- 99% of records will have some value for actionType.1, and
- 99% of our queries will select actionType.1.
So an index on actionType.K wouldn't help much, I guess.
As you suggested in #2 and #3, we are already doing pre-aggregation using a Spark cluster, which updates MongoDB.
A little bit more about my query: the query I shared earlier is just a sample, used only to benchmark performance. My actual query will have a $match on Timestamp and a $group on one or more fields. A typical production query will cover 30 days of data; currently my collection has only 15 days of data. My objective is to get sub-second response times for 30 days of data.
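For concreteness, a production-shaped pipeline of that kind might look like the following. The date window and the Dim1ID grouping key are illustrative assumptions, not my actual query:

```javascript
// Hypothetical production-shaped pipeline: $match on the Timestamp field
// inside the compound _id, then $group on one dimension field.
// The 30-day window and the grouping key are assumptions for illustration.
var from = new Date("2014-11-01T00:00:00Z"); // ISODate(...) in the mongo shell
var to   = new Date("2014-12-01T00:00:00Z");

var pipeline = [
  { $match: { "_id.Timestamp": { $gte: from, $lt: to } } },
  { $group: {
      _id: "$Dim1ID",                     // group on one dimension field
      type1: { $sum: "$actionType.1" }
  } }
];

// In the mongo shell this would run as:
// db.AggregateCollection.aggregate(pipeline)
```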
BTW, I did some more analysis today. I took a dump of the shard and restored it into a local mongod installed on my MacBook. The same query took only 2 seconds (vs. 4 s on the AWS instance), which didn't make sense, because the AWS instance is at least 4 times more powerful than the MacBook (both CPU and memory):
MacBook Air - http://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i5-4250U+%40+1.30GHz
AWS m2.2xlarge instance - http://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+E5-2665+%40+2.40GHz
I suspected fragmentation, because the data in the AWS mongo instance was populated over the last 15 days through an application. So I re-imported the dump on the AWS mongo as a separate collection. The query on this new collection took 2 s, which is comparable to the MacBook's speed. So fragmentation is one reason for sure; I am planning to do more research on fragmentation later. Though defragmenting improved performance, the fact that it took the same time as on my MacBook still didn't make sense, given that the AWS instance is 4 times more powerful. We then looked at CPU utilization and found that the mongod instance uses only one CPU (out of 4) for query execution. We are now planning to install 4 shards on each machine to get around this. Please let me know if you see a better approach.
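The 4-shards-per-machine idea would look roughly like the sketch below (a deployment fragment, not a complete setup: ports and data paths are assumptions, and a real cluster also needs config servers and a mongos in front):

```shell
# Hypothetical layout: four shard mongod processes on one host, each
# pinned to its own port and data directory, so queries routed to
# different shards can use different CPU cores.
for port in 27018 27019 27020 27021; do
  mkdir -p /data/shard-$port
  mongod --shardsvr --port $port --dbpath /data/shard-$port \
         --logpath /data/shard-$port/mongod.log --fork
done
# Each process is then registered through mongos, e.g.:
#   sh.addShard("hostname:27018")
```

The trade-off is that all four mongods compete for the same RAM and disk, so this helps CPU-bound aggregations but not I/O-bound ones.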
One more thing: I know that my query has to scan the entire collection, but 2 seconds to scan ~200 MB of data seems very high to me. Is that expected, or am I missing something?