57

Does the aggregation framework introduced in MongoDB 2.2 offer any performance improvements over map/reduce?

If yes, why, how, and by how much?

(I have already run a test myself, and the performance was nearly the same.)

5
  • "Nearly" the same? With which benchmarks? Your remark is basically pointless, and you are comparing cats and cows. In addition, you know yourself that MR is still limited to single-threading... so: pointless question and therefore -1 Commented Dec 17, 2012 at 6:01
  • @user1833746 It's a question; I don't want to explain my benchmarks. I asked in order to get new answers to this question. Please vote up to allow others to answer. Commented Dec 17, 2012 at 6:59
  • Have you seen this question (and its answers)? stackoverflow.com/questions/12139149/… Commented Dec 17, 2012 at 8:57
  • @Asya Yes, see my benchmark below Commented Dec 17, 2012 at 9:24
  • Please refer to this link for more details: runnable.com/blog/… Commented Feb 12, 2021 at 10:22

2 Answers

66

Every test I have personally run (including with your own data) shows the aggregation framework being several times faster than map/reduce, and usually an order of magnitude faster.

Taking just 1/10th of the data you posted (but warming the cache first rather than clearing the OS cache, because I want to measure the performance of the aggregation and not how long it takes to page in the data), I got this:

MapReduce: 1,058ms
Aggregation Framework: 133ms

Removing the $match from aggregation framework and {query:} from mapReduce (because both would just use an index and that's not what we want to measure) and grouping the entire dataset by key2 I got:

MapReduce: 18,803ms
Aggregation Framework: 1,535ms

Those are very much in line with my previous experiments.
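To put those timings in relative terms, the speedup can be computed directly from the four numbers above. This is only a small arithmetic sketch; the dictionary labels and the `speedup` helper are names made up for illustration.

```python
# Timings reported above, in milliseconds.
timings_ms = {
    "with $match / query": {"map_reduce": 1058, "aggregation": 133},
    "whole collection, grouped by key2": {"map_reduce": 18803, "aggregation": 1535},
}


def speedup(map_reduce_ms, aggregation_ms):
    """Return how many times faster aggregation was than map/reduce."""
    return map_reduce_ms / aggregation_ms


speedups = {
    name: speedup(t["map_reduce"], t["aggregation"])
    for name, t in timings_ms.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

That works out to roughly 8x with the filter and roughly 12x over the whole collection.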


3 Comments

for additional comments on this see answer to stackoverflow.com/questions/12139149/…
Thanks for answering the first portion of the question! What about the second part? Why and how? Do you have something to add for that? Thank you for any input.
This is covered in the docs, but in a nutshell: aggregation runs natively in the server (C++), whereas MapReduce spawns separate JavaScript thread(s) to run JS code.
9

My benchmark:

== Data Generation ==

Generate 4 million rows (with Python), each approximately 350 bytes. Each document has these keys:

  • key1, key2 (two random columns to test indexing, one with cardinality of 2000, and one with cardinality of 20)
  • baddata: a long string to increase the size of each document
  • value: a simple number (const 10) to test aggregation

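import os
import random
from binascii import hexlify

from pymongo import MongoClient  # Connection was removed in PyMongo 3; MongoClient replaces it

db = MongoClient('127.0.0.1').test  # mongo connection
random.seed(1)
for _ in range(2):
    # 2 rounds x 10 keys -> key1 cardinality 20; 2 rounds x 1000 keys -> key2 cardinality 2000
    key1s = [hexlify(os.urandom(10)).decode('ascii') for _ in range(10)]
    key2s = [hexlify(os.urandom(10)).decode('ascii') for _ in range(1000)]
    baddata = 'some long date ' + '*' * 300  # padding to bring each document to ~350 bytes
    for i in range(2000):
        # 2 rounds x 2000 batches x 1000 documents = 4 million documents
        data_list = [{
                'key1': random.choice(key1s),
                'key2': random.choice(key2s),
                'baddata': baddata,
                'value': 10,
                } for _ in range(1000)]
        # insert_many replaces the per-document save() loop (save() is gone in current PyMongo)
        db.testtable.insert_many(data_list)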
Total data size was about 6 GB in MongoDB (and 2 GB in Postgres).
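As a sanity check on the description, the document count and key cardinalities follow from the loop bounds in the generation script. A quick sketch of that arithmetic, assuming the random 10-byte keys never collide (the variable names here are only illustrative):

```python
# Loop bounds taken from the generation script above.
outer_rounds = 2          # for _ in range(2)
key1_per_round = 10       # len(key1s)
key2_per_round = 1000     # len(key2s)
batches_per_round = 2000  # for i in range(2000)
docs_per_batch = 1000     # documents built per batch

total_docs = outer_rounds * batches_per_round * docs_per_batch
key1_cardinality = outer_rounds * key1_per_round
key2_cardinality = outer_rounds * key2_per_round

# Keys are picked uniformly at random, so this is an average, not an exact count.
avg_docs_per_key1 = total_docs // key1_cardinality

print(total_docs, key1_cardinality, key2_cardinality, avg_docs_per_key1)
```

So a filter on a single key1 value should match about 200K of the 4 million documents on average.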

== Tests ==

I ran several tests, but one is enough to compare results:

NOTE: The server was restarted and the OS cache cleared after each query, to eliminate the effect of caching.

QUERY: aggregate all rows with key1=somevalue (about 200K rows) and sum value for each key2

  • map/reduce 10.6 sec
  • aggregate 9.7 sec
  • group 10.3 sec

queries:

map/reduce:

db.testtable.mapReduce(function(){emit(this.key2, this.value);}, function(key, values){var i =0; values.forEach(function(v){i+=v;}); return i; } , {out:{inline: 1}, query: {key1: '663969462d2ec0a5fc34'} })

aggregate:

db.testtable.aggregate([ { $match: {key1: '663969462d2ec0a5fc34'} }, { $group: {_id: '$key2', pop: {$sum: '$value'}} } ])

group:

db.testtable.group({key: {key2:1}, cond: {key1: '663969462d2ec0a5fc34'}, reduce: function(obj,prev) { prev.csum += obj.value; }, initial: { csum: 0 } })
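All three queries compute the same thing: filter on key1, then sum value per key2. Here is a minimal pure-Python sketch of that logic on a handful of made-up documents, with no MongoDB involved. The sample documents and the helpers `map_reduce_sum` and `match_group_sum` are hypothetical and only mimic the shape of the two approaches.

```python
from collections import defaultdict

# Made-up sample documents with the same fields as the benchmark data.
docs = [
    {"key1": "a", "key2": "x", "value": 10},
    {"key1": "a", "key2": "y", "value": 10},
    {"key1": "a", "key2": "x", "value": 10},
    {"key1": "b", "key2": "x", "value": 10},
]


def map_reduce_sum(docs, key1):
    """Map/reduce shape: emit (key2, value) pairs, then reduce each key's list."""
    emitted = defaultdict(list)
    for doc in docs:
        if doc["key1"] == key1:                        # the `query` option
            emitted[doc["key2"]].append(doc["value"])  # emit(this.key2, this.value)
    return {k: sum(vs) for k, vs in emitted.items()}   # reduce(key, values)


def match_group_sum(docs, key1):
    """Pipeline shape: a $match stage, then $group with a running $sum."""
    totals = defaultdict(int)
    for doc in (d for d in docs if d["key1"] == key1):  # $match
        totals[doc["key2"]] += doc["value"]             # $group with $sum
    return dict(totals)


result_mr = map_reduce_sum(docs, "a")
result_agg = match_group_sum(docs, "a")
print(result_mr == result_agg, result_agg)
```

The sketch only shows that the two shapes agree on the result; it says nothing about how fast either runs inside the server.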

3 Comments

group is not part of the aggregation framework; it's part of map/reduce. That's why it has a reduce function. See the difference here: docs.mongodb.org/manual/reference/command/group and docs.mongodb.org/manual/reference/aggregation/#_S_group If you were using the aggregation framework, you would call db.collection.aggregate( [ pipeline ] )
I have a suggestion: why don't you take out the query, run the same thing on your entire collection, and see if there is a difference in performance?
Another problem with your benchmark is that you cleared the OS cache, so you were mostly measuring the time it takes to page the data into RAM. That dwarfs the actual performance numbers, and it's not a realistic scenario.
