
I have 100 documents in my MongoDB collection. Each of them may be a duplicate of other document(s) under different conditions, such as firstName & lastName, email, or mobile phone.

I am trying to mapReduce these 100 documents into key-value pairs, like grouping.

Everything works fine until the 101st duplicate record appears in the DB.

The mapReduce output for the documents that are duplicates of the 101st record is corrupted.

For example:

I am working on firstName & lastName now.

When the DB contains 100 documents, the result contains

{
    _id: {
        firstName: "foo",
        lastName: "bar"
    },
    value: {
        count: 20,
        duplicate: [{
            id: ObjectId("/*an object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-01T00:00:00.000Z")
        },{
            id: ObjectId("/*another object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-02T00:00:00.000Z")
        },...]
    }
}

It is exactly what I want, but...

when the DB contains more than 100 possibly duplicated documents, the result becomes like this.

Let's say the 101st document is

{
    firstName: "foo",
    lastName: "bar",
    email: "[email protected]",
    mobile: "019894793"
}

When the DB contains 101 documents, the result is:

{
    _id: {
        firstName: "foo",
        lastName: "bar"
    },
    value: {
        count: 21,
        duplicate: [{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        },{
            id: ObjectId("/*another object id*/"),
            fullName: "foo bar",
            DOB: ISODate("2000-01-02T00:00:00.000Z")
        }]
    }
}

When the DB contains 102 documents, the result is:

{
    _id: {
        firstName: "foo",
        lastName: "bar"
    },
    value: {
        count: 22,
        duplicate: [{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        },{
            id: undefined,
            fullName: undefined,
            DOB: undefined
        }]
    }
}

I found another topic on Stack Overflow with a similar issue to mine, but the answer does not work for me: MapReduce results seem limited to 100?

Any ideas?

Edit:

Original source code:

var map = function () {
    var value = {
        count: 1,
        userId: this._id
    };
    emit({lastName: this.lastName, firstName: this.firstName}, value);
};

var reduce = function (key, values) {
    var reducedObj = {
        count: 0,
        userIds: []
    };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        reducedObj.userIds.push(value.userId);
    });
    return reducedObj;
};

Source code now:

var map = function () {
    var value = {
        count: 1,
        users: [this]
    };
    emit({lastName: this.lastName, firstName: this.firstName}, value);
};

var reduce = function (key, values) {
    var reducedObj = {
        count: 0,
        users: []
    };
    values.forEach(function (value) {
        reducedObj.count += value.count;
        reducedObj.users = reducedObj.users.concat(value.users); // or using the forEach method

        // value.users.forEach(function (user) {
        //     reducedObj.users.push(user);
        // });

    });
    return reducedObj;
};

I don't understand why the original code would fail, as I was also pushing a value (userId) to reducedObj.userIds.

Are there some problems about the value that I emitted in map function?

1 Comment
  • Do your map and reduce functions produce objects with the exact same shape? See stackoverflow.com/questions/14138344/…. If you're still stuck, please edit your question to include your map and reduce functions. Commented Jan 26, 2015 at 14:07

1 Answer


Explaining the problem


This is a common mapReduce trap, and part of the problem is that the questions you are finding don't have answers that explain it clearly or even correctly. So an answer is justified here.

The point that is often missed, or at least misunderstood, is here in the documentation:

  • MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.

And adding to that just a little later down the page:

  • the type of the return object must be identical to the type of the value emitted by the map function.

What this means in the context of your question is that at a certain point there are "too many" duplicate key values for a single reduce pass to handle, as it can for a smaller number of documents. By design, the reduce method is called multiple times, often taking the "output" of already-reduced data as part of its "input" for yet another pass.

This is how mapReduce is designed to handle very large datasets: by processing everything in "chunks" until it finally "reduces" down to a single grouped result per key. This is why the quoted statement matters: what comes out of emit and what reduce returns must be structured exactly the same, so the reduce code can handle either one as input.
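To make the re-reduce behavior concrete, here is a minimal plain-JavaScript sketch (runnable in Node, no MongoDB required) that imitates the engine invoking reduce in chunks: the first pass's output becomes one of the input values to the second pass. The helper names and sample documents are illustrative, not part of the MongoDB API.

```javascript
// Emit shape with 'duplicate' always an array, matching the reduce output shape.
function mapValue(doc) {
  return { count: 1, duplicate: [doc] };
}

function reduce(key, values) {
  var reduced = { count: 0, duplicate: [] };
  values.forEach(function (value) {
    reduced.count += value.count;
    value.duplicate.forEach(function (d) {
      reduced.duplicate.push(d);
    });
  });
  return reduced;
}

// Simulate MongoDB calling reduce more than once for the same key:
// the output of the first pass is fed back in as an input value.
var docs = [{ id: 1 }, { id: 2 }, { id: 3 }];
var firstPass = reduce("foo bar", docs.slice(0, 2).map(mapValue));
var secondPass = reduce("foo bar", [firstPass, mapValue(docs[2])]);

console.log(secondPass.count);            // 3
console.log(secondPass.duplicate.length); // 3
```

Because `firstPass` has exactly the same shape as a mapped value, the second call works; the original question's code emitted `{ count, userId }` but returned `{ count, userIds }`, so on the second pass `value.userId` was `undefined`, which is exactly the corruption seen past 100 documents.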

Solving the problem


You correct this by fixing up how you are both emitting the data in the map and how you also return and process in the reduce function:

db.collection.mapReduce(
    function() {
        emit(
            { "firstName": this.firstName, "lastName": this.lastName },
            { "count": 1, "duplicate": [this] } // Note [this]
        )
    },
    function(key,values) {
        var reduced = { "count": 0, "duplicate": [] };
        values.forEach(function(value) {
            reduced.count += value.count;
            value.duplicate.forEach(function(duplicate) {
                reduced.duplicate.push(duplicate);
            });
        });

        return reduced;
    },
    { 
       "out": { "inline": 1 },
    }
)

The key points can be seen in both the content to emit and the first line of the reduce function. Essentially these present the same structure. In the case of the emit it does not matter that the array being produced only has a single element, but you send it that way anyhow. Side by side:

    { "count": 1, "duplicate": [this] } // Note [this]
    // Same as
    var reduced = { "count": 0, "duplicate": [] };

That also means that the remainder of the reduce function will always assume that the "duplicate" content is in fact an array, because that is how it came as original input and is also how it will be returned:

        values.forEach(function(value) {
            reduced.count += value.count;
            value.duplicate.forEach(function(duplicate) {
                reduced.duplicate.push(duplicate);
            });
        });

        return reduced;

Alternate Solution


The other reason for an answer is that, considering the output you are expecting, this would in fact be much better suited to the aggregation framework. It's going to do this a lot faster than mapReduce can, and it is far simpler to code:

db.collection.aggregate([
    { "$group": {
       "_id": { "firstName": "$firstName", "lastName": "$lastName" },
       "duplicate": { "$push": "$$ROOT" },
       "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } }}
])

That's all it is. You can write out to a collection by adding an $out stage where required. But with either mapReduce or aggregate, you are still subject to the same 16MB document size restriction when pushing your "duplicate" items into an array.

Also note that you can simply do something here that mapReduce cannot: just "omit" any items that are not in fact a "duplicate" from the results. The mapReduce method cannot do this without first producing output to a collection and then "filtering" the results in a separate query.
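To illustrate what the $group/$match pair computes, here is a plain-JavaScript sketch over an in-memory array (runnable in Node). It mirrors the pipeline's logic only; the sample documents and field values are made up for illustration and this is not the aggregation framework itself.

```javascript
// In-memory sketch of: $group by firstName/lastName with $push + $sum,
// then $match on count > 1 to keep only the real duplicates.
var people = [
  { firstName: "foo", lastName: "bar", mobile: "019894793" },
  { firstName: "foo", lastName: "bar", mobile: "019894794" },
  { firstName: "baz", lastName: "qux", mobile: "019894795" } // unique, filtered out
];

var groups = {};
people.forEach(function (doc) {
  var key = doc.firstName + "|" + doc.lastName;
  if (!groups[key]) {
    groups[key] = {
      _id: { firstName: doc.firstName, lastName: doc.lastName },
      duplicate: [], // like { "$push": "$$ROOT" }
      count: 0       // like { "$sum": 1 }
    };
  }
  groups[key].duplicate.push(doc);
  groups[key].count += 1;
});

// The $match stage: keep only groups with more than one member.
var result = Object.keys(groups)
  .map(function (k) { return groups[k]; })
  .filter(function (g) { return g.count > 1; });

console.log(result.length);   // 1
console.log(result[0].count); // 2
```

The `.filter()` step is the part mapReduce has no equivalent for: the unique "baz qux" group simply never appears in the output.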

That core documentation itself quotes:

  • NOTE: For most aggregation operations, the Aggregation Pipeline provides better performance and more coherent interface. However, map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline.

So it's really a case of weighing up which is better suited to the problem at hand.


5 Comments

The mapReduce solution works for me, but I am still not sure why it would fail, please take a look of the edited question. Must I emit the value with an array property in the map function?
p.s. "out": { "inline": 1 } does not work, it should be "out": "inline" (I am using mongoDB 2.6.7)
@kit Nope the format of "out": { } is exactly like that. Other options include "replace" and "merge" as the "key" with the "value" as the "collection name". So I have it right and you have typed something different. Unless you're talking about pymongo implementations, which does this differently.
Nevermind, maybe both are correct. In case I am using Robomongo to run the script.
Your answer finally explained this to me. Thank you
