
I am implementing a small application using MongoDB as a backend. In this application I have a data structure where the documents contain a field holding an array of subdocuments.

I use the following use case as a basis: http://docs.mongodb.org/manual/use-cases/inventory-management/

As you can see from the example, each document has a field called items, which is an array of subdocuments.

{
    _id: 42,
    last_modified: ISODate("2012-03-09T20:55:36Z"),
    status: 'active',
    items: [
        { sku: '00e8da9b', qty: 1, item_details: {...} },
        { sku: '0ab42f88', qty: 4, item_details: {...} }
    ]
}

This fits me perfectly, except for one problem: I want to count each unique item (with "sku" as the unique identifier key) across the entire collection, where each document adds 1 to the count (multiple instances of the same "sku" in the same document still count only once). E.g. I would like this result:

{ sku: '00e8da9b', doc_count: 1 }, { sku: '0ab42f88', doc_count: 9 }

After reading up on MongoDB, I am quite confused about how to do this (fast) with a complex schema like the one described above. If I have understood the otherwise excellent documentation correctly, such an operation can perhaps be achieved using either the aggregation framework or the map/reduce framework, but this is where I need some input:

  • Which framework would be better suited to achieve the result I am looking for, given the complexity of the structure?
  • What kind of indexes would be preferred in order to gain the best possible performance out of the chosen framework?

2 Answers


MapReduce is slow, but it can handle very large data sets. The aggregation framework, on the other hand, is a little quicker, but will struggle with large data volumes.

The trouble with the structure you've shown is that you need to "$unwind" the arrays to crack open the data. This means creating a new document for every array item, and with the aggregation framework it needs to do this in memory. So if you have 1000 documents with 100 array elements each, it will need to build a stream of 100,000 documents in order to $group and count them.

You might want to consider whether there's a schema layout that will serve your queries better, but if you want to do it with the aggregation framework, here's how you could do it (with some sample data, so the whole script will drop into the shell):

db.so.remove();
db.so.ensureIndex({ "items.sku": 1 }, { unique: false });
db.so.insert([
    {
        _id: 42,
        last_modified: ISODate("2012-03-09T20:55:36Z"),
        status: 'active',
        items: [
            { sku: '00e8da9b', qty: 1, item_details: {} },
            { sku: '0ab42f88', qty: 4, item_details: {} },
            { sku: '0ab42f88', qty: 4, item_details: {} },
            { sku: '0ab42f88', qty: 4, item_details: {} }
        ]
    },
    {
        _id: 43,
        last_modified: ISODate("2012-03-09T20:55:36Z"),
        status: 'active',
        items: [
            { sku: '00e8da9b', qty: 1, item_details: {} },
            { sku: '0ab42f88', qty: 4, item_details: {} }
        ]
    }
]);


db.so.runCommand("aggregate", {
    pipeline: [
        {   // optional filter to exclude inactive elements - can be removed    
            // you'll want an index on this if you use it too
            $match: { status: "active" }
        },
        // unwind creates a doc for every array element
        { $unwind: "$items" },
        {
            $group: {
                // group by unique SKU, but you only wanted to count a SKU once per doc id
                _id: { _id: "$_id", sku: "$items.sku" },
            }
        },
        {
            $group: {
                // group by unique SKU, and count them
                _id: { sku:"$_id.sku" },
                doc_count: { $sum: 1 },
            }
        }
    ]
    //,explain:true
})
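
Against the two sample documents above, that should return something like the following (note that document 42's three '0ab42f88' entries count only once):

{ "result" : [
    { "_id" : { "sku" : "0ab42f88" }, "doc_count" : 2 },
    { "_id" : { "sku" : "00e8da9b" }, "doc_count" : 2 }
], "ok" : 1 }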

Note that I've $group'd twice: because you said that a SKU should only count once per document, we first need to sort out the unique doc/sku pairs and then count them up.

If you want the output shaped a little differently (in other words, EXACTLY like in your sample), we can $project them.
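
For example, a final stage along these lines (a sketch added for illustration, not part of the pipeline above) would flatten the compound group key into the exact { sku, doc_count } shape from the question:

{
    $project: {
        _id: 0,              // drop the compound group key
        sku: "$_id.sku",     // promote the sku to the top level
        doc_count: 1         // keep the computed count
    }
}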


2 Comments

Cool. I will try your input tomorrow. It's really nice of you to explain what is going on. The MongoDB queries for aggregate can be a bit difficult to read. For my usage I guess I will have approximately 60,000 documents with about 400,000 items spread throughout the documents.
With those kinds of numbers I suspect you'll want to pre-calculate if you can, and that means MR rather than the AF. Unless you need live queries, and provided you can pre-calc, you'll be much better off with MR.
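
For reference, a minimal sketch of what that map/reduce could look like (my own illustration, assuming the same so collection and a hypothetical output collection named sku_doc_counts):

// emit each distinct sku at most once per document
var map = function () {
    var seen = {};
    this.items.forEach(function (item) {
        if (!seen[item.sku]) {
            seen[item.sku] = true;
            emit(item.sku, 1);
        }
    });
};

// sum the per-document emits for each sku
var reduce = function (key, values) {
    return Array.sum(values);
};

db.so.mapReduce(map, reduce, { query: { status: "active" }, out: "sku_doc_counts" });
db.sku_doc_counts.find();  // yields { _id: <sku>, value: <doc_count> }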

With the latest mongo build (it may be true for other builds too), I've found that a slightly different version of cirrus's answer performs faster and consumes less memory. I don't know the details of why; it seems that with this version mongo somehow has more opportunity to optimize the pipeline.

db.so.runCommand("aggregate", {
    pipeline: [
        { $unwind: "$items" },
        {
            $group: {
                // create array of unique sku's (or set) per id
                _id: { id: "$_id"},
                sku: {$addToSet: "$items.sku"}
            }
        },
        // unroll all sets
        { $unwind: "$sku" },
        {
            $group: {
                // then count unique values per each Id
                _id: { id: "$_id.id", sku:"$sku" },
                count: { $sum: 1 },
            }
        }
    ]
})

To match exactly the same format as asked for in the question, the grouping by "_id" should be skipped in the final stage, so that it groups on the sku alone.
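
In other words (my reading of that note, not tested against the original data), the final $group would become:

{
    $group: {
        // count each sku once per original document
        _id: { sku: "$sku" },
        doc_count: { $sum: 1 }
    }
}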

3 Comments

Is it because you're not $matching on status?
I don't think so (I did performance measurements on my own data, using the same approaches, and here $match is omitted for the sake of simplicity). Actually, $matching could even make performance better (if indexed properly), in case it narrows the amount of data for the further steps. I think it is because one big $group is more difficult for Mongo to optimize than several smaller pipeline stages.
yeah, that seems to make sense.
