Query performance issue for large nested data in MongoDB

Question

I'm trying to query a large dataset called 'tasks' containing 187297 documents. The tasks are nested in another dataset called 'workers', which is in turn nested in a collection called 'production_units'.

production_units -> workers -> tasks

(BTW this is a simplified version of production_units):

[{
    "_id": ObjectId("5aca27b926974863ed9f01ab"),
    "name": "Z",
    "workers": [{
        "name": "X Y",
        "worker_number": 655,
        "employed": false,
        "_id": ObjectId("5aca27bd26974863ed9f0425"),
        "tasks": [{
            "_id": ObjectId("5ac9f6c2e1a668d6d39c1fd1"),
            "inbound_order_number": 3296,
            "task_number": 90,
            "minutes_elapsed": 120,
            "date": "2004-11-30",
            "start": 1101823200,
            "pieces_actual": 160,
            "pause_from": 1101812400,
            "pause_to": 1101814200
        }]
    }]
}]

In order to accomplish this I have used the following aggregation command:

db.production_units.aggregate([{
    '$project': {
        'workers': '$workers'
    }
}, {
    '$unwind': '$workers'
}, {
    '$project': {
        'tasks': '$workers.tasks',
        'worker_number': '$workers.worker_number'
    }
}, {
    '$unwind': '$tasks'
}, {
    '$project': {
        'task_number': '$tasks.task_number',
        'pieces_actual': '$tasks.pieces_actual',
        'minutes_elapsed': '$tasks.minutes_elapsed',
        'worker_number': 1,
        'start': '$tasks.start',
        'inbound_order_number': '$tasks.inbound_order_number',
        'pause_from': '$tasks.pause_from',
        'date': '$tasks.date',
        '_id': '$tasks._id',
        'pause_to': '$tasks.pause_to'
    }
}, {
    '$match': {
        'start': {
            '$exists': true
        }
    }
}, {
    '$group': {
        'entries_count': {
            '$sum': 1
        },
        '_id': null,
        'entries': {
            '$push': '$$ROOT'
        }
    }
}, {
    '$project': {
        'entries_count': 1,
        '_id': 0,
        'entries': 1
    }
}, {
    '$unwind': '$entries'
}, {
    '$project': {
        'task_number': '$entries.task_number',
        'pieces_actual': '$entries.pieces_actual',
        'minutes_elapsed': '$entries.minutes_elapsed',
        'worker_number': '$entries.worker_number',
        'start': '$entries.start',
        'inbound_order_number': '$entries.inbound_order_number',
        'pause_from': '$entries.pause_from',
        'date': '$entries.date',
        'entries_count': 1,
        '_id': '$entries._id',
        'pause_to': '$entries.pause_to'
    }
}, {
    '$sort': {
        'start': 1
    }
}, {
    '$skip': 187290
}, {
    '$limit': 10
}], {
    allowDiskUse: true
})

And the returned documents are:

{ "entries_count" : 187297, "task_number" : 100, "pieces_actual" : 68, "minutes_elapsed" : 102, "worker_number" : 411, "start" : 1594118400, "inbound_order_number" : 8569, "pause_from" : 1594119600, "date" : "2020-07-07", "_id" : ObjectId("5ac9f6d3e1a668d6d3a06351"), "pause_to" : 1594119600 }
{ "entries_count" : 187297, "task_number" : 130, "pieces_actual" : 20, "minutes_elapsed" : 30, "worker_number" : 549, "start" : 1596531600, "inbound_order_number" : 7683, "pause_from" : 1596538800, "date" : "2020-08-04", "_id" : ObjectId("5ac9f6cde1a668d6d39f1b26"), "pause_to" : 1596538800 }
{ "entries_count" : 187297, "task_number" : 210, "pieces_actual" : 84, "minutes_elapsed" : 180, "worker_number" : 734, "start" : 1601276400, "inbound_order_number" : 8330, "pause_from" : 1601290800, "date" : "2020-09-28", "_id" : ObjectId("5ac9f6d0e1a668d6d39fd677"), "pause_to" : 1601290800 }
{ "entries_count" : 187297, "task_number" : 20, "pieces_actual" : 64, "minutes_elapsed" : 90, "worker_number" : 114, "start" : 1601800200, "inbound_order_number" : 7690, "pause_from" : 1601809200, "date" : "2020-10-04", "_id" : ObjectId("5ac9f6cee1a668d6d39f3032"), "pause_to" : 1601811900 }
{ "entries_count" : 187297, "task_number" : 140, "pieces_actual" : 70, "minutes_elapsed" : 84, "worker_number" : 49, "start" : 1603721640, "inbound_order_number" : 4592, "pause_from" : 1603710000, "date" : "2020-10-26", "_id" : ObjectId("5ac9f6c8e1a668d6d39df664"), "pause_to" : 1603712700 }
{ "entries_count" : 187297, "task_number" : 80, "pieces_actual" : 20, "minutes_elapsed" : 30, "worker_number" : 277, "start" : 1796628600, "inbound_order_number" : 4655, "pause_from" : 1796641200, "date" : "2026-12-07", "_id" : ObjectId("5ac9f6c8e1a668d6d39e1fc0"), "pause_to" : 1796643900 }
{ "entries_count" : 187297, "task_number" : 40, "pieces_actual" : 79, "minutes_elapsed" : 123, "worker_number" : 96, "start" : 3802247580, "inbound_order_number" : 4592, "pause_from" : 3802244400, "date" : "2090-06-27", "_id" : ObjectId("5ac9f6c8e1a668d6d39de218"), "pause_to" : 3802244400 }

However, the query takes seconds to return the results instead of a few milliseconds. This is the result returned by the profiler:

db.system.profile.findOne().millis
3216

(UPDATE)

Even the following simplified count query takes 312 ms to execute rather than just a few milliseconds:

db.production_units.aggregate([{
        "$unwind": "$workers"
    }, {
        "$unwind": "$workers.tasks"
    },
    {
        "$count": "entries_count"
    }
])

This is what explain() returns for the query above:

{
    "stages" : [
        {
            "$cursor" : {
                "query" : {

                },
                "fields" : {
                    "workers" : 1,
                    "_id" : 0
                },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "my_db.production_units",
                    "indexFilterSet" : false,
                    "parsedQuery" : {

                    },
                    "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "direction" : "forward"
                    },
                    "rejectedPlans" : [ ]
                },
                "executionStats" : {
                    "executionSuccess" : true,
                    "nReturned" : 28,
                    "executionTimeMillis" : 13,
                    "totalKeysExamined" : 0,
                    "totalDocsExamined" : 28,
                    "executionStages" : {
                        "stage" : "COLLSCAN",
                        "nReturned" : 28,
                        "executionTimeMillisEstimate" : 0,
                        "works" : 30,
                        "advanced" : 28,
                        "needTime" : 1,
                        "needYield" : 0,
                        "saveState" : 1,
                        "restoreState" : 1,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "direction" : "forward",
                        "docsExamined" : 28
                    },
                    "allPlansExecution" : [ ]
                }
            }
        },
        {
            "$unwind" : {
                "path" : "$workers"
            }
        },
        {
            "$unwind" : {
                "path" : "$workers.tasks"
            }
        },
        {
            "$group" : {
                "_id" : {
                    "$const" : null
                },
                "entries_count" : {
                    "$sum" : {
                        "$const" : 1
                    }
                }
            }
        },
        {
            "$project" : {
                "_id" : false,
                "entries_count" : true
            }
        }
    ],
    "ok" : 1
}

I'm not an experienced DBA, so I don't know exactly what I'm missing in my aggregation pipeline to solve the performance issue I'm facing. I have also researched the problem, but without finding a solution.

What am I missing?

Have you tried placing the $match at the beginning of your aggregation? It will reduce the number of documents that pass through the later stages.
– Nicolas, Commented Apr 10, 2018 at 9:54
It's a bit hard to understand what you're trying to achieve with your query. Could you add a sample document from your collection and the expected result?
– felix, Commented Apr 10, 2018 at 9:54
@felix The query concatenates the task documents nested under each worker, and then returns the last 10 documents (to fit pagination). Are the posted documents enough, or do I need to provide more information?
– user6039980, Commented Apr 10, 2018 at 10:11
@Nicolas Sorry, but I don't understand exactly what you mean. Do you mean an empty $match (i.e. $match: {})?
– user6039980, Commented Apr 10, 2018 at 10:12
I mean you can put `$match: {"workers.tasks.start": {$exists: true}}` first, to filter the documents you need.
– Nicolas, Commented Apr 10, 2018 at 10:16

felix · Accepted Answer · 2018-04-10 13:45:12Z

3

Without the explain() output of the query, it's impossible to know for sure where the bottleneck is. However, here is some advice on how to improve it:

Use a single `$project` stage at the end of the pipeline

The query contains 5 $project stages when only one is actually needed. This can add a lot of overhead, especially when applied to a large number of documents. Instead, use dot notation to address nested fields, for example:

{ "$unwind": "$workers.tasks" }

Call `$match` as early as possible

$match removes documents from the pipeline, so add it as early as possible so that the later aggregation stages run on fewer documents.
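
As a rough sketch of this advice (and of the first-stage `$match` suggested in the comments), the pipeline below filters twice: once before `$unwind`, to drop whole production units that have no task with a `start` field, and once after it, to drop individual tasks. The helper name `buildMatchEarlyStages` is hypothetical; the array it returns is only meant to be passed to `aggregate()`.

```javascript
// Hypothetical helper: returns the filtering part of the pipeline.
function buildMatchEarlyStages() {
  return [
    // 1. Runs on the stored documents, before any $unwind: skips every
    //    production unit in which no nested task has a `start` field.
    { $match: { "workers.tasks.start": { $exists: true } } },
    { $unwind: "$workers" },
    { $unwind: "$workers.tasks" },
    // 2. Still needed after unwinding: a unit that passed step 1 may
    //    contain individual tasks without a usable `start`.
    { $match: { "workers.tasks.start": { $ne: null } } },
  ];
}

// Usage in the shell (collection name taken from the question):
//   db.production_units.aggregate(buildMatchEarlyStages())
```

With only 28 top-level documents, as in the question's explain() output, the first `$match` probably removes little. It matters more when many production units contain no matching tasks.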

Call `$skip` and `$limit` before `$project`

As the query returns only 10 documents, there is no need to apply the $project stage to the other 180,000 or so documents.
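
To illustrate why the order matters, here is a plain JavaScript analogy (not MongoDB code) that counts how often a projection function runs when it is applied before versus after paging. The data and all names are made up for the illustration, and newer MongoDB versions may already reorder some of these stages on their own.

```javascript
// Stand-in for the unwound tasks: 1000 fake documents with a `start` field.
const tasks = Array.from({ length: 1000 }, (_, i) => ({ start: 1000 - i, extra: i }));

let projectCalls = 0;
// Stand-in for $project: keeps only `start` and counts its invocations.
const project = (doc) => { projectCalls += 1; return { start: doc.start }; };
const byStart = (a, b) => a.start - b.start;

// Project first, then sort / skip / limit: every document is projected.
projectCalls = 0;
const projectFirst = tasks.map(project).sort(byStart).slice(990, 1000);
const callsProjectFirst = projectCalls;

// Sort / skip / limit first, then project: only the page is projected.
projectCalls = 0;
const projectLast = [...tasks].sort(byStart).slice(990, 1000).map(project);
const callsProjectLast = projectCalls;

console.log(callsProjectFirst, callsProjectLast); // prints "1000 10"
```

Both variants return the same page; only the amount of projection work differs.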

Properly index the field used for sorting

This is likely to be the bottleneck. Make sure that the field workers.tasks.start is indexed (see MongoDB createIndex() for details; ensureIndex() is its deprecated alias).
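
A minimal sketch of creating that index, assuming the `production_units` collection from the question. `createIndex()` is the current name of the method (`ensureIndex()` is a deprecated alias), and the wrapper function `createStartIndex` is hypothetical. As far as I know, MongoDB can only use an index for stages that run before `$unwind`, such as a leading `$match`, so it is worth checking with explain() whether the index is actually picked up.

```javascript
// Ascending index on the nested field. Because `workers` and `tasks` are
// arrays, MongoDB builds it as a multikey index.
const startIndexSpec = { "workers.tasks.start": 1 };

// `collection` is assumed to provide createIndex(), as collections in the
// mongo shell and in the Node.js driver do.
function createStartIndex(collection) {
  return collection.createIndex(startIndexSpec);
}

// Usage in the shell (collection name taken from the question):
//   createStartIndex(db.production_units)
```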

Do not compute the number of documents returned in the query

Instead of using the $group/$unwind stages to count the matching documents, run a second query at the same time that only counts them.

The main query now looks like:

db.collection.aggregate([{
        "$unwind": "$workers"
    }, {
        "$unwind": "$workers.tasks"
    }, {
        "$match": {
            "workers.tasks.start": {
                "$ne": null
            }
        }
    },
    {
        "$sort": {
            "workers.tasks.start": 1
        }
    }, {
        "$skip": 0
    }, {
        "$limit": 10
    },
    {
        "$project": {
            "task_number": "$workers.tasks.task_number",
            "pieces_actual": "$workers.tasks.pieces_actual",
            "minutes_elapsed": "$workers.tasks.minutes_elapsed",
            "worker_number": "$workers.worker_number",
            "start": "$workers.tasks.start",
            "inbound_order_number": "$workers.tasks.inbound_order_number",
            "pause_from": "$workers.tasks.pause_from",
            "date": "$workers.tasks.date",
            "_id": "$workers.tasks._id",
            "pause_to": "$workers.tasks.pause_to"
        }
    }
])

You can try it here: mongoplayground.net/p/yua7qspo2Jj

The count query would be:

db.collection.aggregate([{
        "$unwind": "$workers"
    }, {
        "$unwind": "$workers.tasks"
    }, {
        "$match": {
            "workers.tasks.start": {
                "$ne": null
            }
        }
    },
    {
        "$count": "entries_count"
    }
])
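
One way to run the two queries at the same time is sketched below for an application using the Node.js driver. It assumes `collection` is a driver collection whose `aggregate(pipeline).toArray()` returns a promise. The function name `fetchPage` and its parameters are hypothetical, and the final `$project` stage is left out for brevity.

```javascript
// Stages shared by the page query and the count query.
const baseStages = [
  { $unwind: "$workers" },
  { $unwind: "$workers.tasks" },
  { $match: { "workers.tasks.start": { $ne: null } } },
];

// Hypothetical helper: fetches one page and the total count concurrently.
async function fetchPage(collection, skip, limit) {
  const pagePipeline = [
    ...baseStages,
    { $sort: { "workers.tasks.start": 1 } },
    { $skip: skip },
    { $limit: limit },
  ];
  const countPipeline = [...baseStages, { $count: "entries_count" }];

  // Both aggregations are started before either result is awaited.
  const [entries, countDocs] = await Promise.all([
    collection.aggregate(pagePipeline).toArray(),
    collection.aggregate(countPipeline).toArray(),
  ]);

  // $count emits no document at all when nothing matched.
  const entriesCount = countDocs.length > 0 ? countDocs[0].entries_count : 0;
  return { entries, entriesCount };
}
```

Since the total only changes when tasks are added or removed, the count could also be fetched once and reused while the user pages through the results.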


4 Comments

user6039980 Over a year ago

Thanks for the great answer. But after testing the queries you pointed out above, I really don't know why the count query takes 312 ms while the main query only takes 26 ms. What's causing the overhead for the $count?

felix Over a year ago

@Kais it's hard to say... You may want to take a look at the explain() of the query; this will tell you exactly where MongoDB is spending time.

user6039980 Over a year ago

I updated my question in order to provide the results of explain().

felix Over a year ago

@Ashish currently it's not possible in json mode, but you can achieve this in mgodatagen mode (for example, mongoplayground.net/p/yIEueXLsH6R ). If you find any bugs, please report them here: github.com/feliixx/mongoplayground/issues !
