2

I am completely new to MongoDB and wanted to compare query performance of a NoSQL data model relative to its relational database counter part. I wrote this into MongoDB shell

// Make 10 businesses
// Each business has 10 locations
// Each location has 10 departments
// Each department has 10 teams
// Each team has 100 employees
(new Array(10)).fill(0).forEach(_=>
    db.businesses.insert({
        "name":"Business Name",
        "locations":(new Array(10)).fill(0).map(_=>({
            "name":"Office Location",
            "departments":(new Array(10)).fill(0).map(_=>({
                "name":"Department",
                "teams":(new Array(10)).fill(0).map(_=>({
                    "name":"Team Name",
                    "employees":(new Array(100)).fill(0).map(_=>({
                        "age":Math.floor(Math.random()*100)
                    }))
                }))
            }))
        }))
    })
);

Then I attempted the equivalent of MySQL's EXPLAIN SELECT age,name,(and a few other fields) FROM employees WHERE age >= 50 ORDER BY age DESC by writing this statement:

db.businesses.aggregate([
    { $unwind: "$locations" },
    { $unwind: "$locations.departments" },
    { $unwind: "$locations.departments.teams" },
    { $unwind: "$locations.departments.teams.employees" },
    { $project: { _id: 0, age: "$locations.departments.teams.employees.age" } },
    { $match: { "age": { $gte: 50 }} },
    { $sort: {"age" : -1}}
]).explain("executionStats")

The result was:

"errmsg" : "Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.",

So I deleted the sort clause and try to get an explain. But the result was:

TypeError: db.businesses.aggregate(...).explain is not a function

So my questions are:

  1. Primarily, I want to know the performance difference of SELECT age FROM employees WHERE age >= 50 ORDER BY age DESC when compared with the MongoDB's aggregate query counter part. Is it more or less the same? Will one be substantially faster or more performant than the other?

  2. Alternatively, how do I fix my MongoDB query so that I can get performance details to compare against my MySQL query counter part?

4 Answers 4

2

Employees are single entities; thus, you probably don't want to model age of a team member so deeply in the rich structure of departments and locations and teams. It is perfectly fine to have a separate employees collection and simply do:

db.businesses.aggregate([
{$match: {"age": {$gt: 50} }}
,{$sort: {"age": -1} }
]);

Deep in your businesses collection you can have:

{ teams: [ {name: "T1", employees: [ "E1", "E34" ]} ] }

Alternately, try this:

db.businesses.aggregate([ your pipeline] ,{allowDiskUse:true});

The OP has a setup of 10 biz -> 10 loc -> 10 depts -> 10 teams -> 100 emps. The first 3 unwinds creates a 10000x explosion of data but the last one is 100x beyond that. We can shrink the hit by using $filter:

db.businesses.aggregate([
{ $unwind: "$locations" },
{ $unwind: "$locations.departments" },
{ $unwind: "$locations.departments.teams" },

{$project: {
        XX: {$filter: {
                    input: "$locations.departments.teams.employees",
                    as: "z",
                    cond: {$gte: [ "$$z.age", 50] }
            }}
    }}
,{$unwind: "$XX"}
,{$sort: {"XX.age":-1}}])
Sign up to request clarification or add additional context in comments.

4 Comments

Won't having employees as separate entities bring me back to my original problem of this stackoverflow.com/questions/59080541/… ? Which is what spawned a series of 4-5 questions that got me to this current question.
And results of my experiment, the mongodb query took 3 seconds when I used {allowDiskUse:true}. The MySQL SELECT age FROM t_employee WHERE age => 50 ORDER BY age DESC took 1.2 seconds. My guess is MySQL doesn't enable allowDiskUse by default. No indices were applied in either scenario. Is this result expected for this scenario? Others also see MongoDB being 3x slower than MySQL?
Working on a map based alternative to bypass the 100X explosion of employee at the end....
Thanks, with the filter approach plus the {allowDiskUse:true}, the MongoDB query took 2 seconds. So that's good improvement from the 3 seconds!
1

You better move $match to the first pipeline, because aggregation framework loses index after first pipeline, also i guess you don't need to unwind those arrays.

Comments

0

i was able to get a result in 1.5 seconds without any indexes by modifying the query like the following:

db.businesses.aggregate([
    {
        $unwind: "$locations"
    },
    {
        $unwind: "$locations.departments"
    },
    {
        $unwind: "$locations.departments.teams"
    },
    {
        $unwind: "$locations.departments.teams.employees"
    },
    {
        $match: {
            "locations.departments.teams.employees.age": {
                $gte: 50
            }
        }
    },
    {
        $project: {
            _id: 0,
            age: "$locations.departments.teams.employees.age"
        }
    },
    {
        $group: {
            _id: "$age"
        }
    },
    {
        $project: {
            _id: 0,
            age: "$_id"
        }
    },
    {
        $sort: {
            "age": - 1
        }
    }
], {
    explain: false
})

3 Comments

Thanks, the {$group:{_id:"$age"}} effectively does the equivalent of GROUP BY age right? Which means if my employee has a name field, then I won't accurately get back the name of the employee along with their age (ie. SELECT name,age FROM employee)?
When I remove the {$group:_id:"$age"}, but leave in the sort filter, my mongo still crashes from not-enough memory. I don't have the same problem with MySQL. No indices were applied in either scenario.
In my MySQL SELECT name,age FROM employee WHERE age>=50 ORDER by name DESC also takes about 1 second. No indices were applied. Still can't get the MongoDB equivalent to not crash...
0

There is another way to address the overall problem, although it is not apples to apples with the OP question. The goal is to find all age >= 50 and sort. Below is an example that "almost" does so and throws in the loc,dept,team as well in case you were wondering how to get that too, but you can take out the lines to get just the emps. Now, this is unsorted -- but an argument can be made that the DB engine isn't going to do any better job of sorting this than the client and all the data has to come over the wire anyway. And the client can use more sophisticated coding tricks to dig thru to the age field and sort it.

c = db.foo.aggregate([
{$project: {XX:
  {$map: {input: "$locations", as:"z", in:
          {$map: {input: "$$z.departments", as:"z2", in:
                  {$map: {input: "$$z2.teams", as:"z3", in:
                          {loc: "$$z.name",  // remove if you want
                           dept: "$$z2.name", // remove if you want
                           team: "$$z3.name",  // remove if you want
                           emps: {$filter: {input: "$$z3.employees",
                                     as: "z4",
                                     cond: {$gt: [ "$$z4.age", 50] }
                                    }}
                          }
                      }}
              }}
      }}
    }}
]);

ages = [];

c.forEach(function(biz) {
    biz['XX'].forEach(function(locs) {
        locs.forEach(function(depts) {
            depts.forEach(function(teams) {
                teams['emps'].forEach(function(emp) {
                    ages.push(emp['age']);
                                    });
                            });
                    });
            });
    });

print( ages.sort(function(a, b){return b-a}) );

99,98,97,96,95,94,92,92,84,81,78,77,76,72,71,67,66,65,65,64,63,62,62,61,59,59,57,57,57,56,55,54,52,51

On a MacBook Pro running MongoDB 4.0, we see the collection as follows:

Collection            Count   AvgSize          Unz  Xz  +Idx     TotIdx  Idx/doc
--------------------  ------- -------- -G--M------  --- ---- ---M------  -------
                 foo       10   2238682     22386820  4.0    0      16384    0

Given the random age between 0 and 100, it is not surprising that every loc/dept/team has age >= 50 and that the total number of bytes returned is about half. Note, however that the total time to set up the agg -- not return all the bytes -- is ~700 millis.

697 millis to agg; 0.697
found 10
tot bytes 11536558

7 Comments

Also on your machine, does your query performance match that of MySQL?
I only have postgres. :-) And I wouldn't know the "best" way to model SQL to be performant. I could do a 4 way left outer join biz->loc->dept->teams->emps and filter for age >= 50 but I just haven't tried.
In my secnario, no joins would be necessary, because i'd only be selecting on fields from the employees table, and there's no need to reference any of the other tables. So you're simply doing a SELECT name from employees WHERE age >= 50 ORDER by age DESC. I'm curious how Postgres performs relative to MongoDB query you showed above!
Oh! I thought the goal was to identify emps >= 50 who were on teams (based on a sample nested SELECT example in your other post), not just the basic emps >= 50. No?
That's correct, i just want the names of emps with age >= 50 sorted in descending order of age. I don't need details of their business, locations, departments or teams. I can't find a MongoDB query that can perform as fast or efficiently as SELECT name FROM employees WHERE age >= 50 ORDER BY age DESC
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.