
I have been using the following MySQL command to construct a heatmap from log data. However, I have a new data set that is stored in a Mongo database and I need to run the same command.

 select concat(a.packages, '&', b.packages) "Concurrent Packages",
 count(*) "Count"
 from data a
 cross join data b
 where a.packages<b.packages and a.jobID=b.jobID
 group by a.packages, b.packages
 order by a.packages, b.packages;

Keep in mind that a and b are not separate tables; they are aliases for two copies of the data table, joined to itself. The packages column holds the package name, and jobID is the field I want to check for matches. In other words, if two packages are used within the same job, I want to increment the concurrent usage count for that pair. How can I write a similar query in Mongo?

What have you tried? Have you looked at this page for inspiration? (Commented Mar 8, 2013 at 2:52)

2 Answers

This is not a "join" of different documents; it is an operation within one document, and can be done in MongoDB.

You have a SQL table "data" like this:
  jobID    TEXT,
  packages TEXT

The best way to store this in MongoDB will be a collection called "data", containing one document per JobID that contains an array of packages:

{
    _id: <JobID>,
    packages: [
        "packageA",
        "packageB",
        ....
    ]
}
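
If the log data starts out as flat rows shaped like the SQL table, it has to be grouped per job before it fits this layout. The following is only a sketch of that grouping step in plain JavaScript; the function name toJobDocuments and the sample rows are made up for illustration, and the row fields jobID and packages are assumed to match the columns used in the question's query.

```javascript
// Hypothetical flat rows: one row per (job, package), like the SQL table.
var rows = [
    { jobID: "Job01", packages: "pA" },
    { jobID: "Job01", packages: "pB" },
    { jobID: "Job02", packages: "pA" },
    { jobID: "Job02", packages: "pC" }
];

// Group rows into one document per job: { _id: <jobID>, packages: [...] }.
function toJobDocuments(rows) {
    var docs = [];
    var index = {};   // jobID -> position of that job's document in docs
    rows.forEach(function (row) {
        if (!Object.prototype.hasOwnProperty.call(index, row.jobID)) {
            index[row.jobID] = docs.length;
            docs.push({ _id: row.jobID, packages: [] });
        }
        var doc = docs[index[row.jobID]];
        // Store each package only once per job.
        if (doc.packages.indexOf(row.packages) === -1) {
            doc.packages.push(row.packages);
        }
    });
    // Keep each packages array sorted.
    docs.forEach(function (doc) { doc.packages.sort(); });
    return docs;
}

var docs = toJobDocuments(rows);
```

The resulting array could then be inserted into the collection, for example with db.data.insertMany(docs) in a recent shell.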

[ Note: you could also store the whole data table as a single MongoDB document containing an array of jobs, each of which contains an array of packages. This is not recommended: you might hit the 16MB document size limit, and nested arrays are not (yet) well supported by many queries, which matters if you want to use the data for other purposes as well. ]

Now, how do you get a result like this?

{ pair: [ "packageA", "packageB" ], count: 20 },
{ pair: [ "packageA", "packageC" ], count: 11 },
...

As there is no built-in "cross join" of two arrays in MongoDB, you'll have to program it out in the map function of a mapReduce(), emitting each pair of packages as a key:

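mapf = function () {
    // Keep a reference to the document for use inside the nested callback.
    var that = this;
    this.packages.forEach( function( p1 ) {
        that.packages.forEach( function( p2 ) {
            // Emit each unordered pair once, smaller name first.
            if ( p1 < p2 ) {
                var key = { "pair": [ p1, p2 ] };
                emit( key, 1 );
            }
        });
    });
};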

[ Note: this could be optimized, if the packages arrays were sorted ]
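
One way to read the note above: with a sorted array, the inner loop only needs to look at the elements after the current one, so the p1 < p2 comparison against every element goes away. The sketch below shows that idea as a standalone helper in plain JavaScript. The name sortedPairs is hypothetical, and the helper sorts a copy of the array itself, which is only needed if the stored arrays are not already sorted.

```javascript
// Return every pair [p1, p2] with p1 < p2 from an array of package names.
function sortedPairs(packages) {
    // Sort a copy so the caller's array is left untouched.
    var sorted = packages.slice().sort();
    var pairs = [];
    for (var i = 0; i < sorted.length; i++) {
        // Only elements after position i can be greater than or equal to sorted[i].
        for (var j = i + 1; j < sorted.length; j++) {
            // Skip equal names so a package is never paired with itself.
            if (sorted[i] !== sorted[j]) {
                pairs.push([sorted[i], sorted[j]]);
            }
        }
    }
    return pairs;
}

var examplePairs = sortedPairs(["pC", "pA", "pB"]);
```

In the map function, the same two loops could be inlined in place of the nested forEach calls, emitting { "pair": [ sorted[i], sorted[j] ] } with a value of 1 for each pair.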

The reduce function is nothing more than summing up the counters for each key:

reducef = function( key, values ) {
    // Sum the partial counts emitted for this pair.
    var count = 0;
    values.forEach( function( value ) { count += value; } );
    return count;
};

So, for this example collection:

> db.data.find()
{ "_id" : "Job01", "packages" : [ "pA", "pB", "pC" ] }
{ "_id" : "Job02", "packages" : [ "pA", "pC" ] }
{ "_id" : "Job03", "packages" : [ "pA", "pB", "pD", "pE" ] }

we get the following result:

> db.data.mapReduce(
...     mapf,
...     reducef,
...     { out: 'pairs' }
... );
{
    "result" : "pairs",
    "timeMillis" : 443,
    "counts" : {
        "input" : 3,
        "emit" : 10,
        "reduce" : 2,
        "output" : 8
    },
    "ok" : 1,
}
> db.pairs.find()
{ "_id" : { "pair" : [ "pA", "pB" ] }, "value" : 2 }
{ "_id" : { "pair" : [ "pA", "pC" ] }, "value" : 2 }
{ "_id" : { "pair" : [ "pA", "pD" ] }, "value" : 1 }
{ "_id" : { "pair" : [ "pA", "pE" ] }, "value" : 1 }
{ "_id" : { "pair" : [ "pB", "pC" ] }, "value" : 1 }
{ "_id" : { "pair" : [ "pB", "pD" ] }, "value" : 1 }
{ "_id" : { "pair" : [ "pB", "pE" ] }, "value" : 1 }
{ "_id" : { "pair" : [ "pD", "pE" ] }, "value" : 1 }
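
To check these numbers without a running server, the same pairing and counting logic can be reproduced in plain JavaScript. This is only a local sketch, not how MongoDB executes mapReduce: the function name countPairs is made up, and the pair is joined with '&' as in the original SQL query rather than stored as an array key.

```javascript
// The example collection from above, as plain objects.
var docs = [
    { _id: "Job01", packages: ["pA", "pB", "pC"] },
    { _id: "Job02", packages: ["pA", "pC"] },
    { _id: "Job03", packages: ["pA", "pB", "pD", "pE"] }
];

// Count how many jobs contain each pair of packages.
function countPairs(docs) {
    var counts = {};
    docs.forEach(function (doc) {
        doc.packages.forEach(function (p1) {
            doc.packages.forEach(function (p2) {
                // Same condition as the map function: each pair once, smaller name first.
                if (p1 < p2) {
                    var key = p1 + "&" + p2;
                    // Same effect as the reduce function: add 1 per occurrence.
                    counts[key] = (counts[key] || 0) + 1;
                }
            });
        });
    });
    return counts;
}

var pairCounts = countPairs(docs);
console.log(pairCounts);
```

For the three example jobs this yields eight distinct pairs, with pA&pB and pA&pC counted twice and every other pair once, matching the contents of the pairs collection.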

For more information on mapReduce consult: http://docs.mongodb.org/manual/reference/method/db.collection.mapReduce/ and http://docs.mongodb.org/manual/applications/map-reduce/

You can't. Mongo doesn't do joins. Switching from SQL to Mongo is a lot more involved than migrating your queries.

Typically, you would include all the pertinent information in the same record (rather than normalize the information and select it with a join). Denormalize!

5 Comments

So what you are telling me is that there is no way to query a MongoDB to count the number of times 2 packages are used as part of the same job. Somehow I find that hard to believe.
That's not what I'm saying. I'm saying you would actually cache that value and save it with the job record in question. (I'm still a little fuzzy on whether you're calculating this for a specific job or for any job with two packages.)
For any job that runs multiple packages, I would like to store the count for each pair of packages used in that job. For example, if packageA and packageB are used concurrently in 20 jobs, the count for that pair should be 20. I would like a query that returns, for each pair of packages, the number of times the pair is used as part of the same job.
I am not sure who upvoted this, but it was definitely not a good answer: I know that a query like the one I am requesting is possible, and I was not migrating a database; this is a new database.
No, seriously. It's not. You would need multiple queries and application logic to get the results you want, unless you denormalize.
