db.runCommand({
    'mapreduce': 'rtb_ymantnt_xyz',
    'map': 'function() { var band_size = 0.5; var cscsm_id = this.cscsm_id; var site_id = this.entity_id; var tld = this.domain_id;
    this.xyz.forEach(function(h) {
        var converted_sum_cpm = h.sum_cpm;
        var bin_start = Math.floor(converted_sum_cpm / h.bin_volume / band_size) * band_size;
        emit({
            site_id: site_id,
            tld: tld,
            bin: bin_start
        }, {
            sum_cpm: h.sum_cpm,
            bin_volume: h.bin_volume,
            band_size: band_size
        })
    })
    }
    ',
    'reduce': 'function(key, values) { var result = { sum_cpm: 0, bin_volume: 0, band_size: 0 };
    values.forEach(function(value) {
    result.sum_cpm += value.sum_cpm;
    result.bin_volume += value.bin_volume;
    result.band_size = value.band_size;
    });
    return result;
}
',
'verbose': true,
'query': {
    'entity_reference_id': 43568,
    'date': {
    '$gte': '2015-06-15',
    '$lte': '2015-06-15'
    },
    'entity_type': 1,
    'domain_id': {
    '$ne': -1
    },
    'cscsm_id': {
    '$ne': -1
    },
    'bid_type': 1
},
'out': {
    'replace': 'xyz_debug_anmsyo'
}
})

db.xyz_debug_anmsyo.find().forEach(function(x) {
    print(x._id.site_id + "," + x._id.tld + "," + x._id.bin + "," + x.value.bin_volume + "," + x.value.sum_cpm)
})

Can someone please suggest how to optimise this Mongo query?

1 Answer

You can start off by using the .aggregate() method instead of the current mapReduce. The aggregation pipeline runs in native code and does not carry the overhead of JavaScript translation, nor of converting each document into a JavaScript object.

It's not like you are doing anything special here, just a query condition then processing each element of an array within each document to group on common keys and sum up values. For this there is a direct translation:

    db.rtb_ymantnt_xyz.aggregate([
        { "$match": {
            "entity_reference_id": 43568,
            "date": "2015-06-15",
            "entity_type": 1,
            "domain_id": { "$ne": -1 },
            "cscsm_id": { "$ne": -1  },
            "bid_type": 1
        }},
        { "$unwind": "$xyz" },
        { "$group": {
            "_id": {
                "site_id": "$entity_id",
                "tld": "$domain_id",
                "bin": {
                    "$multiply": [
                        { "$subtract": [
                            { "$divide": [
                                "$xyz.sum_cpm",
                                "$xyz.bin_volume",
                                0.5
                            ]},
                            { "$mod": [
                                { "$divide": [
                                    "$xyz.sum_cpm",
                                    "$xyz.bin_volume",
                                    0.5
                                ]},
                                1
                            ]}
                        ]},
                        0.5
                    ]
                }
            },
            "sum_cpm": { "$sum": "$xyz.sum_cpm" },
            "bin_volume": { "$sum": "$xyz.bin_volume" },
            "band_size": { "$last": { "$literal": 0.5 } }
        }}
    ])
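The $subtract/$mod pair in the "bin" expression emulates Math.floor(), since the aggregation framework had no dedicated floor operator at the time ($floor arrived later, in MongoDB 3.2). A quick plain-JavaScript sketch of why the two forms agree (the sample values below are made up):

```javascript
// Check that the $subtract/$mod construction in the pipeline reproduces the
// map function's Math.floor() binning. Sample inputs are illustrative only.
var band_size = 0.5;

function binViaFloor(sum_cpm, bin_volume) {
    // What the original map function computes per array element.
    return Math.floor(sum_cpm / bin_volume / band_size) * band_size;
}

function binViaMod(sum_cpm, bin_volume) {
    // floor(x) == x - (x % 1) for non-negative x, which is exactly what
    // the $subtract / $mod pair computes before the final $multiply.
    var x = sum_cpm / bin_volume / band_size;
    return (x - (x % 1)) * band_size;
}
```

Note this equivalence only holds for non-negative inputs, which should be fine for CPM and volume figures.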

Also note that it makes little sense to use a "range" expression on your "string date" when you are only looking at a single date. Consider converting to a BSON Date though, as it is more flexible and generally stores more compactly than a string.
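As a sketch of what the filter might look like after converting "date" to a BSON Date (this assumes each document's date marks midnight UTC of its day; a half-open range [day, nextDay) then matches exactly one day):

```javascript
// Same equality filters as the question, with "date" as a BSON Date.
var day = new Date("2015-06-15T00:00:00Z");
var nextDay = new Date(day.getTime() + 24 * 60 * 60 * 1000);

var match = {
    "entity_reference_id": 43568,
    "date": { "$gte": day, "$lt": nextDay },  // half-open: one whole day
    "entity_type": 1,
    "domain_id": { "$ne": -1 },
    "cscsm_id": { "$ne": -1 },
    "bid_type": 1
};
```

The half-open range also keeps working unchanged if you later store finer-grained timestamps within the day.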

Aside from that, add indexes, mostly on the fields where you specify "exact values":

            "entity_reference_id": 43568,
            "date": "2015-06-15",
            "entity_type": 1,
            "bid_type": 1

The ordering of fields is quite important: you should generally list first the fields that will reduce the possible matches the most, and it is also good to write your query conditions in that order. So hopefully a field like entity_reference_id, and then logically the date, will filter down the results the most and should be indexed in that order. Other fields are optional if they are not the major filter, but they still help.

Indexes speed up queries, especially when the field order complements the filtering process. Naturally there is an additional cost on writes as well as a storage cost, but if you want faster queries and want to take some load off the engine, you should have them.
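Following that reasoning, a compound index might look like the sketch below. The field order here is an assumption based on the selectivity argument above; verify it against your own data with .explain():

```javascript
// Compound index covering the equality filters, with the (assumed) most
// selective field first, matching the suggested query-condition order.
var indexSpec = {
    "entity_reference_id": 1,
    "date": 1,
    "entity_type": 1,
    "bid_type": 1
};

// In the mongo shell, you would then run:
// db.rtb_ymantnt_xyz.createIndex(indexSpec)
```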

If you must, there is also the $out aggregation pipeline stage to write to a collection instead. But just as with mapReduce, you should not really do this unless your output is particularly large. And don't forget that in modern MongoDB, .aggregate() can return a cursor, which, unlike "inline" mapReduce results, can be iterated to avoid loading the whole result set into memory.
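One thing to watch when you drop the intermediate collection: the aggregation output shape differs from mapReduce. There is no nested "value" field; sum_cpm and bin_volume sit at the top level of each result document. A sketch of reproducing the original debug print (field names follow the $group stage above):

```javascript
// Formats one aggregation result document as the same CSV line the original
// mapReduce debug loop printed. Note sum_cpm and bin_volume are top-level
// here, not under a "value" sub-document as in mapReduce output.
function csvLine(x) {
    return [x._id.site_id, x._id.tld, x._id.bin, x.bin_volume, x.sum_cpm].join(",");
}

// In the mongo shell, iterating the cursor directly:
// db.rtb_ymantnt_xyz.aggregate(pipeline).forEach(function(x) { print(csvLine(x)); });
```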
