Mongo DB MapReduce in PHP

Question

First of all, this is my first time using Mongo...

Concept:

1. A user can describe an image in natural language.
2. The user input is split into words, which are stored in a collection called words.
3. Users must be able to browse the most used words and add those words to their own description.
4. The system takes the most used words (across all users) and uses them to describe the image.

My words document currently looks like this (example):

{
"date": "date it was inserted",
"reported": 0,
"image_id": "image id",
"image_name": "image name",
"user": "user _id",
"word": "awesome"
}

The words will be duplicated so that each word can be associated with a user...

Problem: I need a Mongo query that tells me the most used words (to describe an image) that were not created by a given user (to meet point 3 above).

I've looked at the MapReduce algorithm, but from what I've read there are a couple of issues with it:

Can't sort results (I need to order from most used to least used)
With millions of documents it can take a long time to process.
Can't limit the number of results returned

I've thought about running a task at a given time each day that stores, in a document in a different collection, the ranked list of words that a given user hasn't used to describe the given image. I would have to limit this to 300 results or so (any idea of a proper limit?). Something like:

{
user_id: "the user id",
words: [
{word: "test", count: 1000},
{word: "test2", count: 980},
{word: "etc", count: 300}
]
}

Problems I see with this solution are:

Results would be noticeably out of date, which is not desirable.
Server load could spike while generating these documents for all users (I actually know very little about this in Mongo, so this is just an assumption)

Maybe my approach doesn't make any sense... And maybe my lack of experience in Mongo is pointing me at the wrong "schema design".

Any idea of what could be a good approach for this kind of problem?

Sorry for the big post and thanks for your time and help!

João

golja · Accepted Answer · 2012-06-25 06:39:43Z

3

As already mentioned, you could use the group command, which is easy to use, but you will need to sort the result on the client side. Also, the result is returned as a single BSON object, so it must be fairly small (fewer than 10,000 keys), or you will get an exception.

Code example based on your data structure:

db.words.group({
    key : {"word" : true},
    initial: {count : 0},
    reduce: function(obj, prev) { prev.count++},
    cond: {"user" :{ $ne : "USERNAME_TO_IGNORE"}}
})
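
Since group does not sort, the ordering and limiting have to happen on the client. Here is a minimal sketch of that step in plain JavaScript, assuming the result is an array of {word, count} objects as produced by the group call above. The helper name topWords and the sample data are made up for illustration; the same logic would apply in PHP with usort and array_slice.

```javascript
// Hypothetical helper: sorts the array returned by db.words.group(...)
// on the client and keeps only the `limit` most used words.
function topWords(groupResult, limit) {
  return groupResult
    .slice() // copy, so the original result is left untouched
    .sort(function (a, b) { return b.count - a.count; }) // most used first
    .slice(0, limit);
}

// Made-up sample data shaped like the group() output: one object per key.
var sample = [
  { word: "nice", count: 3 },
  { word: "awesome", count: 12 },
  { word: "blurry", count: 7 }
];

var top2 = topWords(sample, 2);
console.log(top2);
```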

Another option is to use the new Aggregation framework, which will be released in version 2.2. Something like this should work:

db.words.aggregate([
   // each pipeline stage is its own document
   { $match : { "user" : { "$ne" : "USERNAME_TO_IGNORE"} } },
   { $group : {
       _id : "$word",
       count: { $sum : 1}
   } }
])
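
With the aggregation framework, sorting and limiting can also be done on the server by appending $sort and $limit stages. A minimal sketch, assuming the pipeline is passed as an array with one document per stage; the function name buildTopWordsPipeline is hypothetical, and the block only builds and prints the pipeline so it can be run without a database.

```javascript
// Hypothetical helper that builds the pipeline as plain data.
// Each stage is a separate document in the array.
function buildTopWordsPipeline(userToIgnore, limit) {
  return [
    { $match: { user: { $ne: userToIgnore } } },       // skip this user's words
    { $group: { _id: "$word", count: { $sum: 1 } } },  // count per word
    { $sort: { count: -1 } },                          // most used first
    { $limit: limit }                                  // keep the top N
  ];
}

var pipeline = buildTopWordsPipeline("USERNAME_TO_IGNORE", 10);

// In the mongo shell this would be passed as: db.words.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```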

Or you can still use MapReduce. You can in fact limit and sort the output, because the result is a collection. Just use .sort() and .limit() on the output. You can also use the incremental map-reduce output option, which will help with your performance concerns. Have a look at the out parameter in MapReduce.

Below is an example that uses the incremental feature to merge new data into an existing words_usage collection:

m = function() { 
   emit(this.word, {count: 1}); 
};


r = function( key , values ){
     var sum = 0;
     values.forEach(function(doc) {
          sum += doc.count;
     });
     return {count: sum};
 };

db.runCommand({
    mapreduce : "words", 
    map : m,
    reduce : r,
    out : { reduce: "words_usage"},
    query : <query filter object>
})

// retrieve the top 10 words
db.words_usage.find().sort({"value.count" : -1}).limit(10)
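
To make the out : { reduce: ... } behaviour more concrete: when a key already exists in the output collection, the reduce function is applied again to the stored value and the new value. The sketch below imitates that merge in memory with plain objects. The function mergeIntoUsage and the sample data are made up for illustration and are not part of MongoDB.

```javascript
// Same reduce function as in the answer.
function r(key, values) {
  var sum = 0;
  values.forEach(function (doc) { sum += doc.count; });
  return { count: sum };
}

// Hypothetical in-memory stand-in for out: { reduce: "words_usage" }.
// `usage` maps word -> stored value, `batch` maps word -> newly emitted values.
function mergeIntoUsage(usage, batch) {
  Object.keys(batch).forEach(function (word) {
    var fresh = r(word, batch[word]);
    var exists = Object.prototype.hasOwnProperty.call(usage, word);
    // Existing key: re-reduce the old and new values. New key: just store it.
    usage[word] = exists ? r(word, [usage[word], fresh]) : fresh;
  });
  return usage;
}

var usage = { awesome: { count: 10 } };            // result of an earlier run
var batch = {                                       // values emitted by a new run
  awesome: [{ count: 1 }, { count: 1 }],
  blurry: [{ count: 1 }]
};

mergeIntoUsage(usage, batch);
console.log(usage);
```

This only works because the reduce function returns a value of the same shape as the values it receives, so its output can be fed back in as input.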

I guess you can run the above MapReduce command from cron every few minutes or hours, depending on how up to date you need the results to be. For the query criteria of each incremental run, you can use the creation date of the words documents.
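
As a rough sketch of what such a scheduled run could pass to db.runCommand, the helper below builds the command with a query that only picks up words inserted since the previous run. It assumes the date field holds a real Date value (in the question it is only described as "date it was inserted") and that the time of the last run is stored somewhere by the caller; buildIncrementalCommand is a hypothetical name.

```javascript
// Hypothetical helper: builds the mapreduce command for one incremental run.
// Assumes `date` in the words documents is a Date, not a string.
function buildIncrementalCommand(map, reduce, lastRun) {
  return {
    mapreduce: "words",
    map: map,
    reduce: reduce,
    out: { reduce: "words_usage" },     // merge into the existing results
    query: { date: { $gt: lastRun } }   // only documents newer than the last run
  };
}

// Map and reduce functions as in the answer.
var m = function () { emit(this.word, { count: 1 }); };
var r = function (key, values) {
  var sum = 0;
  values.forEach(function (doc) { sum += doc.count; });
  return { count: sum };
};

var lastRun = new Date("2012-06-25T00:00:00Z"); // made-up timestamp
var cmd = buildIncrementalCommand(m, r, lastRun);

// In the mongo shell: db.runCommand(cmd)
console.log(cmd.query);
```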

Once you have the system top words collection you can build per user top words or just compute them in real time (depends on the system size).
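
One possible way to do the real-time variant is to take the system-wide ranking and drop the words the user has already used. The sketch below does that in memory, assuming documents shaped like the MapReduce output ({_id: word, value: {count: n}}) and a separately fetched list of the user's own words; the function name and sample data are invented for illustration. Note that the counts here still include the user's own uses of other words, which differs slightly from filtering with $ne on user before counting.

```javascript
// Hypothetical helper: per-user suggestions from the system-wide ranking.
function topWordsForUser(usageDocs, wordsUsedByUser, limit) {
  var used = Object.create(null); // no inherited keys such as "constructor"
  wordsUsedByUser.forEach(function (w) { used[w] = true; });

  return usageDocs
    .filter(function (doc) { return !used[doc._id]; })               // drop the user's words
    .sort(function (a, b) { return b.value.count - a.value.count; }) // most used first
    .slice(0, limit)
    .map(function (doc) { return { word: doc._id, count: doc.value.count }; });
}

// Made-up documents shaped like the words_usage collection.
var usageDocs = [
  { _id: "awesome", value: { count: 12 } },
  { _id: "blurry", value: { count: 7 } },
  { _id: "nice", value: { count: 3 } },
  { _id: "dark", value: { count: 9 } }
];

var suggestions = topWordsForUser(usageDocs, ["awesome"], 2);
console.log(suggestions);
```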


3 Comments

jribeiro Over a year ago

This is quite an answer! Really looking forward to version 2.2, then. Meanwhile I'll try out your suggestions! Really appreciated.

jribeiro Over a year ago

Sorry, but is the limit on MapReduce applied after everything is calculated, or will it stop the calculation at 10 documents? Also, performance-wise, what would you recommend, considering that I'll be sorting the results in PHP (in the group case at least)?

golja Over a year ago

The limit is applied after MapReduce has run, whenever you run a find/sort on the output collection. As I said, you can reuse the results whenever you want. When and how often you update the collection is up to you. Performance-wise, you should be all right sorting on the PHP side. If that becomes a problem in the future, just use some kind of buffering.

matt3141 · Accepted Answer · 2012-06-25 00:05:00Z

1

The group function is supposed to be a simpler version of MapReduce. You could use it like this to get a sum for each word:

db.coll.group({
    key: { a: true, b: true },
    cond: { active: 1 },
    reduce: function(obj, prev) { prev.csum += obj.c; },
    initial: { csum: 0 }
});


1 Comment

matt3141 Over a year ago

yeah sorting would have to happen on the client side
