First of all, this is my first time using Mongo...
Concept:
- A user is able to describe an image in natural language.
- Split the user's input into words and store each word in a collection called words.
- Users must be able to browse the most used words and add those words to their own description.
- The system will take the most used words (across all users) and use them to describe the image.
My words document currently looks like this (example):
{
"date": "date it was inserted",
"reported": 0,
"image_id": "image id",
"image_name": "image name",
"user": "user _id",
"word": "awesome"
}
Words will be duplicated, so that each occurrence of a word can be associated with a user...
Problem: I need a Mongo query that tells me the most used words (to describe an image), excluding the words entered by a given user (to meet point 3 above).
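For illustration, the query described above could be sketched as an aggregation pipeline. This is only a sketch under assumptions: it presumes a MongoDB version that has the aggregation framework, it reads "not created by a given user" as "words this user has not used for this image", and the function name buildTopWordsPipeline and the default limit of 300 are hypothetical. The field names (image_id, user, word) are the ones from the words document above.

```javascript
// Build an aggregation pipeline for: "most used words for one image
// that the given user has not used yet".
// buildTopWordsPipeline and the default limit are hypothetical names/values.
function buildTopWordsPipeline(imageId, userId, limit = 300) {
  return [
    // Only look at words describing this image.
    { $match: { image_id: imageId } },
    // One group per word: total number of uses, plus a 0/1 flag that
    // becomes 1 if any of those uses belongs to the given user.
    {
      $group: {
        _id: "$word",
        count: { $sum: 1 },
        usedByUser: { $max: { $cond: [{ $eq: ["$user", userId] }, 1, 0] } },
      },
    },
    // Drop the words this user has already used.
    { $match: { usedByUser: 0 } },
    // Most used first, then cap the number of results.
    { $sort: { count: -1 } },
    { $limit: limit },
  ];
}

// In the mongo shell this would presumably be run as:
//   db.words.aggregate(buildTopWordsPipeline("image id", "user _id", 300))
```

The pipeline is built as a plain array so it can be inspected without a running server. How it performs on millions of documents is not something this sketch can answer.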
I've looked at MapReduce, but from what I've read there are a few issues with it:
- Can't sort the results (I can't order them from most used to least used)
- With millions of documents, the processing time can be long.
- Can't limit the number of results returned
I've thought about running a scheduled task once a day that stores, in a document in a separate collection, the ranked list of words that a given user hasn't used to describe the given image. I would have to limit this to 300 results or so (any idea on a proper limit?). Something like:
{
"user_id": "the user id",
"words": [
{"word": "test", "count": 1000},
{"word": "test2", "count": 980},
{"word": "etc", "count": 300}
]
}
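To make the daily task idea concrete, here is a minimal sketch of how one such ranking document could be computed. It works on an in-memory array instead of a real collection so that it can be run anywhere. The function name buildRankDocument, the array field name words, and the sample data are all hypothetical.

```javascript
// Build the cached ranking document for one user and one image from an
// array of words documents (same fields as the words document above).
// buildRankDocument and the "words" field name are hypothetical.
function buildRankDocument(wordDocs, imageId, userId, limit = 300) {
  const counts = new Map();     // word -> number of uses for this image
  const usedByUser = new Set(); // words this user has already used
  for (const doc of wordDocs) {
    if (doc.image_id !== imageId) continue;
    counts.set(doc.word, (counts.get(doc.word) || 0) + 1);
    if (doc.user === userId) usedByUser.add(doc.word);
  }
  const words = [...counts.entries()]
    .filter(([word]) => !usedByUser.has(word)) // skip the user's own words
    .map(([word, count]) => ({ word, count }))
    .sort((a, b) => b.count - a.count)         // most used first
    .slice(0, limit);                          // cap the list size
  return { user_id: userId, words };
}

// Made-up sample data, for illustration only.
const sample = [
  { image_id: "img1", user: "u1", word: "awesome" },
  { image_id: "img1", user: "u2", word: "awesome" },
  { image_id: "img1", user: "u2", word: "sunset" },
  { image_id: "img1", user: "u3", word: "sunset" },
  { image_id: "img1", user: "u3", word: "beach" },
];

console.log(JSON.stringify(buildRankDocument(sample, "img1", "u1")));
// prints {"user_id":"u1","words":[{"word":"sunset","count":2},{"word":"beach","count":1}]}
```

Here "awesome" is left out for u1 because u1 already used it, even though it is one of the most used words overall.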
The problems I see with this solution are:
- Results would be delayed (by up to a day with a daily task), which is not desirable.
- Server load could spike while generating these documents for all users (I actually know very little about this in Mongo, so this is just an assumption)
Maybe my approach doesn't make any sense, and maybe my lack of experience with Mongo is pointing me toward the wrong schema design.
Any ideas on what would be a good approach for this kind of problem?
Sorry for the long post, and thanks for your time and help!
João