
OK I am very new to Mongo, and I am already stuck.

The database has the following structure (much simplified):

{
    {
        "_id" : ObjectId("57fdfbc12dc30a46507044ec"),

        "keyterms" : [ 
            {
                "score" : "2",
                "value" : "AA",
            }, 
            {
                "score" : "2",
                "value" : "AA",
            }, 
            {
                "score" : "4",
                "value" : "BB",
            },
            {
                "score" : "3",
                "value" : "CC",
            }
        ]
    },

    {
        "_id" : ObjectId("57fdfbc12dc30a46507044ef"),

        "keyterms" : [ 
        ...

There are several documents. Each document has an array "keyterms", and each entry in that array has a score and a value. There are some duplicates, though (not really, since in the real database the keyterms entries have many more fields, but as far as value and score are concerned they are duplicates).

Now I need a query, which

  • selects one object by id
  • groups its keyterms by value
  • counts the duplicates
  • sorts them by score

So I want to get something like this as the result:

// for Object 57fdfbc12dc30a46507044ec
"keyterms": [
    {
        "score" : "4",
        "value" : "BB",
        "count" : 1
    },
    {
        "score" : "3",
        "value" : "CC",
        "count" : 1
    },
    {
        "score" : "2",
        "value" : "AA",
        "count" : 2
    }
]

In SQL I would have written something like this

select 
    score, value, count(*) as count
from
    all_keywords_table_or_some_join
group by
    value, score
order by
    score desc

But, sadly enough, it's not SQL.

In Mongo I managed to write this:

db.getCollection('tests').aggregate([
    {$match: {'_id': ObjectId('57fdfbc12dc30a46507044ec')}},
    {$unwind: "$keyterms"}, 
    {$sort: {"keyterms.score": -1}}, 
    {$group: {
        '_id': "$_id", 
        'keyterms': {$push: "$keyterms"}
    }},
    {$project: {
        'keyterms.score': 1,
        'keyterms.value': 1
    }}
])

But something is missing: the grouping of the keyterms by their value. I can't shake the feeling that this is the wrong approach altogether. How can I select the keyterms array, continue with only that, and apply an aggregate function to it? That would be easy.

BTW, I read this (Mongo aggregate nested array), but unfortunately I can't work out how to apply it to my example.

2 Answers


You'd want an aggregation pipeline where after you $unwind the array, you group the flattened documents by the array's value and score keys, aggregate the counts using the $sum accumulator operator and retain the main document's _id with the $first operator.

After sorting the grouped documents by score, the final pipeline stage should group them by the original document's _id so as to preserve the original schema, recreating the keyterms array with the $push operator.

The following demonstration attempts to explain the above aggregation operation:

db.tests.aggregate([
    { "$match": { "_id": ObjectId("57fdfbc12dc30a46507044ec") } },
    { "$unwind": "$keyterms" },
    {
        "$group": {
            "_id": {
                "value": "$keyterms.value",
                "score": "$keyterms.score"
            },
            "doc_id": { "$first": "$_id" },
            "count": { "$sum": 1 }
        }
    },
    { "$sort": {"_id.score": -1 } },
    {
        "$group": {
            "_id": "$doc_id",
            "keyterms": {
                "$push": {
                    "value": "$_id.value",
                    "score": "$_id.score",
                    "count": "$count"
                }
            }
        }
    }
])

Sample Output

{
    "_id" : ObjectId("57fdfbc12dc30a46507044ec"),
    "keyterms" : [ 
        {
            "value" : "BB",
            "score" : "4",
            "count" : 1
        }, 
        {
            "value" : "CC",
            "score" : "3",
            "count" : 1
        }, 
        {
            "value" : "AA",
            "score" : "2",
            "count" : 2
        }
    ]
}
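
To see what the stages are doing without a running server, here is a rough in-memory sketch in plain JavaScript. It is not how MongoDB executes the pipeline; the function name groupKeyterms and the plain-string _id are made up for illustration, and the sample data is copied from the question.

```javascript
// Hypothetical in-memory copy of the sample document (the _id is a plain string here).
const doc = {
  _id: "57fdfbc12dc30a46507044ec",
  keyterms: [
    { score: "2", value: "AA" },
    { score: "2", value: "AA" },
    { score: "4", value: "BB" },
    { score: "3", value: "CC" },
  ],
};

// Mimics $unwind, the first $group, the $sort and the final $group/$push.
function groupKeyterms(document) {
  const counts = new Map();
  for (const { value, score } of document.keyterms) {
    // Compound key, playing the role of _id: { value, score } in the first $group.
    const key = JSON.stringify([value, score]);
    const entry = counts.get(key) || { value, score, count: 0 };
    entry.count += 1; // same effect as { $sum: 1 }
    counts.set(key, entry);
  }
  // Descending by score. Scores are strings, so they are compared as strings.
  const keyterms = [...counts.values()].sort((a, b) =>
    a.score < b.score ? 1 : a.score > b.score ? -1 : 0
  );
  return { _id: document._id, keyterms };
}

const result = groupKeyterms(doc);
console.log(JSON.stringify(result, null, 2));
```

For the sample document this yields BB (count 1), CC (count 1) and AA (count 2), in that order, matching the output shown above.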


3 Comments

I tested your solution again and realized it does not do the right thing: there are still two entries for the keyterm "AA" in the result!
@Paflow I don't see the duplicates as I've tested the pipeline. The initial $group pipeline step handles this, I'm failing to see where you are getting two entries for "AA". Perhaps you could update your question with some sample documents to verify that with?
Sorry, my mistake. I made an error when converting it back to my real document format. Thank you anyway.

Meanwhile, I solved it myself:

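db.getCollection('tests').aggregate([
        {$match: {'_id': ObjectId('57fdfbc12dc30a46507044ec')}},
        {$unwind: "$keyterms"},
        // Sort first so that $first below picks the highest score per value
        {$sort: {"keyterms.score": -1}},
        {$group: {
            '_id': "$keyterms.value",
            'keyterms': {$push: "$keyterms"},
            'escore': {$first: "$keyterms.score"},
            'evalue': {$first: "$keyterms.value"}
        }},
        // $group does not guarantee the order of its output, so sort the groups again
        {$sort: {"escore": -1}},
        {$limit: 15},
        {$project: {
          "score": "$escore",
          "value": "$evalue",
          "count": {$size: "$keyterms"}
        }}
])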
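
One caveat worth checking against your real data: in the sample document the scores are stored as strings ("2", "4"), so sorting on them compares strings, not numbers. With single-digit scores that makes no difference, but a two-digit score would end up in the wrong place. The plain JavaScript snippet below only illustrates the effect; the score "10" is a made-up value that does not appear in the question.

```javascript
// Scores stored as strings, as in the sample document, plus a hypothetical two-digit one.
const scores = ["2", "10", "4", "3"];

// String comparison, descending: "10" sorts below "2" because "1" < "2".
const asStrings = [...scores].sort().reverse();

// Numeric comparison, descending, after converting each score to a number.
const asNumbers = [...scores].sort((a, b) => Number(b) - Number(a));

console.log(asStrings); // [ '4', '3', '2', '10' ]
console.log(asNumbers); // [ '10', '4', '3', '2' ]
```

If the real scores can exceed one digit, storing them as numbers, or converting them before the sort (for instance with $toInt, assuming MongoDB 4.0 or later), would be one way to get a numeric order.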
