How to count the occurrences of array elements in MongoDB?

Question

I have a set of 10,000 text documents containing old Wikipedia articles. These articles were loaded into a MongoDB collection with a custom Java program.

My document for each article looks like this:

{ 
"_id" : ObjectID("....."),
"doc_id" : 335814,
"terms" : 
    [
          "2012", "2012", "adam", "knick", "basketball", ....
    ]
}

Now I want to calculate the number of occurrences of each word in the array, the so-called term frequency.

The resulting document should look like this:

{
"doc_id" : 335814,
"term_tf": [
      {term: "2012", tf: 2},
      {term: "adam", tf: 1},
      {term: "knick", tf: 1},
      {term: "basketball", tf: 1},
      .....
      ]
}

So far, all I have managed to get is something like this:

db.stemmedTerms.aggregate([
    {$unwind: "$terms" },
    {$group: {
        _id: {id: "$doc_id", term: "$terms"},
        tf: {$sum : 1}
    }}
],
{ allowDiskUse: true });

{ "_id" : { "id" : 335814, "term" : "2012" }, "tf" : 2 }
{ "_id" : { "id" : 335814, "term" : "adam" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "knick" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "basketball" }, "tf" : 1 }

But as you can see, the document structure doesn't fit my needs. I want the doc_id to appear only once, followed by an array of all the terms with their respective term frequencies.

So I am looking for something that does the opposite of the $unwind operator.

Thanks for all your help.

You just need another $group in the pipeline to push the terms back into an array: docs.mongodb.org/manual/reference/operator/aggregation/push
– Alex Blex, Commented Jan 26, 2016 at 14:30
When I try to add another $group, the query fails with the following error message: "BufBuilder attempted to grow() to 134217728 bytes, past the 64MB limit.", "code" : 13548. My aggregation pipeline statement is the following: db.stemmedTerms.aggregate([{$unwind: "$terms" }, {$group: {_id: {id: "$doc_id", term: "$terms"}, tf: {$sum : 1}}}, {$group: {_id: "$id", term_tf: {$push: {term: "$term", tf: "$tf"}}}}], {allowDiskUse:true});
– s1m0on, Commented Jan 29, 2016 at 13:24
Comments are not the best place for code snippets. Basically, aggregate cannot return more than 64MB, so you need to write the result to a collection using $out: docs.mongodb.org/manual/reference/operator/aggregation/out. See my answer below.
– Alex Blex, Commented Jan 29, 2016 at 13:52

Alex Blex · Accepted Answer · 2016-01-29 13:50:35Z

5

With a second $group and an $out stage, your pipeline should look like this:

db.stemmedTerms.aggregate([
    {$unwind: "$terms" }, 
    // count
    {$group: {
        _id: {id: "$doc_id", term: "$terms"},  
        tf: {$sum : 1}  
    }},
    // build array
    {$group: {
        _id: "$_id.id",  
        term_tf: {$push:  { term: "$_id.term", tf: "$tf" }}
    }},
    // write to new collection
    { $out : "occurences" }     
], 
{ allowDiskUse: true});
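
To make the three stages easier to follow, here is a minimal sketch that emulates them in plain JavaScript on an in-memory array, with no MongoDB connection. The function name `termFrequencies` and the sample data are hypothetical and only serve as illustration. It is not how the server executes the pipeline, just the same counting logic written out by hand.

```javascript
// Plain-JavaScript emulation of the $unwind -> $group -> $group pipeline.
// `termFrequencies` is a hypothetical helper, not part of any MongoDB API.
function termFrequencies(docs) {
  // $unwind + first $group: count each (doc_id, term) pair.
  const counts = new Map(); // doc_id -> Map(term -> tf)
  for (const doc of docs) {
    if (!counts.has(doc.doc_id)) counts.set(doc.doc_id, new Map());
    const perDoc = counts.get(doc.doc_id);
    for (const term of doc.terms) {
      perDoc.set(term, (perDoc.get(term) || 0) + 1);
    }
  }
  // Second $group with $push: one output document per doc_id,
  // keyed by _id just like the pipeline's result.
  return Array.from(counts, ([docId, perDoc]) => ({
    _id: docId,
    term_tf: Array.from(perDoc, ([term, tf]) => ({ term, tf })),
  }));
}

// Sample data modelled on the document shown in the question.
const docs = [
  { doc_id: 335814, terms: ["2012", "2012", "adam", "knick", "basketball"] },
];
console.log(JSON.stringify(termFrequencies(docs), null, 2));
```

Two details are worth noting. After the first $group, the document id and the term live under `_id`, which is why the second stage refers to them as `$_id.id` and `$_id.term` rather than `$id` and `$term`. Also, the resulting documents carry the id in `_id` rather than in `doc_id`, and unlike this sketch, $group does not guarantee any particular order for the pushed array elements.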

1 Comment

s1m0on Over a year ago

Nice, that works great. So I had forgotten the $out stage to write the result into a collection. Thanks for your help :)
