I have a set of 10,000 .txt documents containing old Wikipedia articles. These articles were loaded into a MongoDB collection with a custom Java program.

My document for each article looks like this:

{
    "_id" : ObjectId("....."),
    "doc_id" : 335814,
    "terms" : [
        "2012", "2012", "adam", "knick", "basketball", ....
    ]
}

Now I want to count the occurrences of each word in the array, the so-called term frequency.

The resulting document should look like this:

{
"doc_id" : 335814,
"term_tf": [
      {term: "2012", tf: 2},
      {term: "adam", tf: 1},
      {term: "knick", tf: 1},
      {term: "basketball", tf: 1},
      .....
      ]
}

So far, all I could achieve is this:

db.stemmedTerms.aggregate([{$unwind: "$terms" }, {$group: {_id: {id: "$doc_id", term: "$terms"},  tf: {$sum : 1}}}], { allowDiskUse:true } );

{ "_id" : { "id" : 335814, "term" : "2012" }, "tf" : 2 }
{ "_id" : { "id" : 335814, "term" : "adam" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "knick" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "basketball" }, "tf" : 1 }

But as you can see, the document structure doesn't fit my needs. I want each doc_id to appear only once, together with an array of all its terms and their respective term frequencies.

So I am looking for something that does the opposite of the $unwind operator.

Thanks for all your help.

  • You just need another $group in the pipeline to push the terms back into an array: docs.mongodb.org/manual/reference/operator/aggregation/push Commented Jan 26, 2016 at 14:30
  • When I try to add another $group, the query fails with the following error message: "BufBuilder attempted to grow() to 134217728 bytes, past the 64MB limit.", "code" : 13548. My aggregation pipeline statement is the following: db.stemmedTerms.aggregate([{$unwind: "$terms" }, {$group: {_id: {id: "$doc_id", term: "$terms"}, tf: {$sum : 1}}}, {$group: {_id: "$id", term_tf: {$push: {term: "$term", tf: "$tf"}}}}], {allowDiskUse:true}); Commented Jan 29, 2016 at 13:24
  • Comments are not the best place for code snippets. Basically, aggregate cannot return more than 64MB, so you need to write the result to a collection using $out: docs.mongodb.org/manual/reference/operator/aggregation/out See my answer below. Commented Jan 29, 2016 at 13:52

1 Answer

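With a second $group and $out, your pipeline should look like the following. Note that after the first $group the document id and the term are nested inside _id, so the second stage has to reference them as "$_id.id" and "$_id.term":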

db.stemmedTerms.aggregate([
    {$unwind: "$terms" }, 
    // count
    {$group: {
        _id: {id: "$doc_id", term: "$terms"},  
        tf: {$sum : 1}  
    }},
    // build array
    {$group: {
        _id: "$_id.id",  
        term_tf: {$push:  { term: "$_id.term", tf: "$tf" }}
    }},
    // write to new collection
    { $out : "occurences" }     
], 
{ allowDiskUse: true});
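
To make the shape of the result easier to check, here is a minimal sketch in plain JavaScript (no MongoDB needed) of what the count and build-array stages compute for a single document. The helper name termFrequencies and the sample document are assumptions made for illustration only. Two differences from the pipeline are worth keeping in mind: the pipeline stores the document id in _id rather than doc_id, and the order of the entries it pushes into term_tf is not guaranteed, whereas this sketch keeps first-occurrence order.

```javascript
// Hypothetical helper: mirrors the "count" and "build array" stages
// for one document, entirely in memory.
function termFrequencies(doc) {
  // "count": tally how often each term occurs in the terms array
  const counts = new Map();
  for (const term of doc.terms) {
    counts.set(term, (counts.get(term) || 0) + 1);
  }

  // "build array": one { term, tf } entry per distinct term
  const term_tf = [];
  for (const [term, tf] of counts) {
    term_tf.push({ term: term, tf: tf });
  }

  return { doc_id: doc.doc_id, term_tf: term_tf };
}

// Sample document, shortened from the one in the question
const sample = {
  doc_id: 335814,
  terms: ["2012", "2012", "adam", "knick", "basketball"]
};

const result = termFrequencies(sample);
console.log(JSON.stringify(result));
// prints {"doc_id":335814,"term_tf":[{"term":"2012","tf":2},{"term":"adam","tf":1},{"term":"knick","tf":1},{"term":"basketball","tf":1}]}
```

Comparing this output against a few documents in the collection written by $out is a quick way to confirm that the pipeline produced the structure you wanted.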

1 Comment

Nice, that works great. So I forgot the $out to write the result into a collection. Thanks for your help :)
