I have a set of 10,000 txt documents containing old Wikipedia articles. These articles were loaded into a MongoDB collection with a custom Java program.
My document for each article looks like this:
{
    "_id" : ObjectId("....."),
    "doc_id" : 335814,
    "terms" : [
        "2012", "2012", "adam", "knick", "basketball", ....
    ]
}
Now I want to count the occurrences of each word in the array, i.e. the so-called term frequency.
The resulting document should look like this:
{
    "doc_id" : 335814,
    "term_tf" : [
        { "term" : "2012", "tf" : 2 },
        { "term" : "adam", "tf" : 1 },
        { "term" : "knick", "tf" : 1 },
        { "term" : "basketball", "tf" : 1 },
        .....
    ]
}
But all I could achieve so far is something like this:
db.stemmedTerms.aggregate(
    [
        { $unwind: "$terms" },
        { $group: { _id: { id: "$doc_id", term: "$terms" }, tf: { $sum: 1 } } }
    ],
    { allowDiskUse: true }
);
{ "_id" : { "id" : 335814, "term" : "2012" }, "tf" : 2 }
{ "_id" : { "id" : 335814, "term" : "adam" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "knick" }, "tf" : 1 }
{ "_id" : { "id" : 335814, "term" : "basketball" }, "tf" : 1 }
But as you can see, this document structure doesn't fit my needs. I just want the doc_id once, followed by an array of all the terms with their respective term frequencies.
So I am looking for something that does the opposite of the $unwind operator.
Thanks for all your help.
Edit: It was suggested in the comments to add another $group to the pipeline and $push the terms back into an array (docs.mongodb.org/manual/reference/operator/aggregation/push). With the second $group the query fails with the following error message:

"BufBuilder attempted to grow() to 134217728 bytes, past the 64MB limit.", "code" : 13548

My aggregation pipeline statement is the following:

db.stemmedTerms.aggregate(
    [
        { $unwind: "$terms" },
        { $group: { _id: { id: "$doc_id", term: "$terms" }, tf: { $sum: 1 } } },
        { $group: { _id: "$id", term_tf: { $push: { term: "$term", tf: "$tf" } } } }
    ],
    { allowDiskUse: true }
);
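For reference, here is a sketch of the regrouping I am experimenting with. It assumes that after the first $group the only fields left are the compound _id and tf, so the second $group would have to reference $_id.id and $_id.term rather than $id and $term; the final $project just renames _id back to doc_id. This is untested against the full 10,000-document collection:

db.stemmedTerms.aggregate(
    [
        // one document per term occurrence
        { $unwind: "$terms" },
        // count every (doc_id, term) pair
        { $group: { _id: { id: "$doc_id", term: "$terms" }, tf: { $sum: 1 } } },
        // regroup by the original doc_id; the previous stage's fields live under _id,
        // hence $_id.id and $_id.term
        { $group: { _id: "$_id.id", term_tf: { $push: { term: "$_id.term", tf: "$tf" } } } },
        // optional: rename _id back to doc_id in the output
        { $project: { _id: 0, doc_id: "$_id", term_tf: 1 } }
    ],
    { allowDiskUse: true }
);

If $id resolves to nothing in my failing pipeline above, every intermediate document would fall into a single null group, which might explain why the result grows past the 64MB limit, but I have not verified this.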