5

TL;DR: I'm struggling to construct a query that

  1. Aggregates a count of the values of a certain key ("original_text_source"), where
  2. That key is in a sub-document that is itself in an array

Full description

I have embedded documents with arrays that are structured like this:

{
    "_id" : ObjectId("0123456789"),
    "type" : "some_object",
    "relationships" : {
        "x" : [ ObjectId("0123456789") ],
        "y" : [ ObjectId("0123456789") ],
    },
    "properties" : [ 
        {
            "a" : "1"
        }, 
        {
            "b" : "1"
        }, 
        {
            "original_text_source" : "foo.txt"
        },
    ]
}

The docs were created from exactly 10k text files, sorted into various folders. While inserting the documents into MongoDB (in batches), I messed up and moved a few files around, causing one file to be imported twice (my database has a count of exactly 10001 docs), but I don't know which one it is. Since one of the "original_text_source" values must have a count of 2, I was planning on just deleting one of the two documents.

I read up on solutions with $elemMatch, but since my array element is a document, I'm not sure how to proceed. Maybe with mapReduce? But I can't transfer the logic to my doc structure.

I could also just create a new collection and re-upload everything, but in case I mess up again, I'd rather learn how to query for duplicates. It seems more elegant :-)

3 Answers

4

You can find duplicates with a simple aggregation like this:

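db.collection.aggregate([
  // Group by the source file name(s) found in the "properties" array
  { $group: { _id: "$properties.original_text_source", docIds: { $push: "$_id" }, docCount: { $sum: 1 } } },
  // Keep only the groups that contain more than one document
  { $match: { "docCount": { $gt: 1 } } }
])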

which gives you something like this:

{
    "_id" : [ 
        "foo.txt"
    ],
    "docIds" : [ 
        ObjectId("59d6323613940a78ba1d5ffa"), 
        ObjectId("59d6324213940a78ba1d5ffc")
    ],
    "docCount" : 2.0
}
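
The "_id" in this result is an array because the path "$properties.original_text_source" runs through the "properties" array and collects the value from every element that has that key. If a plain string key is preferred, one option is to unwind the array first. The following is a minimal sketch of that variant, not a tested drop-in: the names duplicateSourcesPipeline and findDuplicateSources are made up for illustration, and coll is assumed to be a mongosh collection handle such as db.collection.

```javascript
// Hypothetical name; the stages themselves are standard aggregation operators.
const duplicateSourcesPipeline = [
  // Emit one document per element of the "properties" array
  { $unwind: "$properties" },
  // Keep only the elements that carry the key we care about
  { $match: { "properties.original_text_source": { $exists: true } } },
  // Group on the plain string value and collect the owning document ids
  { $group: {
      _id: "$properties.original_text_source",
      docIds: { $push: "$_id" },
      docCount: { $sum: 1 }
  } },
  // Only values that occur more than once are duplicates
  { $match: { docCount: { $gt: 1 } } }
];

// Assumption: `coll` is a mongosh collection handle, e.g. db.collection
function findDuplicateSources(coll) {
  return coll.aggregate(duplicateSourcesPipeline).toArray();
}
```

With this variant the group key would be the string "foo.txt" rather than [ "foo.txt" ]. Note that it counts array elements, so a single document that lists the same file twice in its own "properties" array would also be reported.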

1 Comment

Thank you! Works like a charm :-)
1

Run the following:

db.collection.aggregate([
  { $group: {
    _id: { name: "$properties.original_text_source" },
    idsForDuplicatedDocs: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} }
]);

Given a collection which contains two copies of the document you showed in your question, the above command will return:

{
    "_id" : {
        "name" : [ 
            "foo.txt"
        ]
    },
    "idsForDuplicatedDocs" : [ 
        ObjectId("59d631d2c26584cd8b7b3337"), 
        ObjectId("59d631cbc26584cd8b7b3333")
    ],
    "count" : 2
}

Where:

  • The attribute _id.name is the value of the duplicated properties.original_text_source
  • The attribute idsForDuplicatedDocs contains the _id values for each of the documents which have a duplicated properties.original_text_source
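
Since the goal in the question was to delete the extra copy, here is a minimal sketch of that last step. extraCopyIds is a hypothetical helper, not part of MongoDB; it assumes the result documents have the shape shown above, keeps the first id in each group, and returns the rest.

```javascript
// Hypothetical helper: given the aggregation results, return every _id except
// the first one of each group, i.e. the copies that could be deleted.
function extraCopyIds(duplicateGroups) {
  return duplicateGroups.flatMap((group) => group.idsForDuplicatedDocs.slice(1));
}

// Assumed mongosh usage (inspect the ids before deleting anything):
//   const groups = db.collection.aggregate([ /* the pipeline shown earlier */ ]).toArray();
//   db.collection.deleteMany({ _id: { $in: extraCopyIds(groups) } });
```

Because $addToSet does not guarantee any particular order, which copy ends up being kept is arbitrary. That should not matter if the duplicates are identical imports of the same file, but it is worth checking before running the delete.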

-1
"reviewAndRating": [
    {
      "review": "aksjdhfkashdfkashfdkjashjdkfhasdkjfhsafkjhasdkjfhasdjkfhsdakfj",
      "productId": "5bd956f29fcaca161f6b7517",
      "_id": "5bd9745e2d66162a6dd1f0ef",
      "rating": "5"
    },
    {
      "review": "aksjdhfkashdfkashfdkjashjdkfhasdkjfhsafkjhasdkjfhasdjkfhsdakfj",
      "productId": "5bd956f29fcaca161f6b7518",
      "_id": "5bd974612d66162a6dd1f0f0",
      "rating": "5"
    },
    {
      "review": "aksjdhfkashdfkashfdkjashjdkfhasdkjfhsafkjhasdkjfhasdjkfhsdakfj",
      "productId": "5bd956f29fcaca161f6b7517",
      "_id": "5bd974622d66162a6dd1f0f1",
      "rating": "5"
    }
  ]

3 Comments

Hi @SUmit, can you elaborate a bit more on why this is a solution?
While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. Please also try not to crowd your code with explanatory comments, this reduces the readability of both the code and the explanations!
@GuillaumeRAYMOND I have no issue now, thanks for replying.
