MongoDB query to find document with duplicate value in array

Question

tl;dr: I'm struggling to construct a query that aggregates a count of the values of a certain key ("original_text_source"), where that key is in a sub-document that is itself inside an array.

Full description

I have embedded documents with arrays that are structured like this:

{
    "_id" : ObjectId("0123456789"),
    "type" : "some_object",
    "relationships" : {
        "x" : [ ObjectId("0123456789") ],
        "y" : [ ObjectId("0123456789") ],
    },
    "properties" : [ 
        {
            "a" : "1"
        }, 
        {
            "b" : "1"
        }, 
        {
            "original_text_source" : "foo.txt"
        },
    ]
}

The docs were created from exactly 10k text files, sorted into various folders. While inserting the documents into MongoDB (in batches), I messed up and moved a few files around, which caused one file to be imported twice (my database has a count of exactly 10001 docs), but I don't know which one it is. Since one of the "original_text_source" values must have a count of 2, I was planning to just delete one of the two documents.

I read up on solutions with $elemMatch, but since my array element is a document, I'm not sure how to proceed. Maybe with mapReduce? But I can't transfer the logic to my doc structure.

I could also just create a new collection and re-upload everything, but in case I mess up again, I'd rather learn how to query for duplicates. It seems more elegant :-)

barbakini · Accepted Answer · 2017-10-05 13:32:05Z

4

You can find duplicates with a simple aggregation like this:

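db.collection.aggregate([
  // Group by the file name and collect the _id of every document that shares it
  { $group: { _id: "$properties.original_text_source", docIds: { $push: "$_id" }, docCount: { $sum: 1 } } },
  // Keep only the file names that occur more than once
  { $match: { docCount: { $gt: 1 } } }
])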

which gives you something like this:

{
    "_id" : [
        "foo.txt"
    ],
    "docIds" : [
        ObjectId("59d6323613940a78ba1d5ffa"),
        ObjectId("59d6324213940a78ba1d5ffc")
    ],
    "docCount" : 2.0
}
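
The "_id" in the result above is an array because "$properties.original_text_source" is resolved through the "properties" array, so it collects the value from every array element that has the key. That still works for grouping here. If a plain string key is preferred, one option is to $unwind the array first. A minimal sketch, assuming the collection is named "collection" as in the answer; the variable name duplicateSourcesPipeline is only illustrative and not part of the original answer:

```javascript
// Sketch: group on the scalar file name instead of a one-element array.
// "collection" is a placeholder name; duplicateSourcesPipeline is illustrative.
var duplicateSourcesPipeline = [
  // Emit one document per element of the "properties" array.
  { $unwind: "$properties" },
  // Keep only the elements that actually carry the key of interest.
  { $match: { "properties.original_text_source": { $exists: true } } },
  // Group on the plain string value and collect the owning document ids.
  { $group: {
      _id: "$properties.original_text_source",
      docIds: { $push: "$_id" },
      docCount: { $sum: 1 }
  } },
  // A file name that was imported more than once has a count above 1.
  { $match: { docCount: { $gt: 1 } } }
];

// Runs only inside the mongo shell / mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  db.collection.aggregate(duplicateSourcesPipeline).forEach(printjson);
}
```

With this variant each result's "_id" should be the file name itself rather than an array containing it.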


1 Comment

fukiburi Over a year ago

Thank you! Works like a charm :-)

glytching · Accepted Answer · 2017-10-05 13:53:04Z

Run the following:

db.collection.aggregate([
  { $group: {
    _id: { name: "$properties.original_text_source" },
    idsForDuplicatedDocs: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} }
]);

Given a collection which contains two copies of the document you showed in your question, the above command will return:

{
    "_id" : {
        "name" : [ 
            "foo.txt"
        ]
    },
    "idsForDuplicatedDocs" : [ 
        ObjectId("59d631d2c26584cd8b7b3337"), 
        ObjectId("59d631cbc26584cd8b7b3333")
    ],
    "count" : 2
}

Where:

The attribute _id.name is the value of the duplicated properties.original_text_source.
The attribute idsForDuplicatedDocs contains the _id values of the documents that share that properties.original_text_source value.
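
Since the goal in the question is to delete the extra copy, the idsForDuplicatedDocs arrays can be turned into a list of ids to remove. The following is a rough sketch under stated assumptions, not part of the original answer: idsToDelete is a hypothetical helper, "collection" is a placeholder name, and the delete call is left commented out so the list can be checked first.

```javascript
// Hypothetical helper: given the groups returned by the aggregation above,
// list every _id except the first one of each group.
function idsToDelete(duplicateGroups) {
  var ids = [];
  duplicateGroups.forEach(function (group) {
    // slice(1) skips the first id, so one copy per file name is kept.
    ids = ids.concat(group.idsForDuplicatedDocs.slice(1));
  });
  return ids;
}

// Runs only inside the mongo shell / mongosh, where `db` is defined.
if (typeof db !== "undefined") {
  var groups = db.collection.aggregate([
    { $group: {
        _id: { name: "$properties.original_text_source" },
        idsForDuplicatedDocs: { $addToSet: "$_id" },
        count: { $sum: 1 }
    } },
    { $match: { count: { $gte: 2 } } }
  ]).toArray();
  var extraIds = idsToDelete(groups);
  printjson(extraIds); // check this list before deleting anything
  // Uncomment once the list looks right:
  // db.collection.deleteMany({ _id: { $in: extraIds } });
}
```

$addToSet does not guarantee any order, so which copy counts as "first" is arbitrary. That should not matter when the documents are true duplicates of the same file.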

SUmit RUhela (edited by Guillaume Raymond) · Answer · 2018-10-31 11:41:42Z

-1

"reviewAndRating": [
    {
      "review": "aksjdhfkashdfkashfdkjashjdkfhasdkjfhsafkjhasdkjfhasdjkfhsdakfj",
      "productId": "5bd956f29fcaca161f6b7517",
      "_id": "5bd9745e2d66162a6dd1f0ef",
      "rating": "5"
    },
    {
      "review": "aksjdhfkashdfkashfdkjashjdkfhasdkjfhsafkjhasdkjfhasdjkfhsdakfj",
      "productId": "5bd956f29fcaca161f6b7518",
      "_id": "5bd974612d66162a6dd1f0f0",
      "rating": "5"
    },
    {
      "review": "aksjdhfkashdfkashfdkjashjdkfhasdkjfhsafkjhasdkjfhasdjkfhsdakfj",
      "productId": "5bd956f29fcaca161f6b7517",
      "_id": "5bd974622d66162a6dd1f0f1",
      "rating": "5"
    }
  ]

edited Oct 31, 2018 at 11:41

Guillaume Raymond


answered Oct 31, 2018 at 10:01

SUmit RUhela


3 Comments

Guillaume Raymond Over a year ago

hi @SUmit can you elaborate a bit more why it is a solution ?

Filnor Over a year ago

While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. Please also try not to crowd your code with explanatory comments, this reduces the readability of both the code and the explanations!

SUmit RUhela Over a year ago

@GuillaumeRAYMOND now i have no issue thanks for replying me
