I am using the following ES query when looking for duplicates:

"aggs": {
    "duplicates": {
        "terms": {
            "field": "phone",
            "min_doc_count": 2,
            "size": 99999,
            "order": {
                "_term": "asc"
            }
        },
        "aggs": {
            "_docs": {
                "top_hits": {
                    "size": 99999
                }
            }
        }
    }
}

It works well: it returns the key, which in this case is the phone, and inside each bucket it returns all the matching documents. The problem is that _source contains every field, which is a lot in my case, and I want to return only the ones I need. Example of what is returned:

        "duplicates": {
                "1": {
                    "key": "1",
                    "doc_count": 2,
                    "_docs": {
                        "hits": {
                            "total": 2,
                            "max_score": 1,
                            "hits": [
                                {
                                    "_index": "local:company_id:1:sync",
                                    "_type": "leads",
                                    "_id": "23",
                                    "_score": 1,
                                    "_source": {
                                        "id": 23,
                                        "phone": 123456,
                                        "areacode_id": 426,
                                        "areacode_state_id": 2,
                                        "firstName": "Brayan",
                                        "lastName": "Rastelli",
                                        "state": "", // .... and so on

Is it possible to specify which fields are returned in _source?

Another problem is that I want to order the aggregation results by a specific field (id), but if I put any field name in place of _term, I get an error.

Thank you!

1 Answer

In the example below, the documents with id 29 and 23 have the same phone, so they are duplicates. The search query returns only two fields, id and phone (change these to whichever fields you need), and sorts the top hits within each bucket by id.

Here is a working example with the index data, search query, and search result.

Index Data:

{
  "id": 29,
  "phone": 123456,
  "areacode_id": 426,
  "areacode_state_id": 2,
  "firstName": "Brayan",
  "lastName": "Rastelli",
  "state": ""
}
{
  "id": 23,
  "phone": 123456,
  "areacode_id": 426,
  "areacode_state_id": 2,
  "firstName": "Brayan",
  "lastName": "Rastelli",
  "state": ""
}
{
  "id": 30,
  "phone": 1235,
  "areacode_id": 92,
  "areacode_state_id": 10,
  "firstName": "Mark",
  "lastName": "Smith",
  "state": ""
}

Search Query:

{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "phone",
        "min_doc_count": 2,
        "size": 99999
      },
      "aggs": {
        "_docs": {
          "top_hits": {
            "_source": {
              "includes": [
                "phone",
                "id"
              ]
            },
            "sort": [
              {
                "id": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Search Result:

"aggregations": {
    "duplicates": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 123456,
          "doc_count": 2,
          "_docs": {
            "hits": {
              "total": {
                "value": 2,
                "relation": "eq"
              },
              "max_score": null,
              "hits": [
                {
                  "_index": "66896259",
                  "_type": "_doc",
                  "_id": "1",
                  "_score": null,
                  "_source": {
                    "phone": 123456,
                    "id": 23
                  },
                  "sort": [
                    23                       // note this
                  ]
                },
                {
                  "_index": "66896259",
                  "_type": "_doc",
                  "_id": "2",
                  "_score": null,
                  "_source": {
                    "phone": 123456,
                    "id": 29
                  },
                  "sort": [
                    29                         // note this
                  ] 
                }
              ]
            }
          }
        }
      ]
    }
  }
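
Note that the sort in top_hits orders the hits inside each bucket, not the buckets themselves. A terms aggregation can order its buckets by _key (called _term before Elasticsearch 6.0), by _count, or by a single-value sub-aggregation, which is most likely why a plain field name in "order" is rejected. Below is a minimal sketch, assuming that ordering the buckets by their smallest id is what is wanted. The sub-aggregation name min_id is made up for illustration. The top_hits size of 100 is also an assumption: without an explicit size, top_hits returns only 3 hits per bucket, and as far as I know recent versions reject values above the index.max_inner_result_window setting (100 by default).

```json
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "phone",
        "min_doc_count": 2,
        "size": 99999,
        "order": {
          "min_id": "asc"
        }
      },
      "aggs": {
        "min_id": {
          "min": {
            "field": "id"
          }
        },
        "_docs": {
          "top_hits": {
            "size": 100,
            "_source": {
              "includes": [
                "phone",
                "id"
              ]
            },
            "sort": [
              {
                "id": {
                  "order": "asc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
```

With this, each bucket carries a min_id value and the buckets come back in ascending order of that value; a max or avg sub-aggregation could be swapped in the same way if a different ordering is needed.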