I believe the first step is to identify a set of field values that, taken together, uniquely identifies a document. Without such a key, there is no way to separate duplicate documents from genuine ones.
For the sake of discussion, let's say the four values (Product1, Language1, Date1, $1) define a document. If another document arrives with the same set of values, it is a duplicate of the existing document, not a new one.
Given (Product1, Language1, Date1, $1), you can first run a query that checks whether such a document already exists in Elasticsearch. Something like:
{
  "filter": {
    "bool": {
      "must": [
        { "term": { "pdtField": "Product1" } },
        { "term": { "langField": "Language1" } },
        { "term": { "dateField": "Date1" } },
        { "term": { "costField": "$1" } }
      ]
    }
  }
}
Adjust the field names used here (pdtField, langField, and so on) to match whatever you are actually using.
If this query reports a non-zero count (doc_count != 0), the document already exists and you need not create a new one. Otherwise, create a new document with the values in hand.
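The check-then-create flow can be sketched in Python. This is a minimal sketch, not a definitive implementation: the field names (pdtField, langField, dateField, costField) and the index name are placeholders from the discussion above, and the commented client calls assume the elasticsearch-py library.

```python
def build_duplicate_query(product, language, date, cost):
    """Build a term query matching documents with exactly this value set.

    Each value is matched with a `term` clause, so the fields are assumed
    to be keyword-style (exact-match) fields, not analyzed text.
    """
    return {
        "bool": {
            "must": [
                {"term": {"pdtField": product}},
                {"term": {"langField": language}},
                {"term": {"dateField": date}},
                {"term": {"costField": cost}},
            ]
        }
    }

# Against a live cluster you would run something like (not executed here):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# query = build_duplicate_query("Product1", "Language1", "Date1", "$1")
# if es.count(index="docs", query=query)["count"] == 0:
#     es.index(index="docs", document={"pdtField": "Product1", ...})
```

Note that this check-then-insert pattern has a race window if several writers run concurrently; the hash-based _id approach below avoids that.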
Alternatively, you can derive a document ID by hashing (Product1, Language1, Date1, $1) and use that hash as the document's _id. First check whether a document with this _id already exists; if it does not, create a new document with the values in hand under the generated _id.
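A sketch of the hash-based _id idea, assuming SHA-1 as the hash (any stable hash works) and the same placeholder field values as above:

```python
import hashlib

def make_doc_id(product, language, date, cost):
    """Derive a deterministic _id from the identifying value set.

    The unit-separator character between fields guards against ambiguous
    concatenations (e.g. "ab" + "c" vs "a" + "bc").
    """
    key = "\x1f".join([product, language, date, cost])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

doc_id = make_doc_id("Product1", "Language1", "Date1", "$1")

# Indexing with op_type="create" makes Elasticsearch reject the write with
# a 409 conflict if a document with this _id already exists, so the
# duplicate check and the insert happen atomically (sketch, not executed):
# es.index(index="docs", id=doc_id, op_type="create", document={...})
```

Because the _id is derived from the values themselves, the same value set always maps to the same document, with no separate lookup query needed.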
If you do not have control over how individual documents are created, you can instead preprocess your CSV input using the same strategy: keep only the first occurrence of each value set, discard the rest, and then proceed as usual with the resulting CSV.
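The CSV preprocessing step can be sketched as follows. The column names here are hypothetical, reused from the query example above; substitute your actual CSV headers.

```python
def deduplicate_rows(rows, key_fields=("pdtField", "langField", "dateField", "costField")):
    """Yield only the first row seen for each identifying value set."""
    seen = set()
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            yield row

# Usage with the stdlib csv module (sketch, not executed here):
# import csv
# with open("input.csv", newline="") as fin, \
#      open("output.csv", "w", newline="") as fout:
#     reader = csv.DictReader(fin)
#     writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
#     writer.writeheader()
#     writer.writerows(deduplicate_rows(reader))
```

The `seen` set holds one tuple per unique document, so this works in one pass as long as the set of unique keys fits in memory.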