
I am using the Hadoop + ELK stack to build an analytics stack. I am trying to refresh the index on a daily basis.

I am consuming data from a third party in CSV format. I have no control over the input data, i.e. I cannot ask for the schema of the CSV file to be changed.

The problem is that there is no unique id in the CSV records, and combining columns to make a unique id will not work either. So while refreshing, Elasticsearch adds duplicate data to the index.

So if day 1's data is

Product1,Language1,Date1,$1
Product2,Language2,Date1,$12

day 2's data becomes

Product1,Language1,Date1,$1
Product2,Language2,Date1,$12
Product1,Language1,Date1,$1
Product2,Language2,Date1,$12
Product3,Language1,Date2,$5 (new record added on day 2)

Is there any good way to handle this in ELK? I am using Logstash to consume the CSV files.

2 Answers


I think it's all about the document "_id".

If you had a unique "_id" per document, there would be no problem, as you'd just "update" the document to the same value. You could even set the mapping to disallow updates, if needed.

Your problem is that the "_id" of the doc is not tied to the content of the document (which is fine for some cases).

I guess a simple solution would be to create your own "my_id" field and point the "_id" path at it.

The problem then becomes how to create that "my_id" field. I'd use a hash of the document.

An example Python snippet would be (I'm sure you could find an appropriate Ruby plugin for Logstash):

import hashlib

# SHA-1 of the raw CSV line; the hex digest can serve as "my_id"
hash_object = hashlib.sha1(b"Product2,Language2,Date1,$12")
hex_dig = hash_object.hexdigest()
print(hex_dig)
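Applied to the whole pipeline, the same idea is to hash each CSV row and carry the digest along as an id column. A minimal sketch, assuming the row layout from the question (file names are placeholders):

import csv
import hashlib

# File names below are placeholders; adjust to your actual paths.
with open("day2.csv", newline="") as src, \
     open("day2_with_id.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Hash the raw row values; the hex digest becomes the "my_id" column.
        my_id = hashlib.sha1(",".join(row).encode("utf-8")).hexdigest()
        writer.writerow([my_id] + row)

Inside Logstash itself, the fingerprint filter plugin can compute the same hash over a set of source fields, and the elasticsearch output's document_id option can then reference the resulting field, so no external preprocessing is needed.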

Comments

Thanks @eran for the response. But as I mentioned, the combination "Product2,Language2,Date1,$12" is also not unique, so it's not possible for me to build a unique id from these fields.
Hmmm... Do you have access to the line number? Otherwise, it's not an Elasticsearch problem. You'd have the same problem with any database.
I know it's not an Elasticsearch problem. I can see only one way out: recreating the index every time, which is a bit costly. I was looking for something someone might have done before.
How about adding a line number to the record before writing it, and using that number as the doc id?
I considered that too, but an insertion can happen at any position in the CSV file, which disturbs the positions. Say I number the rows 1, 2, 3, and then a new record is inserted between 2 and 3: the whole ordering will be wrong.

I believe the first part of the solution is to identify a set of values that, taken together, is unique per document. If no such set exists, there is no way to separate duplicate documents from genuine ones. For the sake of discussion, let's say the four values (Product1,Language1,Date1,$1) define a document: if another document carries the same set of values, it is a duplicate of the previous document and not a new one.

Given (Product1,Language1,Date1,$1), you can first run a query that checks whether this document already exists in Elasticsearch. Something like:

{
    "filter": {
        "bool": {
            "must": [
                { "term": { "pdtField": "Product1" } },
                { "term": { "langField": "Language1" } },
                { "term": { "dateField": "Date1" } },
                { "term": { "costField": "$1" } }
            ]
        }
    }
}

Take care to adjust the field names used here to whatever you are actually using. If this filter returns a doc_count != 0, you need not create a new document for it; otherwise, create a new document with the values in hand.
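A sketch of that check-then-insert flow with the Python client; the index name and field names are assumptions, and the query is written in the newer bool/filter form rather than the top-level filter shown above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def is_duplicate(doc, index="products"):
    # True if a document with the same four values is already indexed.
    query = {"bool": {"filter": [
        {"term": {"pdtField": doc["pdtField"]}},
        {"term": {"langField": doc["langField"]}},
        {"term": {"dateField": doc["dateField"]}},
        {"term": {"costField": doc["costField"]}},
    ]}}
    return es.count(index=index, body={"query": query})["count"] > 0

doc = {"pdtField": "Product1", "langField": "Language1",
       "dateField": "Date1", "costField": "$1"}
if not is_duplicate(doc):
    es.index(index="products", body=doc)

Note that this costs one search round trip per row, which is workable for small daily files but slow at scale.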

Alternatively, you can build a document id from a hash of (Product1,Language1,Date1,$1) and use that hash as the _id of the document. First check whether any document with this _id exists; if it does not, create a new document with the values in hand under the hash-generated _id.
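Since the _id lookup is exact, this variant avoids the extra search entirely. A minimal sketch (index and field names again hypothetical) that uses the create op type, so an existing document is skipped rather than overwritten:

import hashlib
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch("http://localhost:9200")

row = "Product1,Language1,Date1,$1"
doc_id = hashlib.sha1(row.encode("utf-8")).hexdigest()
doc = dict(zip(["pdtField", "langField", "dateField", "costField"],
               row.split(",")))

try:
    # op_type="create" fails with 409 Conflict if the _id already exists,
    # so a re-ingested duplicate is skipped instead of overwritten.
    es.index(index="products", id=doc_id, body=doc, op_type="create")
except ConflictError:
    pass  # already indexed on a previous run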

If you do not have control over how individual documents are created, you can instead try preprocessing your CSV input with the strategy above: keep only the needed entries in the CSV, drop the rest, and carry on as usual with the resulting CSV.
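For that preprocessing route, a small stdlib-only sketch that drops every row already present in the previous day's file, under this answer's premise that identical rows are duplicates (file names are placeholders):

import hashlib

def line_hashes(path):
    # One SHA-1 per raw CSV line; cheaper to hold in memory than the lines.
    with open(path) as f:
        return {hashlib.sha1(line.rstrip("\n").encode("utf-8")).hexdigest()
                for line in f}

seen = line_hashes("day1.csv")  # everything already indexed

with open("day2.csv") as src, open("day2_new_only.csv", "w") as dst:
    for line in src:
        digest = hashlib.sha1(line.rstrip("\n").encode("utf-8")).hexdigest()
        if digest not in seen:
            dst.write(line)  # keep only rows not seen before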

