Importing and updating data in Elasticsearch

Question

We have an existing search function that involves data across multiple tables in SQL Server. This causes a heavy load on our DB, so I'm trying to find a better way to search through this data (it doesn't change very often). I have been working with Logstash and Elasticsearch for about a week using an import containing 1.2 million records. My question is essentially, "how do I update existing documents using my 'primary key'"?

CSV data file (pipe delimited) looks like this:

369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA

My logstash config looks like this:

input {
  stdin {
    type => "stdin-type"
  }

  file {
    path => ["C:/Data/sample/*"]
    start_position => "beginning"
  }
}

filter {
  csv {
    columns => ["property_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}

output {
  elasticsearch {
    embedded => true
    index => "samples4"
    index_type => "sample"
  }
}

A document in elasticsearch, then looks like this:

{
   "_index": "samples4",
   "_type": "sample",
   "_id": "64Dc0_1eQ3uSln_k-4X26A",
   "_score": 1.4054651,
   "_source": {
   "message": [
      "369|90045|123 ABC ST|LOS ANGELES|CA\r"
   ],
   "@version": "1",
   "@timestamp": "2014-02-11T22:58:38.365Z",
   "host": "[host]",
   "path": "C:/Data/sample/sample.csv",
   "property_id": "369",
   "postal_code": "90045",
   "address_1": "123 ABC ST",
   "city": "LOS ANGELES",
   "state_code": "CA"
}

I think would like the unique ID in the _id field, to be replaced with the value of property_id. The idea is that subsequent data files would contain updates. I don't need to keep previous versions and there wouldn't be a case where we added or removed keys from a document.

The document_id setting for elasticsearch output doesn't put that field's value into _id (it just put in "property_id" and only stored/updated one document). I know I'm missing something here. Am I just taking the wrong approach?

EDIT: WORKING!

Using @rutter's suggestion, I've updated the output config to this:

output {
  elasticsearch {
    embedded => true
    index => "samples6"
    index_type => "sample"
    document_id => "%{property_id}"
  }
}

Now documents are updating by dropping new files into the data folder as expected. _id and property_id are the same value.

{
   "_index": "samples6",
   "_type": "sample",
   "_id": "351",
   "_score": 1,
   "_source": {
   "message": [
      "351|90045|Easy as 123 ST|LOS ANGELES|CA\r"
   ],
   "@version": "1",
   "@timestamp": "2014-02-12T16:12:52.102Z",
   "host": "TXDFWL3474",
   "path": "C:/Data/sample/sample_update_3.csv",
   "property_id": "351",
   "postal_code": "90045",
   "address_1": "Easy as 123 ST",
   "city": "LOS ANGELES",
   "state_code": "CA"
}

You can overwrite a document by sending another document with the same ID... but that might be tricky with your previous data, since you'll get randomized IDs by default. You can set an ID using the output filter's document_id field, but it takes a literal string, not a field name. To use a field's contents, you could use an sprintf format string, such as %{property_id}. — rutter
– rutter, Commented Feb 12, 2014 at 0:43
@rutter - Can you create an answer using sprintf format string so I can mark it correct? That worked perfectly. Thank you. — Adrian J. Moreno
– Adrian J. Moreno, Commented Feb 12, 2014 at 16:08
Anothe example using checksums and %{document_id}: github.com/shurane/GettingStartedWithELK/blob/checksum/Examples/… — Ehtesh Choudhury
– Ehtesh Choudhury, Commented Jul 23, 2014 at 4:50
Thanks @sjngm, it's been relocated to github.com/coolacid/GettingStartedWithELK/blob/master/Examples/… — Ehtesh Choudhury
– Ehtesh Choudhury, Commented Sep 16, 2014 at 16:25

rutter · Accepted Answer · 2014-02-12 20:09:52Z

14

Converting from comment:

You can overwrite a document by sending another document with the same ID... but that might be tricky with your previous data, since you'll get randomized IDs by default.

You can set an ID using the output plugin's document_id field, but it takes a literal string, not a field name. To use a field's contents, you could use an sprintf format string, such as %{property_id}.

Something like this, for example:

output {
  elasticsearch {
    ... other settings...
    document_id => "%{property_id}"
  }
}

answered Feb 12, 2014 at 20:09

rutter

11.5k1 gold badge32 silver badges47 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Adrian J. Moreno Over a year ago

Thank you. Working code & output have been added to the question.

MosheZada · Accepted Answer · 2017-02-25 19:33:00Z

1

declaimer - I'm the author of ESL
You can use elasticsearch_loader to load psv files into elasticsearch.
In order to set the _id field you can use --id-field=property_id. for instance:
elasticsearch_loader --index=myindex --type=mytype --id-field=property_id csv --delimiter='|' filename.csv

answered Feb 25, 2017 at 19:33

MosheZada

2,4291 gold badge17 silver badges17 bronze badges

Comments

Akshay · Accepted Answer · 2014-02-12 10:53:26Z

0

Have you tried changing the config to this:

filter {
  csv {
    columns => ["_id","postal_code","address_1","city","state_code"]
    separator => "|"
  }
}

By naming property_id as _id it should get used during indexing.

answered Feb 12, 2014 at 10:53

Akshay

3,4111 gold badge24 silver badges19 bronze badges

2 Comments

Adrian J. Moreno Over a year ago

I should have mentioned that I tried this. For whatever reason, it didn't run the import at all.

eliasah Over a year ago

This answer in incorrect. It returns the following error : org.elasticsearch.index.mapper.MapperParsingException: failed to parse [_id]

Collectives™ on Stack Overflow

Importing and updating data in Elasticsearch

3 Answers 3

1 Comment

Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related