
We need to reindex Elasticsearch documents from one index to another, and we're using the reindex API for that. Sometimes, though, the document already exists in the destination index. Setting version_type: "external" causes the document in the destination index to be updated, which works great, except that it performs a full update; I'd like it to perform a partial update on that document instead. Something like setting ctx.op = "partial" would be nice, but apparently it is not implemented as of today. Any alternative ideas for achieving this would be appreciated.

PS: I'd like to avoid querying the source index for every document and sending each one individually to the destination with an upsert. For performance reasons, that seems like it would be quite slow compared to the reindex API.

1 Answer


Disclaimer: this answer has been updated.

To achieve a partial update, you may define a script.

In theory you may apply any transformation you want to the document being reindexed.
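As a rough illustration of what such a script could look like, the sketch below builds the JSON body for a `POST _reindex` request. The index names, the helper name `build_reindex_body`, and the field being removed are all hypothetical. It assumes a recent Elasticsearch version where the script language is Painless and the script text goes under `source`; older releases used different script settings.

```python
import json


def build_reindex_body(source_index, dest_index, script_source, params=None):
    """Build the JSON body for POST _reindex with a per-document script.

    The script runs against each document read from the source index,
    before it is written to the destination index.
    """
    return {
        "source": {"index": source_index},
        # version_type "external" is the setting mentioned in the question.
        "dest": {"index": dest_index, "version_type": "external"},
        "script": {
            "lang": "painless",
            "source": script_source,
            "params": params or {},
        },
    }


# Hypothetical example: drop one field from every document being reindexed.
body = build_reindex_body(
    "index_a",
    "index_b",
    "ctx._source.remove(params.field)",
    {"field": "obsolete_field"},
)
print(json.dumps(body, indent=2))
```

Note that `ctx._source` here is the document coming from the source index; the script has no access to the copy that may already exist in the destination.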

(End of original answer.)


Implementing custom reindex-and-merge

As the author of the question pointed out, a script does not help if one needs to merge two documents: the one that already exists in the destination index and the new one coming from the source index.

The Elasticsearch _reindex API was introduced in version 2.3 and was considered experimental; it looks like it was simply a combination of a scroll query with the bulk insert API. I draw this conclusion from the fact that this page in the Definitive Guide suggests reindexing your data in this way:

To reindex all of the documents from the old index efficiently, use scroll to retrieve batches of documents from the old index, and the bulk API to push them into the new index.

Now, to address the need for a partial update. The process of reindex-and-merge can be roughly divided into four stages:

  1. reading a document from index A
  2. reading the matching document from index B
  3. merging the two documents
  4. inserting the merged document into B

Stages 1 and 4 are what the original reindex call already does; what is different now is the need to join with another index and merge the documents.
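The merge stage (stage 3) is the part that needs custom code. A minimal sketch, assuming the desired semantics are "fields from index A win, fields present only in index B are kept, nested objects are merged recursively"; the function name `merge_documents` and the sample documents are hypothetical:

```python
import copy


def merge_documents(existing, incoming):
    """Return a new dict: `existing` (from index B) updated by `incoming` (from index A).

    Nested dicts are merged recursively; any other value in `incoming`
    replaces the value in `existing`. Neither input is modified.
    """
    merged = copy.deepcopy(existing)
    for key, value in incoming.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_documents(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


doc_in_b = {"title": "old", "stats": {"views": 10, "likes": 2}}
doc_from_a = {"title": "new", "stats": {"views": 11}}
merged = merge_documents(doc_in_b, doc_from_a)
print(merged)  # {'title': 'new', 'stats': {'views': 11, 'likes': 2}}
```

Arrays are replaced rather than concatenated in this sketch; if your documents need a different rule, this function is the place to change it.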

I would propose writing a custom script that uses scroll to read index A in a streaming fashion, the multi get API (_mget) to retrieve the matching documents from index B in batches, custom code to merge the documents, and the bulk API to insert the merged documents. The performance of such a script should be at least comparable to that of the original reindex implementation. (Also make sure that you check out this page with indexing performance tuning tips; in particular, increase or disable index.refresh_interval.)

There are of course other options that are not specific to Elasticsearch and that the author of this question might have already considered (such as dumping both indexes, joining them with custom code, and inserting the result into the new index).
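To make the per-batch work concrete, here is a sketch of the request bodies such a script could build for one batch of scroll hits. It only constructs payloads and does not talk to a cluster. The helper names, the index name, and the default shallow merge are hypothetical; it assumes the batched lookup in index B is done with the multi get endpoint (`_mget`) and that document IDs are the same in both indexes.

```python
import json


def mget_body(hits):
    """Body for GET /<index_b>/_mget: look up the IDs of one scroll batch."""
    return {"ids": [hit["_id"] for hit in hits]}


def bulk_payload(hits, mget_docs, dest_index, merge=lambda old, new: {**old, **new}):
    """Build the NDJSON body for POST /_bulk from one scroll batch.

    hits      -- documents read from index A (each has "_id" and "_source")
    mget_docs -- the "docs" array of the _mget response from index B
    merge     -- callable(existing, incoming) -> merged source (shallow by default)
    """
    existing = {d["_id"]: d["_source"] for d in mget_docs if d.get("found")}
    lines = []
    for hit in hits:
        source = merge(existing.get(hit["_id"], {}), hit["_source"])
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        lines.append(json.dumps(source))
    # The bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical batch: document "1" exists in both indexes, "2" only in A.
hits = [
    {"_id": "1", "_source": {"title": "new"}},
    {"_id": "2", "_source": {"title": "only in A"}},
]
mget_docs = [
    {"_id": "1", "found": True, "_source": {"title": "old", "tags": ["x"]}},
    {"_id": "2", "found": False},
]
payload = bulk_payload(hits, mget_docs, "index_b")
print(payload)
```

A real script would repeat this for every scroll page and send `payload` with the `Content-Type: application/x-ndjson` header. Because each merged document is built from a read that happened slightly earlier, concurrent writes to index B during the run could be overwritten.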

Hope this helps.


2 Comments

Yes but unless I'm mistaken, that transformation will be applied to the data coming from the source index, not to the destination document itself
@SebScoFr Yes, you are right. I will delete my answer since it is not relevant.
