45

I have added new mappings (mainly not_analyzed versions of existing fields) and now have to figure out how to reindex the existing data. I have tried following the guide on the Elasticsearch website, but it is just too confusing. I have also tried using plugins (elasticsearch-reindex, allegro/elasticsearch-reindex-tool). I have looked at ElasticSearch - Reindexing your data with zero downtime, which is a similar question. I was hoping not to have to rely on external tools (if possible) and to use the bulk API instead (as with the original insert).

I could easily rebuild the whole index, as it's really read-only data, but that won't work in the long term if I want to add more fields etc. once I'm in production. I wondered if anyone knows of an easy-to-understand/follow solution or set of steps for a relative novice to ES. I'm on version 2 and using Windows.

5 Comments
  • What point version of Elasticsearch are you using? If you are using 2.3, the native _reindex API is available. It can do precisely what you're looking for. I'm not sure which guide you are referring to ("the guide on elastic search website"), but these are the docs on the reindex API: elastic.co/guide/en/elasticsearch/reference/current/… If I'm not mistaken, you can reindex into the same index, effectively leaving the data in place. There are document version issues you have to be aware of, though. Commented Jul 18, 2016 at 17:23
  • Yeah, I had this problem some months ago, but I too noticed the reindex API being available... I wasn't able to verify whether you can reindex into the same index. Commented Jul 19, 2016 at 19:46
  • It seems you cannot reindex into the same index. Commented Jul 21, 2016 at 21:19
  • I have the same problem. You can check this answer. Commented Aug 7, 2017 at 11:04
  • Here is a small process for creating new mappings on an existing index (with re-index): codeburst.io/… Commented Nov 5, 2018 at 17:06

6 Answers

29

Re-indexing means reading the data, deleting the data in Elasticsearch, and ingesting the data again. There is no such thing as changing the mapping of existing data in place. All the re-indexing tools you mentioned are just wrappers around read -> delete -> ingest.
You can always adjust the mapping for new indices and add fields later. All the new fields will be indexed with respect to this mapping. Or use dynamic mapping if you are not in control of the new fields.
Have a look at Change default mapping of string to "not analyzed" in Elasticsearch to see how to use dynamic mapping to get not_analyzed string fields.
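
For example, a dynamic template along these lines maps every newly seen string field as not_analyzed (a minimal sketch for ES 2.x; the index and type names are illustrative):

PUT /my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "strings_not_analyzed": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      ]
    }
  }
}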

Re-indexing is very expensive. A better way is to create a new index and drop the old one. To achieve this with zero downtime, use an index alias for all your clients. Think of an index called "data-version1". In steps:

  • create your index "data-version1" and give it an alias named "data"
  • only use the alias "data" in all your client applications
  • to update your mapping: create a new index (with the new mapping) called "data-version2" and put all your data into it (you can use the _reindex API for that; see the sketch below)
  • to switch from version1 to version2: drop the alias "data" on version1 and create an alias "data" on version2 (or first create, then drop). In the time between those two steps your clients will see no (or duplicate) data, but the window between dropping and creating the alias should be so short that your clients shouldn't notice it.

It's good practice to always use aliases.
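
A minimal sketch of those steps (assuming ES 2.3+ for _reindex; the index and alias names are the illustrative ones from above):

# create the first index and give it the alias "data"
PUT /data-version1
{
  "aliases": {
    "data": {}
  }
}

# later: create "data-version2" with the new mapping, then copy the data over
POST /_reindex
{
  "source": { "index": "data-version1" },
  "dest": { "index": "data-version2" }
}

# switch the alias; the add and remove happen atomically in one call
POST /_aliases
{
  "actions": [
    { "add": { "index": "data-version2", "alias": "data" } },
    { "remove": { "index": "data-version1", "alias": "data" } }
  ]
}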


7 Comments

Thanks for replying. I wanted to lean more towards the "zero downtime" approach. I can push in another dataset again, which will take 15-20 mins, with a new version of the mapping with both analyzed and not_analyzed fields present (that is a back-up plan). Really I wanted to explore the option of not having to do that when I'm in production.
You can add a new mapping only if you create a new index - sorry that wasn't clear in my post; I added this above. Most users have separate indices for each period in time (let's say daily). Then new fields and/or new mappings are applied to all newly created indices. I also added some thoughts on zero downtime to the post.
@dtrv do you know what would happen if new data was indexed by clients into "data-version1" while the reindexing command was running? Will it also get picked up?
@cah1r Clients should use the alias "data" and not index into "version1". Indexing into an alias is possible if there is only one index with that alias. And with that it's clear where the new data gets indexed: into index "version1" before the alias switch and into "version2" afterwards. You could also set the index to read-only to avoid new data arriving after the reindex process has started.
If an alias has more than one index and you want to index via that alias, see the is_write_index property of the _aliases API (example below).
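
A hedged example of that property (it is only available in newer Elasticsearch versions, 6.4+; the index and alias names are the illustrative ones from the answer):

POST /_aliases
{
  "actions": [
    { "add": { "index": "data-version2", "alias": "data", "is_write_index": true } },
    { "add": { "index": "data-version1", "alias": "data", "is_write_index": false } }
  ]
}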
17

With version 2.3.4 a new API, _reindex, is available, which does exactly what it says. Basic usage is:

POST /_reindex
{
    "source": {
        "index": "currentIndex"
    },
    "dest": {
        "index": "newIndex"
    }
}

2 Comments

You could reindex from "currentIndex" to a temporary index and then back to "currentIndex". You can use the op_type and version_type parameters to control how you handle duplicates/overwriting data (see the sketch after these comments).
That's what I ended up doing
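
A minimal sketch of that round trip (the index names are illustrative; version_type "external" in the second call preserves document versions from the source and only overwrites destination documents with an older version):

# copy everything to a temporary index
POST /_reindex
{
  "source": { "index": "currentIndex" },
  "dest": { "index": "currentIndex-tmp" }
}

# update the mapping of "currentIndex" (or delete and recreate it), then copy the data back
POST /_reindex
{
  "source": { "index": "currentIndex-tmp" },
  "dest": {
    "index": "currentIndex",
    "version_type": "external"
  }
}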
7

If you want, like me, a straight answer to this common and basic problem, which is poorly addressed by Elastic and the community in general, here is the code that works for me.

Assuming you are just debugging, not in a production environment, and it is absolutely legitimate to add or remove fields because you absolutely don't care about downtime or latency:

# First of all: block writes on the index so that it can be cloned
PUT /my_index/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

# clone the index into a temporary index
POST /my_index/_clone/my_index-000001

# re-enable writes on the original index (otherwise the reindex below would be rejected)
PUT /my_index/_settings
{
  "settings": {
    "index.blocks.write": false
  }
}

# copy all documents back into the original index to force them to be reindexed
POST /_reindex
{
  "source": {
    "index": "my_index-000001"
  },
  "dest": {
    "index": "my_index"
  }
}

# Finally, delete the temporary index
DELETE my_index-000001

2 Comments

Is there a typo in source and dest (shouldn't they be the other way around) when copying all documents back to the original index?
@tamersalama No, it's correct as written. "my_index-000001" is the newly created clone. You want to flow all that data back into the original index, "my_index".
2

Elasticsearch reindex from remote host to localhost example (Jan 2020 update)

# show indices on this host
curl 'localhost:9200/_cat/indices?v'

# edit elasticsearch configuration file to allow remote indexing
sudo vi /etc/elasticsearch/elasticsearch.yml

## copy the lines below somewhere in the file
# --- whitelist for remote indexing ---
reindex.remote.whitelist: my-remote-machine.my-domain.com:9200

# restart the Elasticsearch service
sudo systemctl restart elasticsearch

# run reindex from remote machine to copy the index named filebeat-2016.12.01
curl -H 'Content-Type: application/json' -X POST 127.0.0.1:9200/_reindex?pretty -d'{
  "source": {
    "remote": {
      "host": "http://my-remote-machine.my-domain.com:9200"
    },
    "index": "filebeat-2016.12.01"
  },
  "dest": {
    "index": "filebeat-2016.12.01"
  }
}'

# verify index has been copied
curl 'localhost:9200/_cat/indices?v'

Comments

2

Using an alias makes this very easy with no downtime issues and aliases are intended for this very scenario.

Using an alias means you have to do a small amount of housekeeping (i.e. deleting obsolete "real" indices) but this is pretty minimal. And you may also avoid having to lock and then unlock "index.blocks.write" on any index.

E.g. if you have a (real) index "my_real_index.2024-01-04.1" behind the alias "my_alias", and you've created a new real index "my_real_index.2024-01-04.2" with fresh new settings and mappings or whatever:

POST {ES_URL}/_reindex
{
    "source": {
        "index": "my_real_index.2024-01-04.1"
    },
    "dest": {
        "index": "my_real_index.2024-01-04.2"
    }
}

Then switch where the alias takes you:

POST {ES_URL}/_aliases
{
    "actions": [
        {"add": {"index": "my_real_index.2024-01-04.2", "alias": "my_alias"}},
        {"remove": {"index": "my_real_index.2024-01-04.1", "alias": "my_alias"}}
    ]
}

NB the above POST operation is atomic: in the blink of an eye the new real index will be used instead for anyone using {ES_URL}/my_alias/....
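
For instance, a client searching via the alias never needs to know which real index is current (an illustrative query):

POST {ES_URL}/my_alias/_search
{
    "query": {"match_all": {}}
}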

NB2 it is true that any changes to the contents of "my_real_index.2024-01-04.1" occurring between the above two POST operations would then be lost in the new index. If that is indeed a concern, lock "index.blocks.write" on the index being superseded before reindexing:

PUT {ES_URL}/my_real_index.2024-01-04.1/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

Evidently, in the normal course of things there is then no need to unlock my_real_index.2024-01-04.1 at any point, because it is no longer in use and can be deleted at a suitable time (you may need to put my_real_index.2024-01-04.2, the new real index, through some testing first... and in the worst case, revert to the old one).
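
When you're confident the old index is no longer needed, cleanup is a single call:

DELETE {ES_URL}/my_real_index.2024-01-04.1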

1 Comment

The link to the Elasticsearch aliases doc is nice; it shows the code example in different languages like Ruby, Python, JavaScript, and cURL. Thanks!
0

I faced the same problem, but I couldn't find any resource on updating the current index's mapping and analyzer in place. My suggestion is to use the scroll and scan API and reindex your data into a new index with the new mapping and new fields, as sketched below.
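
A rough sketch of that approach for ES 2.x (the index, type, and field names are illustrative; a real script would loop on the scroll calls until no hits remain, bulk-indexing each batch):

# start a scan-type scroll over the old index (search_type=scan is ES 2.x only)
POST /old_index/_search?scroll=1m&search_type=scan
{
  "size": 100
}

# fetch the next batch using the _scroll_id returned by the previous call
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}

# bulk-index each batch into the new index, which was created with the new mapping
POST /_bulk
{ "index": { "_index": "new_index", "_type": "my_type", "_id": "1" } }
{ "my_field": "value copied from the old document" }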

Comments
