
I noticed a strange behavior in Elasticsearch (version 5.5.0) where store.size decreased while docs.count increased. Why does this happen?

$ curl 'localhost:9201/_cat/indices/index-name:2017-08-08?bytes=b&v'
health status index                 uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index-name:2017-08-08 PlpLYu5vTN-HFA_ygHUNwg  17   1    5577181       212434 3827072602     1939889776

$ curl 'localhost:9201/_cat/indices/index-name:2017-08-08?bytes=b&v'
health status index                 uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index-name:2017-08-08 PlpLYu5vTN-HFA_ygHUNwg  17   1    5581202       204815 3812410150     1927833617

Note that while docs.count increased from 5577181 to 5581202, both store.size and pri.store.size decreased.

For background, I'm trying to use index size to throttle the data going into ES (i.e. at most x GB per day). However, I notice that as I continue indexing, the index size decreases periodically (every hour or so, sometimes within minutes). This makes it a poor throttling signal, since the storage size isn't strictly increasing.
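Here is roughly the check I'm doing (just a sketch; the index name and the 10 GB limit below are placeholders):

$ # grab the primary store size in bytes for today's index
$ SIZE=$(curl -s 'localhost:9201/_cat/indices/index-name:2017-08-08?bytes=b&h=pri.store.size' | tr -d ' ')
$ # stop sending data once the daily limit (here 10 GB) is crossed
$ [ "$SIZE" -gt 10737418240 ] && echo "daily limit reached, throttling"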

1) Any idea why the index size decreases?
2) Is there another size I should use that is strictly increasing?

EDIT: Actually, even when there are no deleted documents, the index size still decreases. See below:

$ curl -s localhost:9200/_cat/indices | grep name
green open index-name:2017-08-11 eIGiDgeZQ5CqSu3tAaLRgw 1 1 111717 0 210.4mb 109.5mb

$ curl -s localhost:9200/_cat/indices | grep name
green open index-name:2017-08-11 eIGiDgeZQ5CqSu3tAaLRgw 1 1 132329 0 204.7mb 103.2mb

2 Answers

The Elasticsearch cluster compresses indices over time, so the _stats API (and _cat/indices) may show the index size shrinking for a while even as documents are added. An index of similar documents may be compressed by as much as 40%.

EDIT: as mentioned in the other answer, segment merges happen under the hood as long as documents keep being indexed. After each merge, the new segment appears to be compressed afresh, and since compressing data together tends to be at least as effective as compressing the pieces separately (compress(A) + compress(B) >= compress(A+B)), the index can shrink even while its document count grows.
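You can watch this happen: the _cat/segments API lists the live segments per shard, and polling it while indexing should show several small segments being replaced by fewer, larger ones (command is a sketch using the index name from the question):

$ curl -s 'localhost:9201/_cat/segments/index-name:2017-08-08?v&h=shard,segment,docs.count,docs.deleted,size'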


So you have 4021 additional documents (5581202 - 5577181), but note that the count of deleted documents, docs.deleted, also decreased, by 7619 (212434 - 204815), so the net change in the number of documents physically stored in your index (live + deleted) is -3598. This is due to Lucene merging segments under the hood in order to purge deleted documents and reclaim unused space.

That's the most probable reason why the overall index size decreased by 14662452 bytes (~14 MB).

If you want to throttle, you can use docs.count instead; if you're constantly indexing, that number should keep increasing.
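For example, a minimal check along those lines (same index name as in your question):

$ # docs.count excludes deleted documents, so with pure inserts it only grows
$ curl -s 'localhost:9201/_cat/indices/index-name:2017-08-08?h=docs.count'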

14 Comments

Do you happen to know why there would be deleted documents then? I'm only indexing.
If you index a document a second time (i.e. with the same ID), the existing document is marked as deleted and the new version is written separately; documents are never updated in place. Deleted documents are not necessarily ones you have deleted via HTTP DELETE, but also all older versions of existing documents. That's why Lucene regularly cleans up the index by merging segments and removing deleted documents. (There's a short demonstration of this after these comments.)
Hmm, I don't expect any updates since I should be indexing new documents all the time, but I might have to go revisit my code to make sure. Thanks!
But 17 primary shards for your daily index of 3GB sounds like way more than would be necessary. A single shard is capable of holding several GB of data. Granted, I don't know your use case, though.
that's because the day has only started. The index will end up with roughly 17*25GB since I allocate 26GB to my data node
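A quick demonstration of the point about re-indexed IDs above (hypothetical index and document names, just to show the mechanism):

$ # index the same document ID twice
$ curl -s -XPUT 'localhost:9200/test-dupes/doc/1' -d '{"field": "v1"}'
$ curl -s -XPUT 'localhost:9200/test-dupes/doc/1' -d '{"field": "v2"}'
$ # docs.count stays at 1; the first version shows up in docs.deleted until the next merge
$ curl -s 'localhost:9200/_cat/indices/test-dupes?v&h=docs.count,docs.deleted'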