
One of my ES nodes has failed with a java.lang.OutOfMemoryError: Java heap space error. Here is the full stack trace from the logs:

    [2020-09-18T04:25:04,215][WARN ][o.e.a.b.TransportShardBulkAction] [search1] [[my_index_4][0]] failed to perform indices:data/write/bulk[s] on replica [my_index_4][0], node[cm_76wfGRFm9nbPR1mJxTQ], [R], s[STARTED], a[id=BUpviwHxQK2qC3GrELC2Hw]
org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected
[2020-09-18T04:25:04,215][WARN ][o.e.c.a.s.ShardStateAction] [search1] [my_index_4][0] received shard failed for shard id [[my_index_4][0]], allocation id [BUpviwHxQK2qC3GrELC2Hw], primary term [2], message [failed to perform indices:data/write/bulk[s] on replica [my_index_4][0], node[cm_76wfGRFm9nbPR1mJxTQ], [R], s[STARTED], a[id=BUpviwHxQK2qC3GrELC2Hw]], failure [NodeDisconnectedException[[search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected
[2020-09-18T04:25:04,215][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [search1] failed to execute on node [cm_76wfGRFm9nbPR1mJxTQ]
org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][cluster:monitor/nodes/info[n]] disconnected
[2020-09-18T04:25:04,219][INFO ][o.e.c.r.a.AllocationService] [search1] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[my_index_4][0]] ...]).
[2020-09-18T04:25:05,450][INFO ][o.e.m.j.JvmGcMonitorService] [search1] [gc][11099506] overhead, spent [605ms] collecting in the last [1.4s]
[2020-09-18T04:25:05,453][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [search1] fatal error in thread [elasticsearch[search1][search][T#5]], exiting
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSource$GlobalOrdinalValuesSource.<init>(CompositeValuesSource.java:137) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSource.wrapGlobalOrdinals(CompositeValuesSource.java:123) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesComparator.<init>(CompositeValuesComparator.java:50) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregator.<init>(CompositeAggregator.java:69) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationFactory.createInternal(CompositeAggregationFactory.java:52) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:216) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:216) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:55) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:105) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$14(IndicesService.java:1133) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService$$Lambda$2241/341562582.accept(Unknown Source) ~[?:?]
at org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$15(IndicesService.java:1186) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService$$Lambda$2242/1286052129.get(Unknown Source) ~[?:?]
at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:160) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:143) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:412) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:116) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1192) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1132) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:305) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:340) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:316) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:312) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService$3.doRun(SearchService.java:1002) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]

Because of the exception above, I am getting a master_not_discovered_exception when I hit any of the ES APIs.

Question: Can anyone tell me the next steps I should perform to bring Elasticsearch back to a normal state? Is there a way to restart the disconnected node?

1 Comment
  • this error means that whatever the process was, it is now dead (I am not very familiar with cassandra), so it seems you need to start this process again? Commented Sep 18, 2020 at 15:12

2 Answers


First, let me briefly explain what might have caused this issue:

  1. As shown in the logs, you seem to be running a costly aggregation. Aggregations are in general memory intensive, and here the garbage collector (GC) could not reclaim enough heap, so eventually your application (ES) ran out of memory and the process was killed.
  2. Apart from the costly aggregation visible in the logs, high memory consumption can also be caused by heavy search and indexing requests, so please have a look at both the search and index slow logs on this node; refer to the ES slow logs documentation for more info.

Now, coming to the resolution part:

This ES node is dead, which is causing the master_not_discovered_exception, so it is important to restart this node and see if the exception goes away.
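
A minimal sketch of that restart plus a recovery check, assuming the node is managed by systemd and ES listens on the default HTTP port 9200 (adjust the service name and port to your setup):

    sudo systemctl restart elasticsearch.service

    # watch cluster health; it should move back from RED/YELLOW to GREEN
    # once the node rejoins and the failed replica shards recover
    curl -X GET "localhost:9200/_cluster/health?pretty"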

Prevention of the OOM exception

  1. You should properly configure the circuit breakers available in ES and, if possible, upgrade to ES 7.x, which has better circuit breakers based on real memory usage (a hedged settings sketch follows this list).
  2. Improve ES indexing and search performance.
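
As a hedged illustration of point 1 (the percentages below are placeholders, not recommendations), the request and total circuit breaker limits can be tightened dynamically through the cluster settings API:

    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "indices.breaker.request.limit": "40%",
        "indices.breaker.total.limit": "70%"
      }
    }'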

4 Comments

Thanks for your answer. Indeed, I was playing with the composite aggregation size parameter. I set its value to Integer.MAX_VALUE and, according to the stack trace, the exception was thrown while initializing the aggregation values array (this.values = new long[size];), where the size param is taken from the query.
@I.Domshchikov glad to hear that my answer was helpful. I see in your answer that you mentioned your next steps; waiting for your update.
Quick question for you, related to ES availability under the high load caused by my app. My ES heap settings are -Xms1g -Xmx1g. My app uses a composite query to fetch unique filter values, to allow users to filter Foo entities (by city, name, ...). Usually there are 3 filters available, so the app sends 3 composite queries in parallel to get the unique values. For each filter I want to retrieve all of its unique values, and because I don't know in advance how many exist, I set the size param equal to the Foo entity count (7-10k). Do you think I may face the OutOfMemoryError again?
@I.Domshchikov A 1GB heap is really small for ES, and you want to use it for aggregations; a huge size param causes even more memory consumption, so you might hit the OOM error again.

The java.lang.OutOfMemoryError: Java heap space was caused by running the composite aggregation query for which I set the size parameter to Integer.MAX_VALUE:

{
    "size": 0,
    "aggregations": {
        "myParam.keyword": {
            "composite": {
                "size": 2147483647,
                "sources": [
                    {
                        "myParam.keyword": {
                            "terms": {
                                "field": "myParam.keyword",
                                "order": "asc"
                            }
                        }
                    }
                ]
            }
        }
    }
}

According to the stack trace, the error occurred during initialization of the aggregation values array at CompositeValuesSource.java:137:

GlobalOrdinalValuesSource(ValuesSource.Bytes.WithOrdinals vs, int size, int reverseMul) {
    super(vs, size, reverseMul);
    this.values = new long[size];
}

Here, the size parameter comes straight from the query, so the constructor tried to allocate a long[Integer.MAX_VALUE] array, i.e. roughly 2^31 × 8 bytes ≈ 16 GB, far more than the available heap.

The answer https://stackoverflow.com/a/63965634/5284890 confirms the root cause.
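
A safer pattern, which (as mentioned in the comments below) I later switched to, keeps size bounded and pages through the buckets with the composite aggregation's after key. A minimal sketch, using an illustrative page size of 1000 and the index name from the logs above; each follow-up request passes the previous response's after_key as after:

    curl -X GET "localhost:9200/my_index_4/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 0,
      "aggregations": {
        "myParam.keyword": {
          "composite": {
            "size": 1000,
            "sources": [
              { "myParam.keyword": { "terms": { "field": "myParam.keyword", "order": "asc" } } }
            ]
          }
        }
      }
    }'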

My next step was stopping and starting Elasticsearch again using the following commands:

sudo systemctl stop elasticsearch.service
sudo systemctl start elasticsearch.service

My following steps will be to check the circuit breaker settings in ES, as suggested in this answer: https://stackoverflow.com/a/63965634/5284890.

3 Comments

@OpsterElasticsearchNinja, working on it right now. I increased the heap size to 3GB and also refactored the app logic that builds the composite agg query to use pagination (elastic.co/guide/en/elasticsearch/reference/current/…), which is designed to cover the case where I don't know the number of elements to be returned in advance.
@OpsterElasticsearchNinja, sorry for the delayed answer. So far, we have only upgraded the ES cluster. We have 3 nodes in it, and each node's JVM heap was increased to 16GB; we followed these rules to increase it: elastic.co/guide/en/elasticsearch/reference/current/…. After that a load test was performed, and the cluster was able to handle the required load. We haven't updated ES yet; the decision is to do it later. Once that is done, I will work on the circuit breaker stuff. Thanks for your help and support with solving this issue.
No worries, glad to hear back from you :)
