
One of my ES nodes has failed with a java.lang.OutOfMemoryError: Java heap space error. Here is the full stack trace from the logs:

    [2020-09-18T04:25:04,215][WARN ][o.e.a.b.TransportShardBulkAction] [search1] [[my_index_4][0]] failed to perform indices:data/write/bulk[s] on replica [my_index_4][0], node[cm_76wfGRFm9nbPR1mJxTQ], [R], s[STARTED], a[id=BUpviwHxQK2qC3GrELC2Hw]
org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected
[2020-09-18T04:25:04,215][WARN ][o.e.c.a.s.ShardStateAction] [search1] [my_index_4][0] received shard failed for shard id [[my_index_4][0]], allocation id [BUpviwHxQK2qC3GrELC2Hw], primary term [2], message [failed to perform indices:data/write/bulk[s] on replica [my_index_4][0], node[cm_76wfGRFm9nbPR1mJxTQ], [R], s[STARTED], a[id=BUpviwHxQK2qC3GrELC2Hw]], failure [NodeDisconnectedException[[search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected
[2020-09-18T04:25:04,215][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [search1] failed to execute on node [cm_76wfGRFm9nbPR1mJxTQ]
org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][cluster:monitor/nodes/info[n]] disconnected
[2020-09-18T04:25:04,219][INFO ][o.e.c.r.a.AllocationService] [search1] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[my_index_4][0]] ...]).
[2020-09-18T04:25:05,450][INFO ][o.e.m.j.JvmGcMonitorService] [search1] [gc][11099506] overhead, spent [605ms] collecting in the last [1.4s]
[2020-09-18T04:25:05,453][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [search1] fatal error in thread [elasticsearch[search1][search][T#5]], exiting
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSource$GlobalOrdinalValuesSource.<init>(CompositeValuesSource.java:137) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSource.wrapGlobalOrdinals(CompositeValuesSource.java:123) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesComparator.<init>(CompositeValuesComparator.java:50) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregator.<init>(CompositeAggregator.java:69) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationFactory.createInternal(CompositeAggregationFactory.java:52) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:216) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:216) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:55) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:105) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$14(IndicesService.java:1133) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService$$Lambda$2241/341562582.accept(Unknown Source) ~[?:?]
at org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$15(IndicesService.java:1186) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService$$Lambda$2242/1286052129.get(Unknown Source) ~[?:?]
at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:160) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:143) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:412) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:116) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1192) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1132) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:305) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:340) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:316) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:312) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService$3.doRun(SearchService.java:1002) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]

Because of the exception above, I am getting a master_not_discovered_exception when I hit any of the ES APIs.

Question: Can anyone tell me the next steps I should perform to bring Elasticsearch back to a normal state? Is there a way to restart the disconnected node?

1 Comment
  • this error means that whatever the process was, it is now dead (I am not very familiar with cassandra), so it seems you need to start this process again? Commented Sep 18, 2020 at 15:12

2 Answers


First, let me briefly explain what might have caused this issue:

  1. As shown in the logs, you seem to be running a costly aggregation. Aggregations are in general memory intensive, and here the garbage collector (GC) could not reclaim enough heap, so eventually your application (ES) ran out of memory and the process was killed.
  2. Apart from the costly aggregation visible in the logs, high memory consumption can also be caused by heavy search and indexing requests, so please have a look at both the search and index slow logs on this node; refer to the ES slow logs documentation for more info.

Now, coming to the resolution part:

This ES node is dead, which is causing the master_not_discovered_exception, so it is important to restart this node and see if the exception goes away.
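
A minimal sketch of that restart plus a recovery check, assuming the node is managed by systemd and ES listens on the default HTTP port 9200 (adjust the service name and port to your setup):

    sudo systemctl restart elasticsearch.service

    # watch cluster health; it should move back from RED/YELLOW to GREEN
    # once the node rejoins and the failed replica shards recover
    curl -X GET "localhost:9200/_cluster/health?pretty"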

Prevention of the OOM exception

  1. You should properly configure the circuit breakers available in ES and, if possible, upgrade to ES 7.x, which has better circuit breakers based on real memory usage (a hedged settings sketch follows this list).
  2. Improve ES indexing and search performance.
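
As a hedged illustration of point 1 (the percentages below are placeholders, not recommendations), the request and total circuit breaker limits can be tightened dynamically through the cluster settings API:

    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "indices.breaker.request.limit": "40%",
        "indices.breaker.total.limit": "70%"
      }
    }'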

4 Comments

Thanks for your answer. Indeed, I was playing with the composite aggregation size parameter. I set its value to Integer.MAX_VALUE and, according to the stack trace, the exception was thrown while initializing the aggregation values array (this.values = new long[size];), where the size param is taken from the query.
@I.Domshchikov glad to hear that my answer was helpful. I see in your answer that you mentioned your next steps; waiting for your update.
Quick question for you, related to ES availability under the high load caused by my app. My ES heap settings are -Xms1g -Xmx1g. My app uses a composite query to fetch unique filter values, to allow users to filter Foo entities (by city, name, ...). Usually there are 3 filters available, so the app sends 3 composite queries in parallel to get the unique values. For each filter I want to retrieve all of its unique values, and because I don't know in advance how many exist, I set the size param equal to the Foo entity count (7-10k). Do you think I may face the OutOfMemoryError again?
@I.Domshchikov A 1GB heap is really small for ES, and you want to use it for aggregations; a huge size param causes even more memory consumption, so you might hit the OOM error again.

The java.lang.OutOfMemoryError: Java heap space was caused by running the composite aggregation query for which I set the size parameter to Integer.MAX_VALUE:

{
    "size": 0,
    "aggregations": {
        "myParam.keyword": {
            "composite": {
                "size": 2147483647,
                "sources": [
                    {
                        "myParam.keyword": {
                            "terms": {
                                "field": "myParam.keyword",
                                "order": "asc"
                            }
                        }
                    }
                ]
            }
        }
    }
}

According to the stack trace, the error occurred during initialization of the aggregation values array at CompositeValuesSource.java:137:

GlobalOrdinalValuesSource(ValuesSource.Bytes.WithOrdinals vs, int size, int reverseMul) {
    super(vs, size, reverseMul);
    this.values = new long[size];
}

Here, the size parameter comes straight from the query, so the constructor tried to allocate a long[Integer.MAX_VALUE] array, i.e. roughly 2^31 × 8 bytes ≈ 16 GB, far more than the available heap.

The answer https://stackoverflow.com/a/63965634/5284890 confirms the root cause.
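
A safer pattern, which (as mentioned in the comments below) I later switched to, keeps size bounded and pages through the buckets with the composite aggregation's after key. A minimal sketch, using an illustrative page size of 1000 and the index name from the logs above; each follow-up request passes the previous response's after_key as after:

    curl -X GET "localhost:9200/my_index_4/_search" -H 'Content-Type: application/json' -d'
    {
      "size": 0,
      "aggregations": {
        "myParam.keyword": {
          "composite": {
            "size": 1000,
            "sources": [
              { "myParam.keyword": { "terms": { "field": "myParam.keyword", "order": "asc" } } }
            ]
          }
        }
      }
    }'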

My next step was stopping and starting Elasticsearch again using the following commands:

sudo systemctl stop elasticsearch.service
sudo systemctl start elasticsearch.service

My following steps will be to check the circuit breaker settings in ES, as suggested in this answer: https://stackoverflow.com/a/63965634/5284890.

3 Comments

@OpsterElasticsearchNinja, working on it right now. I increased the heap size to 3GB and also refactored the app logic that builds the composite agg query to use pagination (elastic.co/guide/en/elasticsearch/reference/current/…), which is designed to cover the case where I don't know the number of elements to be returned in advance.
@OpsterElasticsearchNinja, sorry for the delayed answer. So far, we have only upgraded the ES cluster. We have 3 nodes in it, and each node's JVM heap was increased to 16GB; we followed these rules to increase it: elastic.co/guide/en/elasticsearch/reference/current/…. After that a load test was performed, and the cluster was able to handle the required load. We haven't updated ES yet; the decision is to do it later. Once that is done, I will work on the circuit breaker stuff. Thanks for your help and support with solving this issue.
No worries, glad to hear back from you :)
