I am running a HiveQL job on AWS EMR and receive the error below. The cluster has 39 m3.2xlarge nodes (8 vCPU, 30 GB memory, 2 x 80 GB SSD storage each), for a total of roughly 1.1 TB of memory.
The HiveQL script loads data from S3 and creates a smaller main data table in ORC format. Quite a few intermediate tables build successfully before the error. The statement that failed was a select count(distinct ...) from <main data table>.
Is there a way to clean/clear out memory before each new statement? Do I need to adjust the heap size? What else can I provide to give a better sense of the data and environment?
Error...
Diagnostic Messages for this Task:
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:381)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
    at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
    at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
    at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:297)
    at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:287)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:411)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:341)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
Have you tried adjusting mapreduce.map.memory.mb and mapreduce.map.java.opts, and likewise mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts? You can adjust these from within your Hive script with "SET" commands, as long as you don't bump into YARN limits (e.g. yarn.scheduler.maximum-allocation-mb).

For reference, the current values on this cluster are:

Configuration Option                   Default Value
mapreduce.map.java.opts                -Xmx1152m
mapreduce.reduce.java.opts             -Xmx2304m
mapreduce.map.memory.mb                1440
mapreduce.reduce.memory.mb             2880
yarn.scheduler.minimum-allocation-mb   1440
yarn.scheduler.maximum-allocation-mb   23040
yarn.nodemanager.resource.memory-mb    23040

Parameter                              Value
YARN_RESOURCEMANAGER_HEAPSIZE          2703
YARN_PROXYSERVER_HEAPSIZE              2703
YARN_NODEMANAGER_HEAPSIZE              2048
HADOOP_JOB_HISTORYSERVER_HEAPSIZE      2703
HADOOP_NAMENODE_HEAPSIZE               3276
HADOOP_DATANODE_HEAPSIZE               1064
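A minimal sketch of what those SET commands could look like at the top of the Hive script. The specific numbers are illustrative assumptions, not tuned recommendations:

```sql
-- Illustrative values only; tune for your data volume. Container sizes must
-- stay at or below yarn.scheduler.maximum-allocation-mb (23040 MB on this
-- cluster), and -Xmx is conventionally set to about 80% of the container size.
SET mapreduce.map.memory.mb=4096;
SET mapreduce.map.java.opts=-Xmx3276m;
SET mapreduce.reduce.memory.mb=8192;
SET mapreduce.reduce.java.opts=-Xmx6553m;
```

A SET command applies to the statements that follow it in the same session, so you can also place a larger reducer allocation immediately before the failing count(distinct ...) statement rather than raising it for the whole script.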