Optimize Hive Query. java.lang.OutOfMemoryError: Java heap space/GC overhead limit exceeded

Question

How can I optimize a query of this form since I keep running into this OOM error? Or come up with a better execution plan? If I removed the substring clause, the query would work fine, suggesting that this takes a lot of memory.

When the job fails, the beeline output shows the OOM Java heap space. Readings online suggested that I increase export HADOOP_HEAPSIZE but this still results in the error. Another thing I tried was increasing the hive.tez.container.size and hive.tez.java.opts (tez heap), but still has this error. In the YARN logs, there would be GC overhead limit exceeded, suggesting a combination of not enough memory and/or the query plan is extremely inefficient since it can't collect back enough memory.

I am using Azure HDInsight Interactive Query 4.0. 20 worker node, D13v2 8 core, and 56GB RAM.

Source table

create external table database.sourcetable(
  a,
  b,
  c,
  ...
  (183 total columns)
  ...
)
PARTITIONED BY ( 
  W string, 
  X int, 
  Y string, 
  Z int
)

Target Table

create external table database.NEWTABLE(
  a,
  b,
  c,
  ...
  (187 total columns)
  ...
  W,
  X,
  Y,
  Z
)
PARTITIONED BY (
  aAAA,
  bBBB
)

Query

insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
a,
b,
c,
...
(187 total columns)
...
W,
X,
Y,
Z,
cast(a as string) as aAAA, 
from_unixtime(unix_timestamp(b,'yyMMdd'),'yyyyMMdd') as bBBB,
substring(upper(c),1,2) as cCCC
from database.sourcetable

Hello, If the below answer is helpful for you, you can accept it as answer( click on the check mark beside the answer to toggle it from greyed out to filled in.). This can be beneficial to other community members. Thank you. — CHEEKATLAPRADEEP
– CHEEKATLAPRADEEP, Commented Jul 15, 2020 at 5:34

leftjoin · Accepted Answer · 2020-07-09 18:14:49Z

1

If everything else is okay, try to add distribute by partiton key at the end of your query:

  from database.sourcetable 
  distribute by aAAA, bBBB, cCCC

As a result each reducer will create only one partition file, consuming less memory

edited Jul 9, 2020 at 18:14

answered Jul 9, 2020 at 11:10

leftjoin

38.5k8 gold badges64 silver badges126 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

user7644509 Over a year ago

I added the target table to give a clear picture. Is the target table and query correct in partitioning? And mind me ask, why is it that we distribute by the source table partition instead of the target table?

leftjoin Over a year ago

@user7644509 Sorry, missed it. It shorld be target partitioning. Read your question in a wrong way.. U r right

davidemm · Accepted Answer · 2020-07-10 13:58:49Z

0

Try sorting the partitioned columns:

SET hive.optimize.sort.dynamic.partition=true;

When enabled, dynamic partitioning column will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer thereby reducing the memory pressure on reducers.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

answered Jul 10, 2020 at 13:58

davidemm

2,0091 gold badge24 silver badges32 bronze badges

Collectives™ on Stack Overflow

Optimize Hive Query. java.lang.OutOfMemoryError: Java heap space/GC overhead limit exceeded

Source table

Target Table

Query

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

Source table

Target Table

Query

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related