
I have 25 Hive scripts with 200 Hive queries in each. I am running each HQL file with the spark-sql command on my AWS EMR cluster, launching all the spark-sql commands in parallel using the shell's & operator. The same HQLs run successfully using Hive on Tez; I am trying spark-sql to improve performance. With spark-sql, however, only 2-3 scripts execute fine; the remaining HQLs fail with a "Connection reset by peer" error. I believe it is because of a lack of resources in the YARN cluster for Spark.

When I observed the YARN console, I could see it utilizing the full memory of the cluster even though I specified executor and driver memory in the command.

Can someone help me identify the exact reason for this issue?

Below is my EMR cluster configuration:

Data Nodes : 6
RAM per Node : 56 GB
Cores per Node: 32
Instance Type: m4.4xlarge

command used in unix:

spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql1.hql &
spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql2.hql &
spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql3.hql &
.....
spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql25.hql
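The same launch, written as a loop for readability (just a sketch, assuming the intermediate scripts follow the same hqlN.hql naming as hql1.hql, hql2.hql, and hql25.hql above):

# Launch all 25 spark-sql jobs in the background, then wait for all of them.
for i in $(seq 1 25); do
  spark-sql --master yarn --num-executors 12 --executor-memory 20G \
            --executor-cores 15 --driver-memory 10G -f "hql${i}.hql" &
done
wait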

When I run all the above commands in parallel, only 2 to 3 jobs execute correctly and the rest fail with the below error.

[Stage 904:==>      (8271 + 32) / 30800][Stage 905:>            (0 + 0) / 30800]
17/04/13 11:35:10 WARN TransportChannelHandler: Exception in connection from /10.134.22.114:47550
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:745)
17/04/13 11:35:10 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /10.134.22.114:47550 is closed
17/04/13 11:35:10 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(53329,61600,Map(ip-10-134-22-6.eu-central-1.compute.internal -> 12262, ip-10-134-22-67.eu-central-1.compute.internal -> 16940, ip-10-134-22-106.eu-central-1.compute.internal -> 17876, ip-10-134-22-46.eu-central-1.compute.internal -> 16400, ip-10-134-22-114.eu-central-1.compute.internal -> 14902, ip-10-134-22-105.eu-central-1.compute.internal -> 44820)) to AM was unsuccessful
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)

2 Answers


I believe it is because of lack of resources in yarn cluster for spark.

I think so too, and I would strongly recommend using the YARN UI to see how the resources are actually being used.

Regardless of what you see in the YARN UI, I did some calculations, and it appears that you have too few resources to run all 25 scripts at the same time.

Given...

Data Nodes : 6
RAM per Node : 56 GB
Cores per Node: 32
Instance Type: m4.4xlarge

it appears that you've got 6 x 56 GB = 336 GB and 6 x 32 cores = 192 cores.

After the following command:

spark-sql --master yarn --num-executors 12 --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql1.hql

you've reserved 12 x 20 GB = 240 GB and 12 x 15 = 180 cores (plus 10 GB for the driver), which is roughly 70% of the cluster's memory and almost all of its cores, and that's only for the first spark-sql.

I think the issue is the & that puts each spark-sql in the background: with 25 spark-sql invocations competing for resources that barely fit one, most of them starve. I'm not surprised they fail. One way to act on this diagnosis is to cap how many jobs run concurrently, as sketched below.
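As a sketch that goes beyond the diagnosis above, assuming GNU xargs is available and the scripts are literally named hql1.hql through hql25.hql, you could run at most two jobs at a time so each actually gets the resources it requests:

# Cap concurrency at 2; xargs starts the next spark-sql as each one finishes.
# Assumes scripts hql1.hql .. hql25.hql exist in the current directory.
seq 1 25 | xargs -P 2 -I{} spark-sql --master yarn --num-executors 12 \
    --executor-memory 20G --executor-cores 15 --driver-memory 10G -f hql{}.hql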


Disabling Spark dynamic allocation should resolve the issue.

Even though we set executor memory in the command, Spark's dynamic allocation keeps requesting additional executors whenever resources are free in the cluster, so a single job can grow far beyond the figures specified. To restrict each job to only the executors it asks for, set the dynamic allocation parameter to false.

You can change it directly in the Spark configuration file or pass it as a --conf parameter on the command line.

spark-sql --master yarn --num-executors 1 --executor-memory 20G --executor-cores 20 --driver-memory 4G --conf spark.dynamicAllocation.enabled=false -f hive1.hql
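If you prefer the configuration-file route, a minimal sketch (the path below is the default on EMR and may differ in other setups):

# /etc/spark/conf/spark-defaults.conf -- default location on EMR; adjust if needed
spark.dynamicAllocation.enabled    false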
