Ok so the stack trace given above is not sufficient to understand the root cause on its own, but since you mention you are using a join, that is most likely where it's happening. I faced the same issue with a join; if you dig into your stack trace you will see something like -
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#73300L])
+- *Project
+- *BroadcastHashJoin
...
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
This hints at why it's failing: Spark tries to perform the join using a Broadcast Hash Join, which has both a timeout and a broadcast size threshold, and exceeding either one causes the error above. To fix this, depending on the underlying cause -
Increase the "spark.sql.broadcastTimeout", default is 300 sec -
from pyspark.sql import SparkSession

# Allow the broadcast exchange more time to complete (default is 300 seconds)
spark = SparkSession \
    .builder \
    .appName("AppName") \
    .config("spark.sql.broadcastTimeout", "1800") \
    .getOrCreate()
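You can check that the value took effect with spark.conf.get (a quick sketch using the spark session built above):

spark.conf.get("spark.sql.broadcastTimeout")  # returns '1800'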
Or increase the broadcast threshold, "spark.sql.autoBroadcastJoinThreshold" (the default is 10 MB) -
# Raise the threshold; the value is in bytes (20485760 bytes is roughly 20 MB)
spark = SparkSession \
    .builder \
    .appName("AppName") \
    .config("spark.sql.autoBroadcastJoinThreshold", "20485760") \
    .getOrCreate()
Or disable broadcast joins altogether by setting the value to -1 -
# -1 disables broadcast joins, so Spark falls back to a sort-merge join
spark = SparkSession \
    .builder \
    .appName("AppName") \
    .config("spark.sql.autoBroadcastJoinThreshold", "-1") \
    .getOrCreate()
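If the session is already running you don't have to rebuild it; both settings are runtime SQL confs, so (as a minimal sketch, assuming spark is your existing SparkSession) they can be changed on the fly:

# Change the same settings on an existing session at runtime
spark.conf.set("spark.sql.broadcastTimeout", "1800")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")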
More details can be found here - https://spark.apache.org/docs/latest/sql-performance-tuning.html
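One more option worth knowing: instead of a session-wide threshold, you can control broadcasting per join with the broadcast hint from pyspark.sql.functions. A minimal sketch, with two hypothetical DataFrames df_large and df_small built here just for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("AppName").getOrCreate()

# Hypothetical example data, just for illustration
df_large = spark.range(1000000).withColumnRenamed("id", "key")
df_small = spark.range(100).withColumnRenamed("id", "key")

# Explicitly broadcast only the small side for this one join,
# independent of the session-wide threshold
joined = df_large.join(broadcast(df_small), on="key")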