import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

// Read the file one line per row and attach an increasing id so the header
// line and the last line can be located later.
val data = spark.read
  .text(filePath)
  .toDF("val")
  .withColumn("id", monotonically_increasing_id())

val count = data.count()

// The header record has the form "H|*|COL1|*|COL2|*|...|##|".
val header = data.where("id==1").collect().map(s => s.getString(0)).apply(0)

val columns = header
  .replace("H|*|", "")
  .replace("|##|", "")
  .split("\\|\\*\\|")

val structSchema = StructType(columns.map(s => StructField(s, StringType, true)))

// Keep everything between the header and the last line, then rebuild records:
// records are separated by "|##|" and fields by "|*|".
var correctData = data.where('id > 1 && 'id < count - 1).select("val")
var dataString = correctData.collect().map(s => s.getString(0)).mkString("").replace("\\\n", "").replace("\\\r", "")
var dataArr = dataString.split("\\|\\#\\#\\|").map(s => {
  var arr = s.split("\\|\\*\\|")
  // Pad short records so every row has one value per column.
  while (arr.length < columns.length) arr = arr :+ ""
  RowFactory.create(arr: _*)
})
val finalDF = spark.createDataFrame(sc.parallelize(dataArr), structSchema)

display(finalDF)

This portion of code is giving an error:

Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError: Java heap space

After hours of debugging, I found that mainly this part:

var dataArr = dataString.split("\\|\\#\\#\\|").map(s =>{ 
                                                          var arr = s.split("\\|\\*\\|")
                                                          while(arr.length < columns.length) arr = arr :+ ""
                                                          RowFactory.create(arr:_*)
                                                         })
    val finalDF = spark.createDataFrame(sc.parallelize(dataArr),structSchema)

is causing the error.

I changed that part to:

var dataArr = dataString.split("\\|\\#\\#\\|").map(s =>{
                                                          var arr = s.split("\\|\\*\\|")
                                                          while(arr.length < columns.length) arr = arr :+ ""
                                                          RowFactory.create(arr:_*)
                                                         }).toList
  val finalDF = sqlContext.createDataFrame(sc.makeRDD(dataArr),structSchema)

But the error remains the same. What should I change to avoid this?

When I run this code on a Databricks Spark cluster, the particular job gives this Spark driver error:

Job aborted due to stage failure: Serialized task 45:0 was 792585456 bytes, which exceeds max allowed: spark.rpc.message.maxSize (268435456 bytes).

I added this portion of code:

spark.conf.set("spark.rpc.message.maxSize",Int.MaxValue)

but it was of no use.
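From what I could find, spark.rpc.message.maxSize is specified in MB (capped at 2047) and is only read when the cluster starts, so setting it at runtime with spark.conf.set probably never takes effect; it would have to go into the cluster's Spark config, or at session creation time, roughly like the sketch below. And even then, the 792585456-byte task is about 756 MB, well above the 268435456-byte (256 MB) limit the cluster currently allows.

import org.apache.spark.sql.SparkSession

// Sketch only: RPC settings are read once, when the SparkContext is created,
// so spark.conf.set at runtime cannot change them. The value is in MB and
// must not exceed 2047.
val spark = SparkSession.builder()
  .config("spark.rpc.message.maxSize", "1024")
  .getOrCreate()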


1 Answer


My guess is that

var dataString = correctData.collect().map(s => s.getString(0)).mkString("").replace("\\\n","").replace("\\\r","")

is the problem, because you collect (almost) all of the data onto the driver, i.e. into a single JVM.

Maybe this line runs, but subsequent operations on dataString will exceed your memory limits. You should not collect your data! Instead, work with distributed data structures such as DataFrames or RDDs.

I think you could just omit the collect in the above line and keep the processing distributed.
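For example, here is a minimal sketch of a fully distributed version. It assumes the format implied by your code (a header record starting with H|*|, fields separated by |*|, records separated by |##|) and that your Spark version supports a multi-character lineSep on the text reader (recent versions do); the filters are placeholders you would adapt:

import org.apache.spark.sql.functions._
import spark.implicits._

// Read one record per "|##|" delimiter, so multi-line records arrive intact
// on the executors and nothing needs to be collected to the driver.
val raw = spark.read
  .option("lineSep", "|##|")
  .text(filePath)
  .withColumn("value", regexp_replace($"value", "[\\r\\n]", ""))
  .filter(length(trim($"value")) > 0)

// The header is a single small record, so collecting just that one row is fine.
val header  = raw.filter($"value".startsWith("H|*|")).head.getString(0)
val columns = header.stripPrefix("H|*|").split("\\|\\*\\|")

// Split each record into fields on the executors and name the columns.
// (The last record, which your code currently drops, would need to be
// filtered out here too if it is not real data.)
val finalDF = raw
  .filter(!$"value".startsWith("H|*|"))
  .select(split($"value", "\\|\\*\\|").as("arr"))
  .select(columns.zipWithIndex.map { case (name, i) =>
    $"arr".getItem(i).as(name)  // missing trailing fields become null, not ""
  }: _*)

display(finalDF)

This way the whole dataset stays on the executors, so neither the driver heap nor spark.rpc.message.maxSize becomes a bottleneck.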


1 Comment

The code fails if I remove collect. As I am pretty new to this domain, can you help me with this? What exactly should I do here?
