
The error can be reproduced in the spark-shell. I define a class with a method that produces an RDD, and then I perform a map operation on that RDD, which triggers the serialization error.

If I remove the method and instead execute its statements directly, everything works.

The code below can be run in the spark-shell: I define a class and then instantiate it.

First, the imports:

import java.nio.file.Files
import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.io.Source
import scala.collection.mutable.ArrayBuffer

Here is the class, with a method and a map with an anonymous function that counts separators in a string:

class DataValidation(datasetPath: String, datasetName: String, separator: String,
                     encoding: String, sc: SparkContext) extends Serializable {

  // Path to the dataset; the RDD is read without a header
  var dataset = datasetPath + "//" + datasetName + ".csv"

  // Read the file with the given encoding and distribute its lines as an RDD
  def textfile_encoding(datasetIn: String, encodingIn: String): RDD[String] = {
    val characters = ArrayBuffer[String]()
    for (line <- Source.fromFile(datasetIn, encodingIn).getLines()) {
      characters += line
    }
    sc.parallelize(characters)
  }

  val rdd = this.textfile_encoding(dataset, encoding)
  val separatorCount = rdd.map(_.count(_ == separator.charAt(0))) // error here
  println("here")
}

Then here are the calling statements:

val datasetPath = "/path/to/data" // set this to your dataset's directory
val encoding = "utf8"
val datasetName = "InsuranceFraud"
val datasetSeparator = ";"

val sc = new SparkContext("local", "DataValidation Application")

val dataValidation = new DataValidation(datasetPath, datasetName,
  datasetSeparator, encoding, sc)

The error I get is:

Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@2aa98145)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DataValidation, name: sc, type: class org.apache.spark.SparkContext)
    - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DataValidation, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DataValidation@3d93cd9c)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DataValidation$$anonfun$1, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DataValidation)
    - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$DataValidation$$anonfun$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)
    ... 87 more

Please note the following behavior:

  1. If I change a statement in the class as follows, everything works:

    val separatorCount = rdd.map(_.count(_ == ';'))

  2. If I inline the method, i.e. execute its statements directly rather than through a method, everything also works.
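For what it's worth, the workaround in point 1 generalizes: instead of hard-coding the separator character, you can copy the field into a local val before the map, so the closure captures only that value rather than the whole enclosing class (and hence not the SparkContext). A minimal sketch, using the same names as the code above:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class DataValidation(rdd: RDD[String], separator: String) extends Serializable {
  // Copy the field into a local val: the map closure now captures
  // a plain Char instead of `this`, so nothing non-serializable is dragged in
  val sep = separator.charAt(0)
  val separatorCount = rdd.map(_.count(_ == sep))
}
```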

1 Answer

I've resolved this now.

Since I'm using separator in my map function, Spark attempts to serialize the whole class. However, the method textfile_encoding cannot be serialized, causing the error.

So I've moved this method to a separate class, instantiated that class externally, and passed it into this one.

Now serialization is fine.

When you have this problem, I think you have three solutions:

  1. Do what I did: move the method to a different class.
  2. Write your own closure/serialization. (I don't know how yet.)
  3. Pre-serialize the offending method. (I don't know how yet.)
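A minimal sketch of option 1, under my own naming assumptions (the helper class `Reader` is hypothetical; only the shape matters). The RDD-producing method lives in its own non-serialized helper, so the map closure in DataValidation no longer drags the SparkContext along:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Hypothetical helper: holds the non-serializable work (file I/O, SparkContext)
class Reader(sc: SparkContext, encoding: String) {
  def textfile_encoding(path: String): RDD[String] = {
    val lines = ArrayBuffer[String]()
    for (line <- Source.fromFile(path, encoding).getLines())
      lines += line
    sc.parallelize(lines)
  }
}

// DataValidation now receives the RDD ready-made and stays cheaply serializable
class DataValidation(rdd: RDD[String], separator: String) extends Serializable {
  val sep = separator.charAt(0) // local copy so the closure captures only a Char
  val separatorCount = rdd.map(_.count(_ == sep))
}
```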

Regards

Amer
