
I'm getting the Task not serializable error in Spark. I've searched and tried to use a static function as suggested in some posts, but it still gives the same error.

Code is as below:

public class Rating implements Serializable {
    private SparkSession spark;
    private SparkConf sparkConf;
    private JavaSparkContext jsc;
    private static Function<String, Rating> mapFunc;

    public Rating() {
        mapFunc = new Function<String, Rating>() {
            public Rating call(String str) {
                return Rating.parseRating(str);
            }
        };
    }

    public void runProcedure() { 
        sparkConf = new SparkConf().setAppName("Filter Example").setMaster("local");
        jsc = new JavaSparkContext(sparkConf);
        SparkSession spark = SparkSession.builder().master("local").appName("Word Count")
            .config("spark.some.config.option", "some-value").getOrCreate();        

        JavaRDD<Rating> ratingsRDD = spark.read().textFile("sample_movielens_ratings.txt")
                .javaRDD()
                .map(mapFunc);
    }

    public static void main(String[] args) {
        Rating newRating = new Rating();
        newRating.runProcedure();
    }
}

The error is the Task not serializable exception (stack trace screenshot omitted).

How do I solve this error? Thanks in advance.

2 Answers


Rating cannot actually be serialized, because it holds references to Spark structures (e.g. SparkSession, SparkConf, JavaSparkContext) as fields.

The problem here is in

JavaRDD<Rating> ratingsRDD = spark.read().textFile("sample_movielens_ratings.txt")
            .javaRDD()
            .map(mapFunc);

If you look at the definition of mapFunc, you're returning a Rating object.

mapFunc = new Function<String, Rating>() {
    public Rating call(String str) {
        return Rating.parseRating(str);
    }
};

This function is used inside a map (a transformation, in Spark terms). Because transformations run on the worker nodes rather than on the driver, their code, and everything it references, must be serializable. The anonymous Function captures the enclosing Rating instance, so Spark tries to serialize the Rating class as well, and that fails.

Try extracting the data you need from Rating and placing it in a separate class that does not hold any Spark structures, then use that new class as the return type of your mapFunc function.
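A minimal sketch of that split could look like the following. The field names, the RatingJob class, and the "userId::movieId::rating::timestamp" line format are assumptions for illustration, not taken from your code:

// Rating.java -- plain data class, no Spark handles, serializes cleanly
import java.io.Serializable;

public class Rating implements Serializable {
    private final int userId;
    private final int movieId;
    private final float score;

    public Rating(int userId, int movieId, float score) {
        this.userId = userId;
        this.movieId = movieId;
        this.score = score;
    }

    // Assumes lines like "userId::movieId::rating::timestamp"
    public static Rating parseRating(String str) {
        String[] fields = str.split("::");
        return new Rating(Integer.parseInt(fields[0]),
                          Integer.parseInt(fields[1]),
                          Float.parseFloat(fields[2]));
    }
}

// RatingJob.java -- driver-side class that owns the Spark structures
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

public class RatingJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("Filter Example")
                .getOrCreate();

        JavaRDD<Rating> ratingsRDD = spark.read()
                .textFile("sample_movielens_ratings.txt")
                .javaRDD()
                .map(Rating::parseRating);   // only Rating instances are serialized

        System.out.println(ratingsRDD.count());
        spark.stop();
    }
}

With this layout, only the small Rating data objects cross the driver/executor boundary; all Spark handles stay in the driver class and are never part of a closure.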


1 Comment

Separating Rating and the procedure into two classes worked! Thanks :)

In addition, make sure not to keep non-serializable variables such as JavaSparkContext and SparkSession as ordinary fields in your class. If you do need to keep them, declare them like this:

private transient JavaSparkContext sparkCtx;
private transient SparkSession spark;
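Applied to the class from the question, the field declarations would become roughly this (a sketch; only the declarations change, the rest of the class stays as it is):

public class Rating implements Serializable {
    // transient fields are skipped by Java serialization, so the closure
    // that captures this Rating instance can be shipped to the executors.
    // Note: they come back as null on the workers, so only use them in
    // driver-side code.
    private transient SparkSession spark;
    private transient SparkConf sparkConf;
    private transient JavaSparkContext jsc;

    // ... rest of the class unchanged ...
}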

Good luck.

1 Comment

This little nugget just saved me. Until seeing this, I just couldn't get a clear idea of what was going wrong. So after working in circles for a good day, I finally get it. Thanks
