
I am getting started with Spark and ran into an issue trying to implement the simple example for the map function. The issue is with the definition of parallelize in the newer version of Spark. Can someone share an example of how to use it? The following code gives a compile error about insufficient arguments.

Spark version: 2.3.2, Java: 1.8

SparkSession session = SparkSession.builder()
        .appName("Compute Square of Numbers")
        .config("spark.master", "local")
        .getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
JavaRDD<Integer> numRDD = context.parallelize(seqNumList, 2);

Compile-time error message: the method expects 3 arguments

I do not understand what the third argument should be. According to the documentation, it is supposed to be

scala.reflect.ClassTag<T>

But how do I define or use one in Java?

Please do not suggest using JavaSparkContext; I want to know how to make this approach work with the generic SparkContext.

Ref : https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#parallelize-scala.collection.Seq-int-scala.reflect.ClassTag-

  • what is the issue? Or shall we guess? Commented Nov 4, 2018 at 16:10
  • did you read the JavaDoc?? the method takes 3 arguments: spark.apache.org/docs/2.3.0/api/java/org/apache/spark/… Commented Nov 4, 2018 at 16:11
  • scala.reflect.ClassTag$.MODULE$.apply(Integer.class); Commented Nov 4, 2018 at 16:13
  • and it also needs a Scala sequence: JavaConversions.asScalaBuffer(seqNumList) - honestly, I don't know why you want to call this method in Java ... that's weird Commented Nov 4, 2018 at 16:14
  • @AKSW thanks for the ClassTag comment. That is what I was looking for. I understand that this is not the optimal way, but I wanted to understand how to use parallelize with SparkContext, since that is the new uniform way of accessing Spark processing. Or do you suggest falling back to JavaSparkContext for these tasks still? Commented Nov 4, 2018 at 16:16

2 Answers


Here is the code that finally worked for me. It is not the best way to achieve the result, but it was a way for me to explore the API:

SparkSession session = SparkSession.builder()
        .appName("Compute Square of Numbers")
        .config("spark.master", "local")
        .getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());

RDD<Integer> numRDD = context.parallelize(
        JavaConverters.asScalaIteratorConverter(seqNumList.iterator()).asScala().toSeq(),
        2,
        scala.reflect.ClassTag$.MODULE$.apply(Integer.class));

numRDD.toJavaRDD().foreach(x -> System.out.println(x));
session.stop();
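If the conversion and ClassTag boilerplate gets repetitive, it can be factored into a small helper. This is a hypothetical wrapper (the class RddHelper and its method are my own, not part of Spark), assuming spark-core and scala-library are on the classpath:

```java
import java.util.List;
import org.apache.spark.SparkContext;
import org.apache.spark.rdd.RDD;
import scala.collection.JavaConverters;
import scala.reflect.ClassTag;

public final class RddHelper {
    // Hypothetical wrapper: hides the Scala Seq conversion and ClassTag
    // so callers can pass a plain Java List to SparkContext.parallelize.
    static <T> RDD<T> parallelize(SparkContext sc, List<T> data, int slices, Class<T> cls) {
        ClassTag<T> tag = scala.reflect.ClassTag$.MODULE$.apply(cls);
        return sc.parallelize(
                JavaConverters.asScalaIteratorConverter(data.iterator()).asScala().toSeq(),
                slices, tag);
    }
}
```

With this, the call site shrinks to `RddHelper.parallelize(context, seqNumList, 2, Integer.class)`.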



If you don't want to deal with providing the extra two parameters to SparkContext.parallelize, you can also use JavaSparkContext.parallelize(), which only needs the input list:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
final RDD<Integer> rdd = jsc.parallelize(seqNumList).map(num -> {
    return num; // your transformation; the lambda must return a value
}).rdd();
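For completeness, here is a minimal end-to-end sketch of this approach, assuming the same local session setup as in the question; the squaring lambda stands in for "your implementation", and the class name SquareNumbers is my own:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public final class SquareNumbers {
    // Squares each number via a JavaSparkContext-backed RDD:
    // no Scala Seq conversion and no ClassTag needed.
    static List<Integer> squaresOf(JavaSparkContext jsc, List<Integer> nums) {
        return jsc.parallelize(nums, 2).map(n -> n * n).collect();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Compute Square of Numbers")
                .config("spark.master", "local")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
        squaresOf(jsc, seqNumList).forEach(System.out::println);

        spark.stop();
    }
}
```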

