
I have a Java application with Spark Maven dependencies; when run, it launches Spark locally on the host machine. The server has 36 cores. I am building a SparkSession where I specify the number of cores and other config properties, but when I check the stats with htop, Spark doesn't seem to use all the cores, just 1.

    SparkSession spark = SparkSession
            .builder()
            .master("local")
            .appName("my-spark")
            .config("spark.driver.memory", "50g")
            .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
            .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
            .config("spark.sql.shuffle.partitions", "400")
            .config("spark.eventLog.enabled", "true")
            .config("spark.eventLog.dir", "/dir1/dir2/logs")
            .config("spark.history.fs.logDirectory", "/dir1/dir2/logs")
            .config("spark.executor.cores", "36")
            .getOrCreate();

I also set these on the JavaSparkContext:

JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
sc.hadoopConfiguration().set("spark.driver.memory","50g");
sc.hadoopConfiguration().set("spark.eventLog.enabled", "true");
sc.hadoopConfiguration().set("spark.eventLog.dir", "/dir1/dir2/logs");
sc.hadoopConfiguration().set("spark.executor.cores", "36");

My task reads data from AWS S3 into a DataFrame and writes it to another bucket.

    Dataset<Row> df = spark.read().format("csv").option("header", "true").load("s3a://bucket/file.csv.gz");
    // df = df.repartition(200);

    df.withColumn("col_name", df.col("col_name"))
      .sort("col_name", "_id")
      .write().format("iceberg").mode("append").save(location);

2 Answers


.gz files are "unsplittable": to decompress them you have to start at byte 0 and read forward. As a result, Spark, Hive, MapReduce, etc. give the whole file to a single worker. If you want parallel processing, use a splittable compression format (e.g. snappy).
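The claim is easy to verify with plain `java.util.zip`, no Spark needed. This sketch (illustrative class and method names, not from the original post) shows that a gzip stream can only be opened from its first byte, which is exactly why a framework cannot hand the middle of a .gz file to a second worker:

```java
import java.io.*;
import java.util.zip.*;

public class GzipSplitDemo {
    // Compress a string to gzip bytes in memory.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes("UTF-8"));
        }
        return bos.toByteArray();
    }

    // Reading from byte 0 works: the gzip header sits at the start.
    static String firstLineFromStart(byte[] compressed) throws IOException {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed))))) {
            return r.readLine();
        }
    }

    // Starting at an arbitrary offset fails: there is no gzip header there,
    // so a worker cannot begin decompressing at a split boundary.
    static boolean readableFromOffset(byte[] compressed, int offset) {
        byte[] tail = new byte[compressed.length - offset];
        System.arraycopy(compressed, offset, tail, 0, tail.length);
        try {
            new GZIPInputStream(new ByteArrayInputStream(tail)).read();
            return true;
        } catch (IOException e) {
            return false; // typically a ZipException: "Not in GZIP format"
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] c = gzip("line1\nline2\nline3\n");
        System.out.println(firstLineFromStart(c));     // line1
        System.out.println(readableFromOffset(c, 10)); // false
    }
}
```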


2 Comments

But if you have many such files, then you should be able to run partially in parallel. Or am I missing something?
Each file can be given to an individual worker, yes. But your example load("s3a://bucket/file.csv.gz") isn't doing that, so it gets a parallelism of 1.
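As the comment says, parallelism across whole files is still possible: in Spark that would mean loading several paths or a glob (e.g. something like load("s3a://bucket/dir/*.csv.gz"), assuming the files share a schema). The per-file behaviour can be sketched with plain Java streams standing in for workers (hypothetical helper names, not Spark API):

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;
import java.util.stream.*;

public class ManyGzipFiles {
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes("UTF-8"));
        }
        return bos.toByteArray();
    }

    static String gunzip(byte[] b) throws IOException {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(b))))) {
            return r.readLine();
        }
    }

    // Each .gz is unsplittable on its own, but separate files can be
    // handled by separate workers, simulated here with parallelStream().
    static List<String> decompressAll(List<byte[]> files) {
        return files.parallelStream().map(f -> {
            try { return gunzip(f); }
            catch (IOException e) { throw new UncheckedIOException(e); }
        }).collect(Collectors.toList()); // collect preserves encounter order
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> files = new ArrayList<>();
        for (String s : new String[]{"a", "b", "c"}) files.add(gzip(s));
        System.out.println(decompressAll(files)); // [a, b, c]
    }
}
```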

You are running Spark in local mode, so spark.executor.cores will not take effect; consider changing .master("local") to .master("local[*]").
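The reason is that in local mode the worker-thread count comes from the master string itself, not from executor settings. A minimal sketch (plain Java, no Spark dependency; the class and method names are illustrative) of what the master string means:

```java
public class MasterConfig {
    // "local"    -> a single worker thread, regardless of executor settings
    // "local[N]" -> N worker threads
    // "local[*]" -> one worker thread per available core, i.e. equivalent to:
    public static String allCoresMaster() {
        return "local[" + Runtime.getRuntime().availableProcessors() + "]";
    }
}
```

So on the 36-core box, .master("local[*]") or .master("local[36]") in the builder should let htop show activity across all cores, once a stage actually has enough partitions to fill them.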

Hope this helps

3 Comments

Thanks for your reply. Still, not all 36 of them are used. As for memory, I am specifying 50 GB because I have 60 GB in total, yet it is using only 30 GB. Does Spark take that config as an upper limit?
From what I know, when you create a Spark session with the builder, you create a 'global' session. You can then create new sessions with the spark.newSession() method. You may need this if you are reading multiple files simultaneously, or the same file repeatedly for different operations: for each file read you can create a new session with newSession(). Each call to newSession() creates a new thread.
@Atihska Yes, that config is an upper limit. For cores, can you check the vcores on the NodeManager UI?
