27 questions
1 vote · 1 answer · 60 views
PySpark DataFrame repartition strategy
Let's say I have a very large dataframe df, and two smaller dataframes df1 and df2.
df is joined with df1 on key1, and with df2 on key2 and key3.
Now I know salting ...
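Not the asker's actual code, but a minimal sketch of pre-join repartitioning on the join keys, with toy dataframes standing in for df, df1 and df2; salting for skewed keys would be layered on top of this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the question's large df and the two smaller dataframes.
df  = spark.createDataFrame([(1, 10, "a"), (2, 20, "b")], ["key1", "key2", "key3"])
df1 = spark.createDataFrame([(1, "x")], ["key1", "v1"])
df2 = spark.createDataFrame([(10, "a", "y")], ["key2", "key3", "v2"])

# Repartitioning the large side on the keys it is about to be joined on
# co-locates matching rows; it mainly pays off when the same keyed layout
# can be reused across several operations.
step1 = df.repartition("key1").join(df1, "key1")
step2 = step1.repartition("key2", "key3").join(df2, ["key2", "key3"])
step2.show()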
0 votes · 1 answer · 83 views
Does rdd.getNumPartitions() always report the correct partition count before an action?
Spark is lazily evaluated, so how does rdd.getNumPartitions() return the correct partition count BEFORE an action is called?
df1 = read_file('s3file1')
df2 = read_file('file2')
print('df1 ...
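A hedged illustration of why this works: getNumPartitions() inspects the partitioning of the (still lazy) plan rather than any materialized data, so no action is needed. spark.range stands in for the question's read_file calls to keep the snippet self-contained:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1000)            # nothing has been computed yet
print(df.rdd.getNumPartitions())     # partition count taken from the plan, no action

df2 = df.repartition(8)              # still lazy
print(df2.rdd.getNumPartitions())    # 8, again read from the plan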
0 votes · 1 answer · 80 views
AWS Glue 3.0: Partition count changing by itself even after repartition
I have a job running on AWS Glue 3.0 with G.8X workers, using a 100-worker configuration.
In recent runs, count() was causing OOM, and I figured repartitioning might help.
I read we have to keep ...
1 vote · 0 answers · 90 views
Selecting a Dataproc cluster size with autoscaling on
I am new to GCP and have a probably very basic question. We run our PySpark jobs on an ephemeral Dataproc cluster with autoscaling enabled. In our code we have used ...
0 votes · 0 answers · 320 views
Last Spark task taking forever to complete
I am running a Spark job and for the most part it runs fast, but it gets stuck on the last task of one of the stages. I can see there is a lot more shuffle read/rows for that task, and tried a ...
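One hedged first step for a straggler like this is to check whether a handful of keys dominate the shuffle; the join_key column name below is purely hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data with a deliberately skewed key distribution.
df = spark.createDataFrame([("hot",)] * 1000 + [("cold",)] * 10, ["join_key"])

# Keys with disproportionately many rows are the ones that collapse into
# the single long-running task after a shuffle on that key.
df.groupBy("join_key").count().orderBy(F.desc("count")).show(10)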
1 vote · 0 answers · 378 views
Spark SQL repartition before insert operation
Suppose we are using Spark on top of Hive, specifically the SQL API.
Now suppose we have a table A with two partition columns, part1 and part2, and that we are insert-overwriting into A with dynamic ...
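A hedged sketch of the pattern often used in the SQL API for this situation: a DISTRIBUTE BY on the dynamic partition columns so each writing task receives whole (part1, part2) groups rather than a sliver of every partition. The table name A and the partition columns follow the question; source_table is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Required for fully dynamic partition overwrite in the Hive dialect.
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# DISTRIBUTE BY shuffles rows so that each task writes complete partitions,
# which typically avoids producing many small files per partition.
spark.sql("""
    INSERT OVERWRITE TABLE A PARTITION (part1, part2)
    SELECT * FROM source_table
    DISTRIBUTE BY part1, part2
""")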
0 votes · 0 answers · 45 views
Spark SQL correlated subquery not identifying parent columns
I am trying to migrate a query from SQL Server to Spark SQL. It runs fine on SQL Server but has issues in Spark SQL. I found that Spark SQL does not support such subqueries, but I am trying to avoid ...
5 votes · 2 answers · 8k views
Shuffle map stage failure with indeterminate output: eliminate the indeterminacy by checkpointing the RDD before repartition
I'm running into an issue with a Spark job that fails roughly every 2nd time with the following error message:
org.apache.spark.SparkException: Job aborted due to stage failure: A
shuffle map stage ...
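A minimal sketch of the workaround the error message itself points at, i.e. checkpointing before the repartition so a retried shuffle map stage re-reads identical input instead of recomputing a possibly non-deterministic upstream result; the checkpoint directory is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder location

df = spark.range(0, 1_000_000)

# checkpoint() materializes the data and truncates the lineage.
df = df.checkpoint()
df = df.repartition(200)
df.count()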
0 votes · 2 answers · 644 views
repartition in memory vs file
repartition() creates partitions in memory and is used during reading/processing; partitionBy() creates partitions on disk and is used as a write operation.
How can we confirm there are multiple files in ...
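A small sketch, using a throwaway local path, of one way to confirm the on-disk effect: partitionBy() creates a sub-directory per column value, and the part files inside those directories are what can be counted:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "grp"])

out = "/tmp/partition_by_demo"                       # throwaway local path
df.write.mode("overwrite").partitionBy("grp").parquet(out)

# Each distinct grp value becomes a directory such as grp=a/ or grp=b/,
# and the part-*.parquet files inside are the per-task output files.
for root, _, files in os.walk(out):
    for f in files:
        print(os.path.join(root, f))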
1 vote · 0 answers · 423 views
Hanging Task in Databricks
I am applying a pandas UDF to a grouped dataframe in Databricks. When I do this, a couple of tasks hang forever, while the rest complete quickly.
I start by repartitioning my dataset so that each group ...
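A hedged sketch of the grouped pandas UDF pattern being described, with invented column names; note that applyInPandas shuffles by the grouping key anyway, and a single oversized group inevitably becomes a single long (or hanging) task:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("g1", 1.0), ("g1", 2.0), ("g2", 3.0)], ["group_id", "x"])

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # A whole group arrives as one pandas DataFrame on one task.
    return pd.DataFrame({"group_id": [pdf["group_id"].iloc[0]],
                         "mean_x": [pdf["x"].mean()]})

result = (df.repartition("group_id")
            .groupBy("group_id")
            .applyInPandas(summarize, schema="group_id string, mean_x double"))
result.show()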
0 votes · 1 answer · 2k views
If I repartition by a column, does Spark understand that the data is partitioned by that column when it is read back?
I have a requirement where I have a huge dataset of over 2 trillion records, the result of some join. Post this join, I need to aggregate on a column (the 'id' column) and get a list of ...
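A short sketch of the distinction that usually decides this: repartition() only shapes the in-memory layout of the current job, while partitionBy() leaves a directory layout a later read can see (useful for pruning, though not for shuffle-free aggregation). Paths and column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Plain repartition + write: the files carry no partitioning metadata,
# so a fresh read knows nothing about how the data was distributed.
df.repartition("id").write.mode("overwrite").parquet("/tmp/plain_write")

# partitionBy: the id=.../ directory layout is visible to the reader and
# enables partition pruning on filters over that column.
df.write.mode("overwrite").partitionBy("id").parquet("/tmp/partitioned_write")

spark.read.parquet("/tmp/partitioned_write").filter("id = 1").explain()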
0 votes · 1 answer · 497 views
How to export SQL files in Synapse to a sandbox environment or directly access these SQL files via notebooks?
Is it possible to export published SQL files in your Synapse workspace to your sandbox environment via code, without the use of pipelines?
If not, is it somehow possible to access your published SQL ...
0 votes · 0 answers · 162 views
Slow PySpark performance reading a large fixed-width file with long lines and converting it to a structured format
I am trying to convert a fairly large (34 GB) fixed-width file into a structured format using PySpark, but my job takes too long to complete (almost 10+ hours). The file has long lines of almost 50K characters ...
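A hedged sketch of one common approach for fixed-width data: read each line as a single text column and slice fields with substring, so long lines are cut without a per-row Python UDF. The offsets, field names and the toy line are all invented:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy one-line stand-in for spark.read.text("/path/to/fixed_width_file").
raw = spark.createDataFrame([("AAAAAAAAAABBBBBCCCCCCCC",)], ["value"])

parsed = raw.select(
    F.substring("value", 1, 10).alias("field_a"),   # positions 1-10
    F.substring("value", 11, 5).alias("field_b"),   # positions 11-15
    F.substring("value", 16, 8).alias("field_c"),   # positions 16-23
)
parsed.show()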
0 votes · 1 answer · 1k views
Spark number of input partitions vs number of reading tasks
Can someone explain to me how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and the number of cores?
I have a dataset (91 MB) ...
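A hedged sketch of the knob usually involved for file sources: spark.sql.files.maxPartitionBytes caps how much input one read task gets, so a 91 MB file under the 128 MB default typically becomes a single task; the path is a placeholder:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # default is 128 MB; lowering it forces more, smaller read tasks
         .config("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
         .getOrCreate())

df = spark.read.parquet("/path/to/91mb_dataset")   # placeholder path
print(df.rdd.getNumPartitions())                   # roughly ceil(91 MB / 32 MB) read tasks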
2 votes · 1 answer · 4k views
Understanding spark.default.parallelism
As per the documentation:
spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user
spark.default....
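A short illustration of where that default actually applies: RDD APIs such as parallelize pick it up, while DataFrame shuffles are sized by spark.sql.shuffle.partitions instead, which is a frequent source of confusion around this setting:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")                           # assumed local session for the demo
         .config("spark.default.parallelism", "6")
         .getOrCreate())

# spark.default.parallelism governs parallelize when no numSlices is given;
# DataFrame shuffles use spark.sql.shuffle.partitions instead.
rdd = spark.sparkContext.parallelize(range(100))
print(rdd.getNumPartitions())   # 6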
3 votes · 1 answer · 1k views
What is the difference between spark.sql.shuffle.partitions and repartition() in Spark?
What I understand is: when we repartition any dataframe to n partitions, the data remains on those n partitions until you hit a shuffle stage or another repartition or coalesce.
For ...
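A brief sketch of the difference usually drawn here: spark.sql.shuffle.partitions is the default width of any shuffle introduced by wide transformations, while repartition(n) is an explicit shuffle to exactly n partitions at that point in the plan (AQE is disabled below only to keep the printed counts predictable):

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .config("spark.sql.shuffle.partitions", "50")
         .config("spark.sql.adaptive.enabled", "false")
         .getOrCreate())

df = spark.range(1000)

print(df.repartition(10).rdd.getNumPartitions())                    # 10: explicit repartition
print(df.groupBy(F.col("id") % 7).count().rdd.getNumPartitions())   # 50: implicit shuffle width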
0 votes · 1 answer · 306 views
Spark repartition issue for file size
I need to merge small Parquet files.
I have multiple small Parquet files in HDFS, and I would like to combine them into files of roughly 128 MB each.
So I read all the files using spark.read()
And did ...
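A hedged sketch of the size-based arithmetic commonly used for this: estimate the total input size, divide by the 128 MB target, and repartition to that count before writing. The paths are placeholders and the total size is assumed rather than measured:

import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src, dst = "/data/small_parquet/", "/data/merged_parquet/"   # placeholder paths
df = spark.read.parquet(src)

total_mb = 4096    # assumed total input size; measure it, e.g. with hdfs dfs -du -s
target_mb = 128
num_files = max(1, math.ceil(total_mb / target_mb))

df.repartition(num_files).write.mode("overwrite").parquet(dst)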
0 votes · 0 answers · 362 views
Join 2 large tables (50 GB and 1 billion records)
I have 2 very large tables that I am loading as dataframes in Parquet format, with one join key. The issues I need help with:
I need to tune the job, as I am getting OOM errors due to Java heap space.
...
0 votes · 1 answer · 412 views
How to improve the performance of Spark repartition with column expressions
I have a performance problem with the repartition and partitionBy operations in Spark.
My df contains monthly data and I am partitioning it as daily data using a dailyDt column. My code is like below.
First ...
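A minimal sketch of the combination often suggested for this layout, reusing the question's dailyDt column name: repartition on the same column you partitionBy, so each day's directory is written by a small number of tasks instead of every task writing a little of every day. The output path and the toy data are invented:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy month of data; dailyDt plays the role of the question's daily partition column.
df = spark.range(0, 10_000).withColumn(
    "dailyDt", F.expr("date_add(date'2024-01-01', cast(id % 30 as int))")
)

(df.repartition("dailyDt")          # co-locate each day's rows before writing
   .write.mode("overwrite")
   .partitionBy("dailyDt")          # one directory per day on disk
   .parquet("/tmp/daily_output"))   # placeholder path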
0 votes · 1 answer · 260 views
How to read parquet files using only one thread on a worker/task node?
In Spark, if we execute the following command:
spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
.show(...
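If the goal is a single task for a single file, a hedged option is to coalesce the read down to one partition, or raise spark.sql.files.maxPartitionBytes above the file size so it is not split; the path below is a placeholder for the file in the question:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # a file smaller than this cap is read as one input partition
         .config("spark.sql.files.maxPartitionBytes", str(1024 * 1024 * 1024))
         .getOrCreate())

df = spark.read.parquet("/path/to/one_file.snappy.parquet")   # placeholder path
df.coalesce(1).show(5)    # at most one task downstream of the read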
1 vote · 0 answers · 254 views
How can I reduce the number of Spark tasks when I run a Spark job?
Here are my Spark job stages:
It has 260,000 tasks because the job relies on more than 200,000 small HDFS files, each file about 50 MB and stored in gzip format.
I tried using the following settings ...
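A hedged sketch of the file-packing settings usually tried first for this: raising maxPartitionBytes and openCostInBytes lets Spark bin many small files into one read task (each gzip file still stays whole inside a task, since gzip is not splittable). Path and file format are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
         .config("spark.sql.files.openCostInBytes", str(8 * 1024 * 1024))
         .getOrCreate())

df = spark.read.text("/path/to/many_small_gz_files/")   # placeholder path and format
print(df.rdd.getNumPartitions())    # should be far smaller than the file count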
0 votes · 1 answer · 969 views
How to choose the optimal repartition value in Spark
I have 3 input files:
File1 - 27 GB
File2 - 3 GB
File3 - 12 MB
My cluster configuration:
2 executors
Each executor has 2 cores
Executor memory - 13 GB (2 GB overhead)
The transformation that I'm going to ...
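A short worked version of the arithmetic usually applied, taking the question's numbers as given: 2 executors x 2 cores gives 4 concurrent tasks, and a size-based target of roughly total input / 128 MB keeps individual partitions modest; a common heuristic is the larger of the two:

import math

cores_total = 2 * 2                        # 2 executors x 2 cores
input_mb = 27 * 1024 + 3 * 1024 + 12       # File1 + File2 + File3, in MB
size_based = math.ceil(input_mb / 128)     # about 241 partitions of ~128 MB
parallelism_based = cores_total * 3        # 2-4x the core count is a common rule of thumb

num_partitions = max(size_based, parallelism_based)
print(num_partitions)                      # 241 with these inputs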
0 votes · 0 answers · 117 views
Using repartition in PySpark for a huge set of data
I have a huge amount of data in a few Oracle tables (the total size of the data in these tables is around 50 GB). I have to perform joins and some calculations, and these tables don't have any ...
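A hedged sketch of parallelizing the Oracle reads themselves via the JDBC partitioning options, since repartitioning after a single-threaded JDBC read does not speed up the read. The connection details, table name and bounds column are all invented:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")   # placeholder
      .option("dbtable", "SCHEMA.BIG_TABLE")                      # placeholder
      .option("user", "app_user")                                 # placeholder
      .option("password", "secret")                               # placeholder
      .option("partitionColumn", "numeric_id")    # must be a numeric/date/timestamp column
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")
      .load())

print(df.rdd.getNumPartitions())   # 32 parallel JDBC reads instead of 1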
2 votes · 1 answer · 1k views
Apache Spark - passing a JDBC connection object to executors
I am creating a JDBC connection object in the Spark driver and using it in the executors to access the DB. My concern is: is it the same connection object, or do the executors get a copy of the connection object ...
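A minimal sketch of the usual pattern in code form: a driver-side connection object cannot be shipped to executors as-is (task closures are serialized), so the connection is opened inside foreachPartition on the executor. sqlite3 is only a runnable stand-in for a real database client:

import sqlite3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100)

def write_partition(rows):
    # One connection per partition, created on the executor itself.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER)")
    cur.executemany("INSERT INTO t VALUES (?)", [(r.id,) for r in rows])
    conn.commit()
    conn.close()

df.foreachPartition(write_partition)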
1 vote · 1 answer · 579 views
How does PySpark repartition work without a column name specified?
There are two dataframes, df and df1.
Then, let's consider 3 cases:
df1 only has the same number of rows as df
df1 has the same number of rows as df and the same number of partitions as df. Think df....
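A small sketch of the behaviour at stake in all three cases: repartition(n) with no column is a round-robin shuffle purely by count, so it attaches no hash partitioner and no relationship to any other dataframe; the numbers are arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 12)

# No column given: rows are redistributed round-robin across exactly 4 partitions.
out = df.repartition(4)
out.groupBy(spark_partition_id().alias("pid")).count().show()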
2 votes · 2 answers · 988 views
Apache Spark: what happens when repartition($"key") is called and the size of all records per key is greater than the size of a single partition?
Suppose I have a 10 GB dataframe where one of the columns, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). Suppose I call repartition($"...
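A brief PySpark rendering of the scenario: hash partitioning on a column whose value is identical for every record sends all rows to one partition, regardless of the 128 MB default, which is the crux of the question. The data volume is shrunk to keep the example runnable:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, spark_partition_id

spark = SparkSession.builder.getOrCreate()

# Toy version of the 10 GB dataframe: c1 has the same value for every record.
df = spark.range(0, 10_000).withColumn("c1", lit("same_value"))

(df.repartition(8, "c1")                              # hash partition on c1
   .groupBy(spark_partition_id().alias("pid"))
   .count()
   .show())                                           # a single pid holds every row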
3 votes · 3 answers · 3k views
Can coalesce() increase the number of partitions of a Spark DataFrame?
I am trying to understand the difference between coalesce() and repartition().
If I correctly understood this answer, coalesce() can only reduce the number of partitions of a dataframe, and if we try to ...
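A two-line check of exactly this behaviour: coalesce() only merges existing partitions, so asking it for more than the current count leaves the count unchanged, while repartition() shuffles to the requested number:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 100).repartition(4)

print(df.coalesce(8).rdd.getNumPartitions())      # still 4: coalesce cannot add partitions
print(df.repartition(8).rdd.getNumPartitions())   # 8: repartition shuffles to the new count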