27 questions
1 vote · 1 answer · 60 views
PySpark DataFrame repartition strategy
Let's say I have a very large dataframe df, and two smaller dataframes df1 and df2.
df is joined with df1 on key1, and with df2 on key2 and key3.
Now I know salting ...
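Not the asker's actual code, but a minimal sketch of pre-join repartitioning on the join keys, with toy dataframes standing in for df, df1 and df2; salting for skewed keys would be layered on top of this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the question's large df and the two smaller dataframes.
df  = spark.createDataFrame([(1, 10, "a"), (2, 20, "b")], ["key1", "key2", "key3"])
df1 = spark.createDataFrame([(1, "x")], ["key1", "v1"])
df2 = spark.createDataFrame([(10, "a", "y")], ["key2", "key3", "v2"])

# Repartitioning the large side on the keys it is about to be joined on
# co-locates matching rows; it mainly pays off when the same keyed layout
# can be reused across several operations.
step1 = df.repartition("key1").join(df1, "key1")
step2 = step1.repartition("key2", "key3").join(df2, ["key2", "key3"])
step2.show()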
0 votes · 1 answer · 83 views
Does rdd.getNumPartitions() always report the correct partition count before an action?
Spark is lazily evaluated, so how does rdd.getNumPartitions() return the correct partition count BEFORE an action is called?
df1 = read_file('s3file1')
df2 = read_file('file2')
print('df1 ...
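A hedged illustration of why this works: getNumPartitions() inspects the partitioning of the (still lazy) plan rather than any materialized data, so no action is needed. spark.range stands in for the question's read_file calls to keep the snippet self-contained:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1000)            # nothing has been computed yet
print(df.rdd.getNumPartitions())     # partition count taken from the plan, no action

df2 = df.repartition(8)              # still lazy
print(df2.rdd.getNumPartitions())    # 8, again read from the plan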
0 votes · 1 answer · 80 views
AWS Glue 3.0: Partition count changing by itself even after repartition
I have a job running on AWS Glue 3.0 with G.8X workers, using a 100-worker configuration.
In recent runs, count() was causing OOM, and I figured repartitioning might help.
I read we have to keep ...
1 vote · 0 answers · 90 views
Selecting a Dataproc cluster size with autoscaling on
I am new to GCP and have a probably very basic question. We run our PySpark jobs on an ephemeral Dataproc cluster with autoscaling enabled. In our code we have used ...
0 votes · 0 answers · 320 views
Last Spark task taking forever to complete
I am running a Spark job and for the most part it runs fast, but it gets stuck on the last task of one of the stages. I can see there is a lot more shuffle read/rows for that task, and tried a ...
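One hedged first step for a straggler like this is to check whether a handful of keys dominate the shuffle; the join_key column name below is purely hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data with a deliberately skewed key distribution.
df = spark.createDataFrame([("hot",)] * 1000 + [("cold",)] * 10, ["join_key"])

# Keys with disproportionately many rows are the ones that collapse into
# the single long-running task after a shuffle on that key.
df.groupBy("join_key").count().orderBy(F.desc("count")).show(10)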
1 vote · 0 answers · 378 views
Spark SQL repartition before insert operation
Suppose we are using Spark on top of Hive, specifically the SQL API.
Now suppose we have a table A with two partition columns, part1 and part2, and that we are insert-overwriting into A with dynamic ...
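A hedged sketch of the pattern often used in the SQL API for this situation: a DISTRIBUTE BY on the dynamic partition columns so each writing task receives whole (part1, part2) groups rather than a sliver of every partition. The table name A and the partition columns follow the question; source_table is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Required for fully dynamic partition overwrite in the Hive dialect.
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# DISTRIBUTE BY shuffles rows so that each task writes complete partitions,
# which typically avoids producing many small files per partition.
spark.sql("""
    INSERT OVERWRITE TABLE A PARTITION (part1, part2)
    SELECT * FROM source_table
    DISTRIBUTE BY part1, part2
""")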
0 votes · 0 answers · 45 views
Spark SQL correlated subquery not identifying parent columns
I am trying to migrate a query from SQL Server to Spark SQL. It runs fine on SQL Server but has issues in Spark SQL. I found that Spark SQL does not support such subqueries, but I am trying to avoid ...
5 votes · 2 answers · 8k views
Shuffle map stage failure with indeterminate output: eliminate the indeterminacy by checkpointing the RDD before repartition
I'm running into an issue with a Spark job that fails roughly every 2nd time with the following error message:
org.apache.spark.SparkException: Job aborted due to stage failure: A
shuffle map stage ...
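A minimal sketch of the workaround the error message itself points at, i.e. checkpointing before the repartition so a retried shuffle map stage re-reads identical input instead of recomputing a possibly non-deterministic upstream result; the checkpoint directory is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder location

df = spark.range(0, 1_000_000)

# checkpoint() materializes the data and truncates the lineage.
df = df.checkpoint()
df = df.repartition(200)
df.count()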
0 votes · 2 answers · 644 views
repartition in memory vs file
repartition() creates partitions in memory and is used during reading/processing; partitionBy() creates partitions on disk and is used as a write operation.
How can we confirm there are multiple files in ...
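A small sketch, using a throwaway local path, of one way to confirm the on-disk effect: partitionBy() creates a sub-directory per column value, and the part files inside those directories are what can be counted:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "grp"])

out = "/tmp/partition_by_demo"                       # throwaway local path
df.write.mode("overwrite").partitionBy("grp").parquet(out)

# Each distinct grp value becomes a directory such as grp=a/ or grp=b/,
# and the part-*.parquet files inside are the per-task output files.
for root, _, files in os.walk(out):
    for f in files:
        print(os.path.join(root, f))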
1 vote · 0 answers · 423 views
Hanging Task in Databricks
I am applying a pandas UDF to a grouped dataframe in Databricks. When I do this, a couple of tasks hang forever, while the rest complete quickly.
I start by repartitioning my dataset so that each group ...
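A hedged sketch of the grouped pandas UDF pattern being described, with invented column names; note that applyInPandas shuffles by the grouping key anyway, and a single oversized group inevitably becomes a single long (or hanging) task:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("g1", 1.0), ("g1", 2.0), ("g2", 3.0)], ["group_id", "x"])

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # A whole group arrives as one pandas DataFrame on one task.
    return pd.DataFrame({"group_id": [pdf["group_id"].iloc[0]],
                         "mean_x": [pdf["x"].mean()]})

result = (df.repartition("group_id")
            .groupBy("group_id")
            .applyInPandas(summarize, schema="group_id string, mean_x double"))
result.show()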
0 votes · 1 answer · 2k views
If I repartition by a column, does Spark understand that the data is partitioned by that column when it is read back?
I have a requirement where I have a huge dataset of over 2 trillion records, the result of some join. Post this join, I need to aggregate on a column (the 'id' column) and get a list of ...
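A short sketch of the distinction that usually decides this: repartition() only shapes the in-memory layout of the current job, while partitionBy() leaves a directory layout a later read can see (useful for pruning, though not for shuffle-free aggregation). Paths and column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Plain repartition + write: the files carry no partitioning metadata,
# so a fresh read knows nothing about how the data was distributed.
df.repartition("id").write.mode("overwrite").parquet("/tmp/plain_write")

# partitionBy: the id=.../ directory layout is visible to the reader and
# enables partition pruning on filters over that column.
df.write.mode("overwrite").partitionBy("id").parquet("/tmp/partitioned_write")

spark.read.parquet("/tmp/partitioned_write").filter("id = 1").explain()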
0 votes · 1 answer · 497 views
How to export SQL files in Synapse to a sandbox environment or directly access these SQL files via notebooks?
Is it possible to export published SQL files in your Synapse workspace to your sandbox environment via code, without the use of pipelines?
If not, is it somehow possible to access your published SQL ...
0 votes · 0 answers · 162 views
Slow PySpark performance reading a large fixed-width file with long lines and converting it to a structured format
I am trying to convert a fairly large (34 GB) fixed-width file into a structured format using PySpark, but my job takes too long to complete (almost 10+ hours). The file has long lines of almost 50K characters ...
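A hedged sketch of one common approach for fixed-width data: read each line as a single text column and slice fields with substring, so long lines are cut without a per-row Python UDF. The offsets, field names and the toy line are all invented:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy one-line stand-in for spark.read.text("/path/to/fixed_width_file").
raw = spark.createDataFrame([("AAAAAAAAAABBBBBCCCCCCCC",)], ["value"])

parsed = raw.select(
    F.substring("value", 1, 10).alias("field_a"),   # positions 1-10
    F.substring("value", 11, 5).alias("field_b"),   # positions 11-15
    F.substring("value", 16, 8).alias("field_c"),   # positions 16-23
)
parsed.show()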
0 votes · 1 answer · 1k views
Spark number of input partitions vs number of reading tasks
Can someone explain to me how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and the number of cores?
I have a dataset (91 MB) ...
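A hedged sketch of the knob usually involved for file sources: spark.sql.files.maxPartitionBytes caps how much input one read task gets, so a 91 MB file under the 128 MB default typically becomes a single task; the path is a placeholder:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # default is 128 MB; lowering it forces more, smaller read tasks
         .config("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
         .getOrCreate())

df = spark.read.parquet("/path/to/91mb_dataset")   # placeholder path
print(df.rdd.getNumPartitions())                   # roughly ceil(91 MB / 32 MB) read tasks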
2 votes · 1 answer · 4k views
Understanding spark.default.parallelism
As per the documentation:
spark.default.parallelism: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user
spark.default....
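A short illustration of where that default actually applies: RDD APIs such as parallelize pick it up, while DataFrame shuffles are sized by spark.sql.shuffle.partitions instead, which is a frequent source of confusion around this setting:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")                           # assumed local session for the demo
         .config("spark.default.parallelism", "6")
         .getOrCreate())

# spark.default.parallelism governs parallelize when no numSlices is given;
# DataFrame shuffles use spark.sql.shuffle.partitions instead.
rdd = spark.sparkContext.parallelize(range(100))
print(rdd.getNumPartitions())   # 6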
3 votes · 1 answer · 1k views
What is the difference between spark.sql.shuffle.partitions and repartition() in Spark?
What I understand is: when we repartition any dataframe to n partitions, the data remains on those n partitions until you hit a shuffle stage or another repartition or coalesce.
For ...
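A brief sketch of the difference usually drawn here: spark.sql.shuffle.partitions is the default width of any shuffle introduced by wide transformations, while repartition(n) is an explicit shuffle to exactly n partitions at that point in the plan (AQE is disabled below only to keep the printed counts predictable):

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .config("spark.sql.shuffle.partitions", "50")
         .config("spark.sql.adaptive.enabled", "false")
         .getOrCreate())

df = spark.range(1000)

print(df.repartition(10).rdd.getNumPartitions())                    # 10: explicit repartition
print(df.groupBy(F.col("id") % 7).count().rdd.getNumPartitions())   # 50: implicit shuffle width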
0 votes · 1 answer · 306 views
Spark repartition issue for file size
I need to merge small Parquet files.
I have multiple small Parquet files in HDFS, and I would like to combine them into files of roughly 128 MB each.
So I read all the files using spark.read()
And did ...
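A hedged sketch of the size-based arithmetic commonly used for this: estimate the total input size, divide by the 128 MB target, and repartition to that count before writing. The paths are placeholders and the total size is assumed rather than measured:

import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src, dst = "/data/small_parquet/", "/data/merged_parquet/"   # placeholder paths
df = spark.read.parquet(src)

total_mb = 4096    # assumed total input size; measure it, e.g. with hdfs dfs -du -s
target_mb = 128
num_files = max(1, math.ceil(total_mb / target_mb))

df.repartition(num_files).write.mode("overwrite").parquet(dst)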
0 votes · 0 answers · 362 views
Join 2 large tables (50 GB and 1 billion records)
I have 2 very large tables that I am loading as dataframes in Parquet format, with one join key. The issues I need help with:
I need to tune the job, as I am getting OOM errors due to Java heap space.
...
0 votes · 1 answer · 412 views
How to improve the performance of Spark repartition with column expressions
I have a performance problem with the repartition and partitionBy operations in Spark.
My df contains monthly data and I am partitioning it as daily data using a dailyDt column. My code is like below.
First ...
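A minimal sketch of the combination often suggested for this layout, reusing the question's dailyDt column name: repartition on the same column you partitionBy, so each day's directory is written by a small number of tasks instead of every task writing a little of every day. The output path and the toy data are invented:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy month of data; dailyDt plays the role of the question's daily partition column.
df = spark.range(0, 10_000).withColumn(
    "dailyDt", F.expr("date_add(date'2024-01-01', cast(id % 30 as int))")
)

(df.repartition("dailyDt")          # co-locate each day's rows before writing
   .write.mode("overwrite")
   .partitionBy("dailyDt")          # one directory per day on disk
   .parquet("/tmp/daily_output"))   # placeholder path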
0 votes · 1 answer · 260 views
How to read parquet files using only one thread on a worker/task node?
In Spark, if we execute the following command:
spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`")
.show(...
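If the goal is a single task for a single file, a hedged option is to coalesce the read down to one partition, or raise spark.sql.files.maxPartitionBytes above the file size so it is not split; the path below is a placeholder for the file in the question:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # a file smaller than this cap is read as one input partition
         .config("spark.sql.files.maxPartitionBytes", str(1024 * 1024 * 1024))
         .getOrCreate())

df = spark.read.parquet("/path/to/one_file.snappy.parquet")   # placeholder path
df.coalesce(1).show(5)    # at most one task downstream of the read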
1 vote · 0 answers · 254 views
How can I reduce the number of Spark tasks when I run a Spark job?
Here are my Spark job stages:
It has 260,000 tasks because the job relies on more than 200,000 small HDFS files, each file about 50 MB and stored in gzip format.
I tried using the following settings ...
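A hedged sketch of the file-packing settings usually tried first for this: raising maxPartitionBytes and openCostInBytes lets Spark bin many small files into one read task (each gzip file still stays whole inside a task, since gzip is not splittable). Path and file format are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
         .config("spark.sql.files.openCostInBytes", str(8 * 1024 * 1024))
         .getOrCreate())

df = spark.read.text("/path/to/many_small_gz_files/")   # placeholder path and format
print(df.rdd.getNumPartitions())    # should be far smaller than the file count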
0 votes · 1 answer · 969 views
How to choose the optimal repartition value in Spark
I have 3 input files:
File1 - 27 GB
File2 - 3 GB
File3 - 12 MB
My cluster configuration:
2 executors
Each executor has 2 cores
Executor memory - 13 GB (2 GB overhead)
The transformation that I'm going to ...
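A short worked version of the arithmetic usually applied, taking the question's numbers as given: 2 executors x 2 cores gives 4 concurrent tasks, and a size-based target of roughly total input / 128 MB keeps individual partitions modest; a common heuristic is the larger of the two:

import math

cores_total = 2 * 2                        # 2 executors x 2 cores
input_mb = 27 * 1024 + 3 * 1024 + 12       # File1 + File2 + File3, in MB
size_based = math.ceil(input_mb / 128)     # about 241 partitions of ~128 MB
parallelism_based = cores_total * 3        # 2-4x the core count is a common rule of thumb

num_partitions = max(size_based, parallelism_based)
print(num_partitions)                      # 241 with these inputs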
0 votes · 0 answers · 117 views
Using repartition in PySpark for a huge set of data
I have a huge amount of data in a few Oracle tables (the total size of the data in these tables is around 50 GB). I have to perform joins and some calculations, and these tables don't have any ...
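A hedged sketch of parallelizing the Oracle reads themselves via the JDBC partitioning options, since repartitioning after a single-threaded JDBC read does not speed up the read. The connection details, table name and bounds column are all invented:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")   # placeholder
      .option("dbtable", "SCHEMA.BIG_TABLE")                      # placeholder
      .option("user", "app_user")                                 # placeholder
      .option("password", "secret")                               # placeholder
      .option("partitionColumn", "numeric_id")    # must be a numeric/date/timestamp column
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")
      .load())

print(df.rdd.getNumPartitions())   # 32 parallel JDBC reads instead of 1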
2 votes · 1 answer · 1k views
Apache Spark - passing a JDBC connection object to executors
I am creating a JDBC connection object in the Spark driver and using it in the executors to access the DB. My concern is: is it the same connection object, or do the executors get a copy of the connection object ...
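A minimal sketch of the usual pattern in code form: a driver-side connection object cannot be shipped to executors as-is (task closures are serialized), so the connection is opened inside foreachPartition on the executor. sqlite3 is only a runnable stand-in for a real database client:

import sqlite3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100)

def write_partition(rows):
    # One connection per partition, created on the executor itself.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER)")
    cur.executemany("INSERT INTO t VALUES (?)", [(r.id,) for r in rows])
    conn.commit()
    conn.close()

df.foreachPartition(write_partition)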
1 vote · 1 answer · 579 views
How does PySpark repartition work without a column name specified?
There are two dataframes, df and df1.
Then, let's consider 3 cases:
df1 only has the same number of rows as df
df1 has the same number of rows as df and the same number of partitions as df. Think df....
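A small sketch of the behaviour at stake in all three cases: repartition(n) with no column is a round-robin shuffle purely by count, so it attaches no hash partitioner and no relationship to any other dataframe; the numbers are arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 12)

# No column given: rows are redistributed round-robin across exactly 4 partitions.
out = df.repartition(4)
out.groupBy(spark_partition_id().alias("pid")).count().show()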
2 votes · 2 answers · 988 views
Apache Spark: what happens when repartition($"key") is called and the size of all records per key is greater than the size of a single partition?
Suppose I have a 10 GB dataframe where one of the columns, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). Suppose I call repartition($"...
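A brief PySpark rendering of the scenario: hash partitioning on a column whose value is identical for every record sends all rows to one partition, regardless of the 128 MB default, which is the crux of the question. The data volume is shrunk to keep the example runnable:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, spark_partition_id

spark = SparkSession.builder.getOrCreate()

# Toy version of the 10 GB dataframe: c1 has the same value for every record.
df = spark.range(0, 10_000).withColumn("c1", lit("same_value"))

(df.repartition(8, "c1")                              # hash partition on c1
   .groupBy(spark_partition_id().alias("pid"))
   .count()
   .show())                                           # a single pid holds every row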
3 votes · 3 answers · 3k views
Can coalesce() increase the number of partitions of a Spark DataFrame?
I am trying to understand the difference between coalesce() and repartition().
If I correctly understood this answer, coalesce() can only reduce the number of partitions of a dataframe, and if we try to ...
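A two-line check of exactly this behaviour: coalesce() only merges existing partitions, so asking it for more than the current count leaves the count unchanged, while repartition() shuffles to the requested number:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 100).repartition(4)

print(df.coalesce(8).rdd.getNumPartitions())      # still 4: coalesce cannot add partitions
print(df.repartition(8).rdd.getNumPartitions())   # 8: repartition shuffles to the new count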