1 vote
1 answer
60 views

Let's say I have a very large dataframe df and two smaller dataframes df1 and df2. df is joined with df1 on key1, and with df2 on key2 and key3. Now I know salting ...
asked by PAMPA ROY
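For the salted-join idea mentioned above, here is a minimal PySpark sketch; the dataframes, the key1 column, and the number of salt buckets are placeholders, not the asker's actual data.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder dataframes standing in for the question's df (large) and df1 (small).
df = spark.range(1_000_000).withColumn("key1", (F.col("id") % 10).cast("string"))
df1 = spark.createDataFrame([(str(i), f"v{i}") for i in range(10)], ["key1", "val"])

SALT_BUCKETS = 8  # assumed bucket count; tune to the observed skew

# Salt the large side: append a random suffix so one hot key spreads over many partitions.
df_salted = df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key1"), (F.rand() * SALT_BUCKETS).cast("int").cast("string")))

# Explode the small side so every possible salt value has a matching row.
salts = F.array([F.lit(str(i)) for i in range(SALT_BUCKETS)])
df1_salted = (df1.withColumn("salt", F.explode(salts))
                 .withColumn("salted_key", F.concat_ws("_", F.col("key1"), F.col("salt"))))

joined = df_salted.join(df1_salted, on="salted_key", how="inner")
```

The same pattern applies to the join with df2, salting the composite of key2 and key3 instead of a single column.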
0 votes
1 answer
83 views

Spark is lazily evaluated, so how does rdd.getNumPartitions() return the correct partition count BEFORE an action is called? df1 = read_file('s3file1') df2 = read_file('file2') print('df1 ...
asked by kyl
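A small sketch of why this works, assuming the files are loaded with the DataFrame reader: the partition count comes from the input splits planned at read time, so no action is needed to answer it. The paths below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading only plans the scan (file listing aside); it does not execute it.
df1 = spark.read.text("s3://bucket/s3file1")
df2 = spark.read.text("s3://bucket/file2")

# The partition count is part of the planned physical read, not a computed result,
# so it is available before any action such as count() or show().
print("df1 partitions:", df1.rdd.getNumPartitions())
print("df2 partitions:", df2.rdd.getNumPartitions())
```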
0 votes
1 answer
80 views

I have a job running on AWS Glue 3.0 with G.8X workers, using a 100-worker configuration. In recent runs, count() was causing OOM, and I figured repartitioning might help. I read we have to keep ...
asked by Vijeth Kashyap
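A minimal sketch of the repartition-before-aggregation idea; the input path and the target partition count are assumptions, and the right number depends on the worker type and data volume.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://bucket/input/")   # assumed input location

# Spread the data over more, smaller partitions before the wide operation;
# 800 is a placeholder, often chosen as a small multiple of the total core count.
df = df.repartition(800)
print(df.count())
```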
1 vote
0 answers
90 views

I am new to GCP and probably have a very basic question. We are running our PySpark jobs on an ephemeral Dataproc cluster with autoscaling enabled. In our code we have used ...
asked by Kaushik Ghosh
0 votes
0 answers
320 views

I am running a Spark job and for the most part it runs fast, but it gets stuck on the last task of one of the stages. I can see a lot more shuffle read/rows happening for that task and tried a ...
asked by user23202697
1 vote
0 answers
378 views

Suppose we are using Spark on top of Hive, specifically the SQL API. Now suppose we have a table A with two partition columns, part1 and part2, and that we are insert-overwriting into A with dynamic ...
asked by aaa
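A sketch of dynamic-partition insert overwrite into a table like A; the source table and column names are placeholders. The Hive settings shown are the usual prerequisites, and spark.sql.sources.partitionOverwriteMode=dynamic is the switch that, for datasource tables, makes the overwrite replace only the partitions actually produced by the query.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# With dynamic overwrite, only partitions produced by the SELECT are replaced.
spark.sql("""
    INSERT OVERWRITE TABLE A PARTITION (part1, part2)
    SELECT col1, col2, part1, part2        -- placeholder column list
    FROM staging_table                     -- placeholder source table
""")
```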
0 votes
0 answers
45 views

I am trying to migrate a query from SQL Server to Spark SQL. It runs fine on SQL Server but has issues in Spark SQL. I found that Spark SQL does not support subqueries, but I am avoiding to ...
asked by jarry jafery
5 votes
2 answers
8k views

I'm running into an issue with a Spark job that fails roughly every 2nd time with the following error message: org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage ...
asked by Martin Studer
0 votes
2 answers
644 views

repartition() creates partitions in memory and is used as a read operation. partitionBy() creates partitions on disk and is used as a write operation. How can we confirm there are multiple files in ...
asked by Blue Clouds
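One way to see the difference on disk, as a local-mode sketch with made-up paths and data: repartition() controls how many files each write produces, while partitionBy() creates a sub-directory per column value.

```python
import os
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(100).withColumn(
    "country", F.when((F.col("id") % 2) == 0, "IN").otherwise("US"))

# repartition(4) -> four in-memory partitions -> four part files in the output.
df.repartition(4).write.mode("overwrite").parquet("/tmp/repartition_demo")

# partitionBy("country") -> one sub-directory per distinct country value.
df.write.mode("overwrite").partitionBy("country").parquet("/tmp/partitionby_demo")

print(os.listdir("/tmp/repartition_demo"))     # part-00000 ... part-00003 (+ _SUCCESS)
print(os.listdir("/tmp/partitionby_demo"))     # country=IN, country=US (+ _SUCCESS)
```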
1 vote
0 answers
423 views

I am applying a pandas UDF to a grouped dataframe in Databricks. When I do this, a couple of tasks hang forever, while the rest complete quickly. I start by repartitioning my dataset so that each group ...
asked by Gary
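For reference, a minimal grouped pandas UDF via applyInPandas; the data and schema are made up. Each group is processed by a single task, which is why one very large or slow group can leave a task hanging long after the rest finish.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2.0), (1, 3.0), (2, 5.0)], ["group_id", "value"])

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # All rows of one group arrive in a single pandas DataFrame inside one task.
    return pd.DataFrame({"group_id": [pdf["group_id"].iloc[0]],
                         "mean_value": [pdf["value"].mean()]})

result = (df.groupBy("group_id")
            .applyInPandas(summarize, schema="group_id long, mean_value double"))
result.show()
```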
0 votes
1 answer
2k views

I have a requirement where I have a huge dataset of over 2 trillion records, which comes as the result of a join. After this join, I need to aggregate on a column (the 'id' column) and get a list of ...
asked by Praveen Kumar B N
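A sketch of the aggregation step described above, with a toy stand-in for the joined data; collect_list gathers the values per id, and repartitioning on the grouping key first is a common way to spread the work (very hot ids can still overload a single task).

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the multi-trillion-row join result.
joined = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "item"])

result = (joined.repartition("id")            # co-locate rows of the same id
                .groupBy("id")
                .agg(F.collect_list("item").alias("items")))
result.show()
```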
0 votes
1 answer
497 views

Is it possible to export published SQL files in your Synapse workspace to your sandbox environment via code and without the use of pipelines? If not, is it somehow possible to access your published SQL ...
asked by ByronSchuurman
0 votes
0 answers
162 views

I am trying to convert a fairly large (34 GB) fixed-width file into a structured format using PySpark, but my job is taking too long to complete (almost 10+ hours). The file has very long lines of almost 50K characters ...
asked by Sanjay Bagal
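A sketch of one common way to parse fixed-width data in PySpark: read each line as text and slice it with substring. The field layout below is invented; the real 50K-character record would have many more slices, and the paths are placeholders.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# (column name, 1-based start position, length) -- placeholder layout
layout = [("cust_id", 1, 10), ("name", 11, 30), ("amount", 41, 12)]

raw = spark.read.text("/path/to/fixed_width_file")   # one row per line, column "value"

parsed = raw.select([
    F.substring(F.col("value"), start, length).alias(name)
    for name, start, length in layout
])
parsed.write.mode("overwrite").parquet("/path/to/output")   # placeholder output path
```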
0 votes
1 answer
1k views

Can someone explain to me how Spark determines the number of tasks when reading data? How is it related to the number of partitions of the input file and the number of cores? I have a dataset (91 MB) ...
asked by Pawel
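A small probe that shows the main factor for file sources: input splits are sized by spark.sql.files.maxPartitionBytes (128 MB by default), so a 91 MB dataset typically becomes a single read task regardless of the core count. The path below is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(spark.conf.get("spark.sql.files.maxPartitionBytes"))   # 134217728b by default

df = spark.read.parquet("/path/to/91mb_dataset")             # placeholder path
# Roughly ceil(total size / maxPartitionBytes), also influenced by the file count
# and spark.sql.files.openCostInBytes.
print(df.rdd.getNumPartitions())
```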
2 votes
1 answer
4k views

As per the documentation, spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user. spark.default....
asked by figs_and_nuts
3 votes
1 answer
1k views

What I understand is: when we repartition any dataframe to n partitions, the data will continue to remain on those n partitions until you hit a shuffle stage or another repartition or coalesce. For ...
asked by Rushabh Gujarathi
0 votes
1 answer
306 views

I need to merge small Parquet files. I have multiple small Parquet files in HDFS and would like to combine them into files of nearly 128 MB each. So I read all the files using spark.read() and did ...
asked by pavan kumar
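A minimal compaction sketch for the small-files case; the directories and the target file count are assumptions (target is roughly total input size divided by 128 MB).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("hdfs:///data/small_files/")   # assumed input directory

TARGET_FILES = 80   # e.g. ~10 GB of data / 128 MB per file

(df.coalesce(TARGET_FILES)                 # coalesce merges partitions without a full shuffle
   .write.mode("overwrite")
   .parquet("hdfs:///data/compacted/"))    # assumed output directory
```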
0 votes
0 answers
362 views

I have 2 very large tables which I am loading as dataframes in Parquet format, with one join key. The issue I need help with: I need to tune the job, as I am getting OOM errors due to Java heap space. ...
asked by Red Maple
0 votes
1 answer
412 views

I have a performance problem with the repartition and partitionBy operations in Spark. My df contains monthly data and I am partitioning it by day using the dailyDt column. My code is like below. First ...
asked by scalactic
0 votes
1 answer
260 views

In Spark, if we execute the following command: spark.sql("select * from parquet.`/Users/MyUser/TEST/testcompression/part-00009-asdfasdf-e829-421d-b14f-asdfasdf.c000.snappy.parquet`") .show(...
asked by sojim2
1 vote
0 answers
254 views

Here are my Spark job's stages: it has 260,000 tasks because the job relies on more than 200,000 small HDFS files, each about 50 MB and stored in gzip format. I tried using the following settings ...
asked by xyfs
0 votes
1 answer
969 views

I have 3 input files: File1 - 27 GB, File2 - 3 GB, File3 - 12 MB. My cluster configuration: 2 executors, each with 2 cores, and 13 GB executor memory (2 GB overhead). The transformation that I'm going to ...
asked by Praveen Kumar
0 votes
0 answers
117 views

I have a huge amount of data in a few Oracle tables (the total size of data in these tables is around 50 GB). I have to perform joins and some calculations, and these tables don't have any ...
asked by Sidhant Gupta
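When the source tables have no obvious split column, a bounded numeric or date column still helps: the JDBC reader can issue parallel range queries. Everything below (URL, table, bounds, partition count) is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")   # placeholder URL
      .option("dbtable", "SCHEMA.BIG_TABLE")                      # placeholder table
      .option("user", "app_user")
      .option("password", "secret")
      .option("partitionColumn", "ID")      # needs a roughly uniform numeric/date column
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")        # 32 parallel range queries
      .load())
```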
2 votes
1 answer
1k views

I am creating a JDBC object in the Spark driver and using it in the executors to access the database. My concern is: is it the same connection object, or do the executors get a copy of the connection object ...
asked by Suparn Lele
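The usual pattern is the opposite of sharing a driver-side connection: a connection is not usefully serializable, so each task opens its own, typically once per partition. A sketch using sqlite3 as a stand-in driver; the table and database path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

def save_partition(rows):
    # The connection is created inside the task, on the executor; executors never
    # share the driver's connection object, only a serialized (and unusable) copy.
    import sqlite3                             # stand-in for a real JDBC/DB driver
    conn = sqlite3.connect("/tmp/demo.db")     # placeholder database
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS target (id INTEGER, val TEXT)")
        for row in rows:
            cur.execute("INSERT INTO target VALUES (?, ?)", (row.id, row.val))
        conn.commit()
    finally:
        conn.close()

df.foreachPartition(save_partition)
```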
1 vote
1 answer
579 views

There are two dataframes, df and df1. Let's consider 3 cases: (1) df1 only has the same number of rows as df; (2) df1 has the same number of rows as df and the same number of partitions as df. Think df....
asked by figs_and_nuts
2 votes
2 answers
988 views

Suppose I have a dataframe of 10 GB with one of the columns, "c1", having the same value for every record. Each partition is at most 128 MB (the default value). Suppose I call repartition($"...
asked by Arjunlal M.A
3 votes
3 answers
3k views

I am trying to understand the difference between coalesce() and repartition(). If I correctly understood this answer, coalesce() can only reduce the number of partitions of a dataframe, and if we try to ...
asked by Niketa
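A quick local demonstration of the asymmetry: coalesce() only merges existing partitions (no shuffle) and therefore cannot increase the count, while repartition() shuffles and can go either way. The numbers are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)
original = df.rdd.getNumPartitions()
print(original)

# coalesce() can only reduce: asking for more partitions is silently capped.
print(df.coalesce(2).rdd.getNumPartitions())          # 2
print(df.coalesce(10_000).rdd.getNumPartitions())     # still `original`

# repartition() performs a full shuffle and can increase or decrease the count.
print(df.repartition(200).rdd.getNumPartitions())     # 200
```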