
I am writing files to an S3 bucket with code such as the following:

df.write.format('csv').option('header','true').mode("append").save("s3://filepath")

This writes several files to the S3 bucket, as desired, but each part gets a long generated file name such as:

part-00019-tid-5505901395380134908-d8fa632e-bae4-4c7b-9f29-c34e9a344680-236-1-c000.csv

Is there a way to control the file name, preferably from the PySpark write call? For example:

part-00019-my-output.csv

1 Answer


You can't do that with Spark alone. The long random suffix exists to guarantee uniqueness, so that no file is duplicated or overwritten when many executors write to the same location at the same time.

You'd have to use the AWS SDK to rename those files after the write finishes.
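A minimal sketch of that rename step with boto3 (the AWS SDK for Python). S3 has no true rename, so each file is copied to its new key and the original is deleted; the bucket, prefix, and `my-output.csv` suffix below are placeholders, not values from the question:

```python
import re


def custom_name(part_key, suffix="my-output.csv"):
    """Map a Spark part key like 'out/part-00019-tid-...-c000.csv'
    to 'out/part-00019-my-output.csv', keeping the part number so
    files from parallel writers stay distinct."""
    m = re.search(r"(part-\d+)", part_key)
    if m is None:
        return part_key  # not a part file; leave the key unchanged
    if "/" in part_key:
        prefix = part_key.rsplit("/", 1)[0]
        return f"{prefix}/{m.group(1)}-{suffix}"
    return f"{m.group(1)}-{suffix}"


def rename_parts(bucket, prefix, suffix="my-output.csv"):
    """Copy every part CSV under the prefix to its new key, then
    delete the original (S3 'rename' is always copy + delete)."""
    import boto3  # lazy import: only needed when actually talking to S3

    s3 = boto3.resource("s3")
    for obj in s3.Bucket(bucket).objects.filter(Prefix=prefix):
        if "part-" in obj.key and obj.key.endswith(".csv"):
            new_key = custom_name(obj.key, suffix)
            s3.Object(bucket, new_key).copy_from(
                CopySource={"Bucket": bucket, "Key": obj.key})
            s3.Object(bucket, obj.key).delete()
```

You'd run `rename_parts("my-bucket", "output/")` once the Spark job has committed its output.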

P.S.: If you want a single CSV file, you can use coalesce. But the file name is still not deterministic.

df.coalesce(1).write.format('csv').option('header','true').mode("append").save("s3://filepath")
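Combining the two ideas, here is a hedged sketch: after `coalesce(1)` there is exactly one part CSV under the output prefix, so you can locate it and copy it to an exact name of your choosing. The bucket, prefix, and target name are hypothetical placeholders:

```python
def pick_single_part(keys):
    """Return the lone Spark part CSV among the listed S3 keys
    (ignores markers like _SUCCESS)."""
    parts = [k for k in keys if "part-" in k and k.endswith(".csv")]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0]


def rename_single_csv(bucket, prefix, target="my-output.csv"):
    """Copy the single part file to `prefix + target` and delete the
    original. Assumes the DataFrame was written with coalesce(1)."""
    import boto3  # lazy import so pick_single_part works without AWS

    s3 = boto3.resource("s3")
    keys = [o.key for o in s3.Bucket(bucket).objects.filter(Prefix=prefix)]
    part = pick_single_part(keys)
    s3.Object(bucket, prefix + target).copy_from(
        CopySource={"Bucket": bucket, "Key": part})
    s3.Object(bucket, part).delete()
```

Note that `coalesce(1)` funnels the whole write through one task, so this is only practical for outputs small enough to fit on a single executor.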
