
I am trying to export data from a Spark DataFrame to a .csv file:

df.coalesce(1)\
  .write\
  .format("com.databricks.spark.csv")\
  .option("header", "true")\
  .save(output_path)

It creates a file named "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv".

I want the file name to be "part-r-00000.csv" or "part-00000.csv".

As the file is being created on AWS S3, I am limited in how I can use os.system commands.

How can I set the file name while keeping the header in the file?

Thanks!

  • Does it matter what it's called on HDFS if you can pull it down and name it whatever you like? hadoop fs -cat path/to/output/part-r-* > path/to/local/file.csv will dump all the parts from Hadoop into one file on your local disk. Commented Feb 6, 2018 at 21:17
  • I'm trying to build an automated PySpark ETL job, hence I'm avoiding all sorts of manual and local system interventions. My ask is simple: when exporting data from a Spark DataFrame to a .csv file, the file name should be 'part-00000'. Appreciate your response. Commented Feb 6, 2018 at 21:59
  • I've never seen this type of filename before. Maybe the file already exists and this long name was made to avoid overwriting the file? Commented Feb 6, 2018 at 22:00
  • no, I did cleanup the target folder and reran the job. that didn't help. Commented Feb 6, 2018 at 22:09
  • AFAIK, the file name is part-#{partition number}-#{random uuid}-#{something}; the UUID is to allow multiple executors to write to the same directory without worrying about trying to write to the same file. Commented Dec 28, 2018 at 1:33

1 Answer


Well, though my question got a -3 rating, here I'm posting the solution that helped me address the problem. Being a techie, I always care more about code and logic than grammar. At least for me, a small amount of context should be enough to understand the problem.

Coming to the solution:

When we create a .csv file from a Spark DataFrame, the output file is by default named part-x-yyyyy, where:

1) x is either 'm' or 'r', depending on whether the job was map-only or included a reduce phase
2) yyyyy is the mapper or reducer task number, which may be 00000 or some other number

To rename the output file, running an HDFS command through os.system should do.

import os

output_path_stage = ""  # set the source folder path here (with trailing slash)
output_path = ""        # set the target folder path here (with trailing slash)

# build the system command line
cmd2 = "hdfs dfs -mv " + output_path_stage + "part-*" + " " + output_path + "new_name.csv"
# execute the system command
os.system(cmd2)

FYI: if we use rdd.saveAsTextFile, the file is created with no header. If we use coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(output_path), the file is created with a random part-x name. The solution above lets us create a .csv file with a header and delimiter, along with the required file name.
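If shelling out to hdfs dfs isn't an option (for instance when the output lands on S3, as in the question), the same rename can be done through Hadoop's FileSystem API via Spark's JVM gateway. A minimal sketch, assuming an active SparkSession named spark; the s3a:// paths are hypothetical placeholders:

# Grab Hadoop's Path class and configuration through the Py4J gateway
Path = spark._jvm.org.apache.hadoop.fs.Path
hadoop_conf = spark._jsc.hadoopConfiguration()

stage_dir = Path("s3a://my-bucket/stage/")  # hypothetical staging prefix
fs = stage_dir.getFileSystem(hadoop_conf)

# Locate the lone part file written by coalesce(1) and rename it
# (on S3, rename is implemented as copy + delete)
for status in fs.globStatus(Path("s3a://my-bucket/stage/part-*")):
    fs.rename(status.getPath(), Path("s3a://my-bucket/output/new_name.csv"))

This avoids os.system entirely and works against any filesystem Hadoop can talk to.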


4 Comments

Hey dude, sorry people dinged you for grammar. That's bad behavior on their part.
Hmm. Shoot. I came across your question and was hoping to be able to set it, rather than use commands to rename on the filesystem. Time to go digging!
It's not looking like it's possible with the libraries as-is, but people point to this for more info about adding this functionality: wiki.apache.org/hadoop/…
Would you be able to locate the output file and rename it? e.g. use glob to get the file and then rename or create a copy of it with the intended filename? (A sketch of this approach follows.)
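For the local-filesystem case this last comment describes, a sketch of the glob-and-rename approach; the directory and file names are hypothetical:

import glob
import os

# Hypothetical local directory written by df.coalesce(1).write...save(output_path)
output_dir = "/tmp/spark_output/"

# coalesce(1) should leave exactly one part file in the directory
part_file = glob.glob(os.path.join(output_dir, "part-*"))[0]
os.rename(part_file, os.path.join(output_dir, "new_name.csv"))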
