
I am trying to export data from a Spark DataFrame to a .csv file:

df.coalesce(1)\
  .write\
  .format("com.databricks.spark.csv")\
  .option("header", "true")\
  .save(output_path)

It creates a file named "part-r-00001-512872f2-9b51-46c5-b0ee-31d626063571.csv".

I want the file name to be "part-r-00000.csv" or "part-00000.csv".

As the file is being created on AWS S3, I am limited in how I can use os.system commands.

How can I set the file name while keeping the header in the file?

Thanks!

  • Does it matter what it's called on HDFS if you can pull it down and name it whatever you like? hadoop fs -cat path/to/output/part-r-* > path/to/local/file.csv will dump all the parts from Hadoop into one file on your local disk. Commented Feb 6, 2018 at 21:17
  • I'm trying to build an automated PySpark ETL job, hence I'm avoiding all sorts of manual and local system interventions. My ask is simple: when exporting data from a Spark DataFrame to a .csv file, the file name should be 'part-00000'. Appreciate your response. Commented Feb 6, 2018 at 21:59
  • I've never seen this type of filename before. Maybe the file already exists and this long name was made to avoid overwriting the file? Commented Feb 6, 2018 at 22:00
  • no, I did cleanup the target folder and reran the job. that didn't help. Commented Feb 6, 2018 at 22:09
  • AFAIK, the file name is part-#{partition number}-#{random uuid}-#{something}; the UUID is to allow multiple executors to write to the same directory without worrying about trying to write to the same file. Commented Dec 28, 2018 at 1:33

1 Answer


Well, though my question got a -3 rating, here I'm posting the solution that helped me address the problem. Being a techie, I always care more about code and logic than grammar. At least for me, a small amount of context should be enough to understand the problem.

Coming to the solution:

When we create a .csv file from a Spark DataFrame, the output file is by default named part-x-yyyyy, where:

1) x is either 'm' or 'r', depending on whether the job was map-only or included a reduce phase
2) yyyyy is the mapper or reducer task number, which may be 00000 or some other number

To rename the output file, running an HDFS command through os.system should do.

import os

output_path_stage = ""  # set the source folder path here (with trailing slash)
output_path = ""        # set the target folder path here (with trailing slash)

# build the system command line
cmd2 = "hdfs dfs -mv " + output_path_stage + "part-*" + " " + output_path + "new_name.csv"
# execute the system command
os.system(cmd2)

FYI: if we use rdd.saveAsTextFile, the file is created with no header. If we use coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(output_path), the file is created with a random part-x name. The solution above lets us create a .csv file with a header and delimiter, along with the required file name.
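If shelling out to hdfs dfs isn't an option (for instance when the output lands on S3, as in the question), the same rename can be done through Hadoop's FileSystem API via Spark's JVM gateway. A minimal sketch, assuming an active SparkSession named spark; the s3a:// paths are hypothetical placeholders:

# Grab Hadoop's Path class and configuration through the Py4J gateway
Path = spark._jvm.org.apache.hadoop.fs.Path
hadoop_conf = spark._jsc.hadoopConfiguration()

stage_dir = Path("s3a://my-bucket/stage/")  # hypothetical staging prefix
fs = stage_dir.getFileSystem(hadoop_conf)

# Locate the lone part file written by coalesce(1) and rename it
# (on S3, rename is implemented as copy + delete)
for status in fs.globStatus(Path("s3a://my-bucket/stage/part-*")):
    fs.rename(status.getPath(), Path("s3a://my-bucket/output/new_name.csv"))

This avoids os.system entirely and works against any filesystem Hadoop can talk to.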


4 Comments

Hey dude, sorry people dinged you for grammar. That's bad behavior on their part.
Hmm. Shoot. I came across your question and was hoping to be able to set it, rather than use commands to rename on the filesystem. Time to go digging!
It's not looking like it's possible with the libraries as-is, but people point to this for more info about adding this functionality: wiki.apache.org/hadoop/…
Would you be able to locate the output file and rename it? e.g. use glob to get the file and then rename or create a copy of it with the intended filename? (A sketch of this approach follows.)
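For the local-filesystem case this last comment describes, a sketch of the glob-and-rename approach; the directory and file names are hypothetical:

import glob
import os

# Hypothetical local directory written by df.coalesce(1).write...save(output_path)
output_dir = "/tmp/spark_output/"

# coalesce(1) should leave exactly one part file in the directory
part_file = glob.glob(os.path.join(output_dir, "part-*"))[0]
os.rename(part_file, os.path.join(output_dir, "new_name.csv"))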
