
Here is the Spark DataFrame I want to save as a CSV.

type(MyDataFrame)
# Output: <class 'pyspark.sql.dataframe.DataFrame'>

To save this as a CSV, I have the following code:

MyDataFrame.write.csv(csv_path, mode='overwrite', header=True)

When I save this, the file name is something like this:

part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv

Is there a way to give it a custom name while saving, like "MyDataFrame.csv"?

3 Answers


I had the same requirement. You can write to one path and then rename the file. This is my solution.

def write_to_hdfs_specify_path(df, spark, hdfs_path, file_name):
    """
    :param df: DataFrame to save
    :param spark: SparkSession
    :param hdfs_path: target directory (should not already exist)
    :param file_name: desired CSV file name
    :return: True if the rename succeeded
    """
    sc = spark.sparkContext
    # Reach into the JVM gateway for the Hadoop filesystem classes
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
    # coalesce(1) produces a single part file under hdfs_path
    df.coalesce(1).write \
        .option("header", True) \
        .option("delimiter", "|") \
        .option("compression", "none") \
        .csv(hdfs_path)
    fs = FileSystem.get(Configuration())
    # Locate the generated part file and rename it to the requested name
    file = fs.globStatus(Path("%s/part*" % hdfs_path))[0].getPath().getName()
    full_path = "%s/%s" % (hdfs_path, file_name)
    result = fs.rename(Path("%s/%s" % (hdfs_path, file)), Path(full_path))
    return result
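
For example, a call might look like this (the output path here is hypothetical):

write_to_hdfs_specify_path(MyDataFrame, spark, "/data/mydataframe_out", "MyDataFrame.csv")

Note that Spark may also leave a _SUCCESS marker and checksum files in the directory; only the part file is renamed.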

1 Comment

I am uploading it directly to S3, and on the file = fs.globStatus(...) line I get this error: IllegalArgumentException: Wrong FS: s3a://my-bucket/2022/10, expected: file:///. Any workaround for this?
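
A likely cause, as a hedged note: FileSystem.get(Configuration()) resolves the default (local) filesystem, not the s3a one. A possible fix (an untested sketch, reusing the names bound in the helper above) is to derive the filesystem from the target path using Spark's Hadoop configuration:

# Resolve the FileSystem matching the target path's scheme (e.g. s3a://)
# instead of the local default; Path is the JVM class bound earlier.
fs = Path(hdfs_path).getFileSystem(sc._jsc.hadoopConfiguration())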

No. That's how Spark works (at least for now). MyDataFrame.csv would be a directory name, and under that directory you'd have multiple files named like part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv, part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c001.csv, etc.

It's not recommended, but if your data is small enough (what counts as "small enough" here is debatable), you can always convert it to pandas and save it to a single CSV file with any name you want.
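
A minimal sketch of that approach, assuming the data fits in driver memory (the file name is just an example):

# Collect the distributed DataFrame to the driver as a pandas DataFrame,
# then write a single CSV with any name you like. Only safe for small data.
MyDataFrame.toPandas().to_csv("MyDataFrame.csv", index=False)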



.coalesce(1) will guarantee that there is only one file, but it will not guarantee the file name. Save to a temporary directory, then rename and copy the file (using the dbutils.fs functions if you are on Databricks, or FileUtil from the Hadoop API).
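
A rough sketch of the temp-directory approach on Databricks (the paths are hypothetical, and dbutils.fs is only available there):

tmp_dir = "/tmp/mydataframe_tmp"
MyDataFrame.coalesce(1).write.csv(tmp_dir, mode="overwrite", header=True)
# Find the single part file Spark produced and move it to the final name
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, "/data/MyDataFrame.csv")
dbutils.fs.rm(tmp_dir, True)  # clean up the temporary directory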

