
Here is the Spark DataFrame I want to save as a CSV.

type(MyDataFrame)
# Output: <class 'pyspark.sql.dataframe.DataFrame'>

To save this as a CSV, I have the following code:

MyDataFrame.write.csv(csv_path, mode='overwrite', header=True)

When I save this, the file name is something like this:

part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv

Is there a way to give it a custom name while saving, like "MyDataFrame.csv"?

3 Answers


I had the same requirement. You can write to one path and then rename the file. This is my solution.

def write_to_hdfs_specify_path(df, spark, hdfs_path, file_name):
    """
    :param df: DataFrame to save
    :param spark: SparkSession
    :param hdfs_path: target directory (should not already exist)
    :param file_name: desired CSV file name
    :return: True if the rename succeeded
    """
    sc = spark.sparkContext
    # Reach into the JVM gateway for the Hadoop filesystem classes
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
    # coalesce(1) produces a single part file under hdfs_path
    df.coalesce(1).write \
        .option("header", True) \
        .option("delimiter", "|") \
        .option("compression", "none") \
        .csv(hdfs_path)
    fs = FileSystem.get(Configuration())
    # Locate the generated part file and rename it to the requested name
    file = fs.globStatus(Path("%s/part*" % hdfs_path))[0].getPath().getName()
    full_path = "%s/%s" % (hdfs_path, file_name)
    result = fs.rename(Path("%s/%s" % (hdfs_path, file)), Path(full_path))
    return result
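
For example, a call might look like this (the output path here is hypothetical):

write_to_hdfs_specify_path(MyDataFrame, spark, "/data/mydataframe_out", "MyDataFrame.csv")

Note that Spark may also leave a _SUCCESS marker and checksum files in the directory; only the part file is renamed.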

1 Comment

I am uploading it directly to S3, and on the file = fs.globStatus(...) line I get this error: IllegalArgumentException: Wrong FS: s3a://my-bucket/2022/10, expected: file:///. Any workaround for this?
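
A likely cause, as a hedged note: FileSystem.get(Configuration()) resolves the default (local) filesystem, not the s3a one. A possible fix (an untested sketch, reusing the names bound in the helper above) is to derive the filesystem from the target path using Spark's Hadoop configuration:

# Resolve the FileSystem matching the target path's scheme (e.g. s3a://)
# instead of the local default; Path is the JVM class bound earlier.
fs = Path(hdfs_path).getFileSystem(sc._jsc.hadoopConfiguration())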

No. That's how Spark works (at least for now). MyDataFrame.csv would be a directory name, and under that directory you'd have multiple files named like part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv, part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c001.csv, etc.

It's not recommended, but if your data is small enough (what counts as "small enough" here is debatable), you can always convert it to pandas and save it to a single CSV file with any name you want.
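
A minimal sketch of that approach, assuming the data fits in driver memory (the file name is just an example):

# Collect the distributed DataFrame to the driver as a pandas DataFrame,
# then write a single CSV with any name you like. Only safe for small data.
MyDataFrame.toPandas().to_csv("MyDataFrame.csv", index=False)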



.coalesce(1) will guarantee that there is only one file, but it will not guarantee the file name. Save to a temporary directory, then rename and copy the file (using the dbutils.fs functions if you are on Databricks, or FileUtil from the Hadoop API).
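
A rough sketch of the temp-directory approach on Databricks (the paths are hypothetical, and dbutils.fs is only available there):

tmp_dir = "/tmp/mydataframe_tmp"
MyDataFrame.coalesce(1).write.csv(tmp_dir, mode="overwrite", header=True)
# Find the single part file Spark produced and move it to the final name
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, "/data/MyDataFrame.csv")
dbutils.fs.rm(tmp_dir, True)  # clean up the temporary directory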

