
I am using the PySpark SQL function input_file_name to add the input file name as a DataFrame column.

df = df.withColumn("filename",input_file_name())

The column now has values like the following: "abc://dev/folder1/date=20200813/id=1"

From this column I have to create 2 new columns:

  1. Date
  2. ID

I need to extract only the date and the id from the file name above and populate them into those two columns.

I could use split and index into the result, but if the folder structure changes, the positions would break.

Is there a way to check whether the file name contains the strings "date" and "id", take the values after the equals sign, and populate them into the two new columns?

Below is the expected output.

filename                              date      id
abc://dev/folder1/date=20200813/id=1  20200813  1
  • I don't have experience with Spark, just curious: why can't you do 'date' in string and 'id' in string? Commented Aug 13, 2020 at 5:38

2 Answers


You could use regexp_extract with lookbehind patterns that match the digits after the date= and id= substrings:

from pyspark.sql import Row
import pyspark.sql.functions as f

df = sc.parallelize(['abc://dev/folder1/date=20200813/id=1',
                     'def://dev/folder25/id=3/date=20200814'])\
       .map(lambda l: Row(file=l)).toDF()
df.show(truncate=False)

+-------------------------------------+
|file                                 |
+-------------------------------------+
|abc://dev/folder1/date=20200813/id=1 |
|def://dev/folder25/id=3/date=20200814|
+-------------------------------------+

df = df.withColumn('date', f.regexp_extract(f.col('file'), '(?<=date=)[0-9]+', 0))\
       .withColumn('id', f.regexp_extract(f.col('file'), '(?<=id=)[0-9]+', 0))
df.show(truncate=False)

Which outputs:

+-------------------------------------+--------+---+
|file                                 |date    |id |
+-------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1 |20200813|1  |
|def://dev/folder25/id=3/date=20200814|20200814|3  |
+-------------------------------------+--------+---+
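As a side note (not part of the original answer), the same lookbehind patterns can be checked locally with plain Python's re module before running anything on a cluster. One difference worth knowing: re.search returns None when there is no match, whereas Spark's regexp_extract returns an empty string.

```python
import re

# The same lookbehind patterns used with regexp_extract above,
# tried locally against sample paths.
date_pat = re.compile(r'(?<=date=)[0-9]+')
id_pat = re.compile(r'(?<=id=)[0-9]+')

paths = [
    'abc://dev/folder1/date=20200813/id=1',
    'def://dev/folder25/id=3/date=20200814',
]
for p in paths:
    date_m = date_pat.search(p)
    id_m = id_pat.search(p)
    # re.search gives None on no match; regexp_extract would give ''.
    print(date_m.group(0) if date_m else '', id_m.group(0) if id_m else '')
# 20200813 1
# 20200814 3
```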



I used withColumn and split to break the column value into date and id, creating them as new columns in the same DataFrame; the code snippet is below:

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import split

adata = [("abc://dev/folder1/date=20200813/id=1",)]
aschema = StructType([StructField("filename", StringType(), True)])
adf = spark.createDataFrame(data=adata, schema=aschema)
bdf = adf.withColumn('date', split(adf['filename'], 'date=').getItem(1)[0:8])\
         .withColumn('id', split(adf['filename'], 'id=').getItem(1))
bdf.show(truncate=False)

Which outputs:

+------------------------------------+--------+---+
|filename                            |date    |id |
+------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1|20200813|1  |
+------------------------------------+--------+---+
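To address the question's concern about changing folder structures, the key=value segments could also be parsed order-independently. Below is a plain-Python sketch of that idea with a hypothetical parse_path helper (not part of either answer; in Spark the same effect comes from per-key regexp_extract):

```python
# Hypothetical helper, for illustration only: split the path into
# segments and keep the key=value pairs, so extraction keeps working
# even if the segments change order or new ones are added.
def parse_path(path):
    parts = {}
    for seg in path.split('/'):
        if '=' in seg:
            key, _, value = seg.partition('=')
            parts[key] = value
    return parts

print(parse_path('abc://dev/folder1/date=20200813/id=1'))
# {'date': '20200813', 'id': '1'}
```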

