
I am using the PySpark SQL function input_file_name to add the input file name as a DataFrame column.

df = df.withColumn("filename",input_file_name())

The column now has values like the following: "abc://dev/folder1/date=20200813/id=1"

From this column I have to create 2 new columns:

  1. Date
  2. ID

I need to extract only the date and the id from the file name above and populate them into those two columns.

I could use split and index into the result, but if the folder structure changes, the positions would break.

Is there a way to check whether the file name contains the strings "date" and "id", take the values after the equals sign, and populate them into the two new columns?

Below is the expected output.

filename                              date      id
abc://dev/folder1/date=20200813/id=1  20200813  1
  • I don't have experience with Spark, just curious: why can't you do 'date' in string and 'id' in string? Commented Aug 13, 2020 at 5:38

2 Answers


You could use regexp_extract with lookbehind patterns that match the digits after the date= and id= substrings:

from pyspark.sql import Row
import pyspark.sql.functions as f

df = sc.parallelize(['abc://dev/folder1/date=20200813/id=1',
                     'def://dev/folder25/id=3/date=20200814'])\
       .map(lambda l: Row(file=l)).toDF()
df.show(truncate=False)

+-------------------------------------+
|file                                 |
+-------------------------------------+
|abc://dev/folder1/date=20200813/id=1 |
|def://dev/folder25/id=3/date=20200814|
+-------------------------------------+

df = df.withColumn('date', f.regexp_extract(f.col('file'), '(?<=date=)[0-9]+', 0))\
       .withColumn('id', f.regexp_extract(f.col('file'), '(?<=id=)[0-9]+', 0))
df.show(truncate=False)

Which outputs:

+-------------------------------------+--------+---+
|file                                 |date    |id |
+-------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1 |20200813|1  |
|def://dev/folder25/id=3/date=20200814|20200814|3  |
+-------------------------------------+--------+---+
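As a side note (not part of the original answer), the same lookbehind patterns can be checked locally with plain Python's re module before running anything on a cluster. One difference worth knowing: re.search returns None when there is no match, whereas Spark's regexp_extract returns an empty string.

```python
import re

# The same lookbehind patterns used with regexp_extract above,
# tried locally against sample paths.
date_pat = re.compile(r'(?<=date=)[0-9]+')
id_pat = re.compile(r'(?<=id=)[0-9]+')

paths = [
    'abc://dev/folder1/date=20200813/id=1',
    'def://dev/folder25/id=3/date=20200814',
]
for p in paths:
    date_m = date_pat.search(p)
    id_m = id_pat.search(p)
    # re.search gives None on no match; regexp_extract would give ''.
    print(date_m.group(0) if date_m else '', id_m.group(0) if id_m else '')
# 20200813 1
# 20200814 3
```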



I used withColumn and split to break the column value into date and id, creating them as new columns in the same DataFrame; the code snippet is below:

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import split

adata = [("abc://dev/folder1/date=20200813/id=1",)]
aschema = StructType([StructField("filename", StringType(), True)])
adf = spark.createDataFrame(data=adata, schema=aschema)
bdf = adf.withColumn('date', split(adf['filename'], 'date=').getItem(1)[0:8])\
         .withColumn('id', split(adf['filename'], 'id=').getItem(1))
bdf.show(truncate=False)

Which outputs:

+------------------------------------+--------+---+
|filename                            |date    |id |
+------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1|20200813|1  |
+------------------------------------+--------+---+
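To address the question's concern about changing folder structures, the key=value segments could also be parsed order-independently. Below is a plain-Python sketch of that idea with a hypothetical parse_path helper (not part of either answer; in Spark the same effect comes from per-key regexp_extract):

```python
# Hypothetical helper, for illustration only: split the path into
# segments and keep the key=value pairs, so extraction keeps working
# even if the segments change order or new ones are added.
def parse_path(path):
    parts = {}
    for seg in path.split('/'):
        if '=' in seg:
            key, _, value = seg.partition('=')
            parts[key] = value
    return parts

print(parse_path('abc://dev/folder1/date=20200813/id=1'))
# {'date': '20200813', 'id': '1'}
```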

