I'm using PySpark and want to add a yyyy_mm_dd string to my DataFrame as a column. I have tried doing it like this:
import pyspark.sql.functions as sf

end_date = '2020-01-20'

final = (
    df1
    .join(df, on=['id', 'product'], how='left_outer')
    .where(sf.col('id').isNotNull())
    .withColumn('status', sf.when(sf.col('count') >= 10, 3)
                            .when((sf.col('count') <= 9) & (sf.col('count') >= 1), 2)
                            .when(sf.col('count').isNull(), 1))
    .withColumn('yyyy_mm_dd', end_date)
)

final.fillna(0, subset=['count']).orderBy('id', 'product').show(500, False)
This works without the last .withColumn, but I run into the below error when I include it:
AssertionError: col should be Column
From the docs, it seems I should be passing a Column as the second parameter to withColumn, but I'm unsure how to convert my date string into a Column. I saw this solution in another post, but I don't want to use current_date() since my end_date var will be read in from a coordinator script.
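My guess is that I need to wrap the string so Spark treats it as a literal column value, maybe with sf.lit(), something like the sketch below (sf.lit() here is just my assumption, I haven't confirmed this is the right approach):

import pyspark.sql.functions as sf

# end_date would normally be passed in from the coordinator script
end_date = '2020-01-20'

# wrap the plain Python string in lit() so withColumn receives a Column, not a str
df_with_date = df1.withColumn('yyyy_mm_dd', sf.lit(end_date))

Is this the correct way to add a fixed date string as a column, or is there a better approach?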