
I'm using PySpark and want to add a yyyy_mm_dd date string to my DataFrame as a column. I have tried doing it like this:

end_date = '2020-01-20'
final = (
    df1
    .join(df, on=['id', 'product'], how='left_outer')
    .where(sf.col('id').isNotNull())
    .withColumn('status', sf.when(sf.col('count') >= 10, 3)
                            .when((sf.col('count') <= 9) & (sf.col('count') >= 1), 2)
                            .when(sf.col('count').isNull(), 1))
    .withColumn('yyyy_mm_dd', end_date)
)
final.fillna(0, subset=['count']).orderBy('id', 'product').show(500, False)

This works without the last .withColumn, but I run into the error below when I include it:

AssertionError: col should be Column

From the docs, it seems I should be passing a Column as the second parameter to withColumn, but I'm unsure how to convert my date string to a Column. I saw a solution in another post, but I don't want to use current_date() since my end_date var will be read in from a coordinator script.

2 Answers


Use lit:

.withColumn('yyyy_mm_dd', sf.lit(end_date))

If you want a date type, you can cast accordingly:

.withColumn('yyyy_mm_dd', sf.lit(end_date).cast("date"))
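For reference, here is a minimal self-contained sketch (assuming sf is the pyspark.sql.functions alias from the question, and using a toy DataFrame in place of the joined result):

from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()

# Toy frame standing in for the joined result in the question
final = spark.createDataFrame([(1, 'a', 12)], ['id', 'product', 'count'])

end_date = '2020-01-20'
# lit() wraps the Python string in a Column; cast('date') converts it to a date type
final = final.withColumn('yyyy_mm_dd', sf.lit(end_date).cast('date'))
final.printSchema()  # yyyy_mm_dd shows up as a date column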


Please check the withColumn documentation. It takes the column name as the first argument and a Column as the second argument. You can use lit() to turn your string into a Column with a constant value.

pyspark.sql.functions.lit(col) Creates a Column of literal value.

df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1)
# [Row(height=5, spark_user=True)]
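To make the difference concrete, here is a short sketch reusing end_date and the sf alias from the question (the DataFrame final is assumed to already exist):

# final.withColumn('yyyy_mm_dd', end_date)   # AssertionError: col should be Column
final = final.withColumn('yyyy_mm_dd', sf.lit(end_date))  # works: lit() returns a Column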

