
Is there a way to replace null values in a column with an empty string when writing a Spark DataFrame to a file?

Sample data:

+----------------+------------------+
|   UNIQUE_MEM_ID|              DATE|
+----------------+------------------+
|            1156|              null|
|            3787|        2016-07-05|
|            1156|              null|
|            5064|              null|
|            5832|              null|
|            3787|              null|
|            5506|              null|
|            7538|              null|
|            7436|              null|
|            5091|              null|
|            8673|              null|
|            2631|              null|
|            8561|              null|
|            3516|              null|
|            1156|              null|
|            5832|              null|
|            2631|        2016-07-07|
+----------------+------------------+
1 Comment

I think @Shu's answer will be quicker than mine; you can crosscheck.

2 Answers


Check this out. You can use when and otherwise.

    from pyspark.sql import functions as F

    df.show()

    #InputDF
    # +-------------+----------+
    # |UNIQUE_MEM_ID|      DATE|
    # +-------------+----------+
    # |         1156|      null|
    # |         3787|2016-07-05|
    # |         1156|      null|
    # +-------------+----------+


    df.withColumn("DATE", F.when(F.col("DATE").isNull(), '').otherwise(F.col("DATE"))).show()

    #OUTPUTDF
    # +-------------+----------+
    # |UNIQUE_MEM_ID|      DATE|
    # +-------------+----------+
    # |         1156|          |
    # |         3787|2016-07-05|
    # |         1156|          |
    # +-------------+----------+

To apply the above logic to all the columns of the dataframe, you can iterate through the columns with a list comprehension and fill in an empty string wherever a column value is null, as shown below.

    df.select(*[F.when(F.col(column).isNull(), '').otherwise(F.col(column)).alias(column) for column in df.columns]).show()
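Since the question asks about writing to a file, here is a minimal end-to-end sketch of this approach (the sample rows are taken from the question's data, and the output path /tmp/output is just a placeholder):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Small DataFrame mimicking the question's sample data
    df = spark.createDataFrame(
        [("1156", None), ("3787", "2016-07-05"), ("5064", None)],
        ["UNIQUE_MEM_ID", "DATE"],
    )

    # Replace nulls in every column with an empty string, then write as CSV
    cleaned = df.select(
        *[F.when(F.col(c).isNull(), "").otherwise(F.col(c)).alias(c) for c in df.columns]
    )
    cleaned.write.mode("overwrite").csv("/tmp/output", header=True)  # placeholder path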

3 Comments

This works, but can we scale it to the entire dataframe without specifying each individual column?
Hi, what is F?
@wawawa, you can import the PySpark SQL functions module and alias it as F: from pyspark.sql import functions as F

Use either the .na.fill() or .fillna() function for this case.

  • If you have all string columns, then df.na.fill('') will replace all nulls with '' on all columns.
  • For int columns, df.na.fill('').na.fill(0) replaces nulls with 0.
  • Another way is to create a dict of columns and replacement values: df.fillna({'col1':'replacement_value',...,'col(n)':'replacement_value(n)'})

Example:

from pyspark.sql.functions import *

df.show()
#+-------------+----------+
#|UNIQUE_MEM_ID|      DATE|
#+-------------+----------+
#|         1156|      null|
#|         3787|      null|
#|         2631|2016-07-07|
#+-------------+----------+

df.na.fill('').show()
df.fillna({'DATE':''}).show()
#+-------------+----------+
#|UNIQUE_MEM_ID|      DATE|
#+-------------+----------+
#|         1156|          |
#|         3787|          |
#|         2631|2016-07-07|
#+-------------+----------+
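And since the goal is a file, a short sketch combining the fill with the write step (the output path is a placeholder):

df.na.fill('').write.mode("overwrite").csv("/tmp/output", header=True)
# or, per column, with the dict variant:
df.fillna({'DATE': ''}).write.mode("overwrite").csv("/tmp/output", header=True)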

2 Comments

Same question @Shu, how can we scale this to all df columns?
@ben, if you have all string columns, then df.na.fill('') will replace all nulls with '' on all columns; for int columns, df.na.fill('').na.fill(0) replaces nulls with 0.
