
I want to create an array column that is conditionally populated based on an existing column, and sometimes I want it to contain None values. Here's some example code:

from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, array, lit
 
spark = SparkSession.builder.getOrCreate()
 
df = spark.createDataFrame([
    Row(ID=1),
    Row(ID=2),
    Row(ID=2),
    Row(ID=1)
])

value_lit = 0.45
size = 10

df = df.withColumn("TEST",when(df["ID"] == 2,array([None for i in range(size)])).otherwise(array([lit(value_lit) for i in range(size)])))

df.show(truncate=False)

And here's the error I'm getting:

TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I know it isn't a string or a column, but I don't see why it has to be. As for the functions the error message suggests:

  • lit: doesn't work.
  • array: I'm not sure how to use array in this context.
  • struct: probably the way to go but I'm not sure how to use it here. Perhaps I have to set an option to allow the new column to contain None values?
  • create_map: I'm not creating a key:value map so I'm sure this is not the correct one to use.
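
For reference, here's a quick sketch of what each of those four helpers actually builds; they all wrap plain Python values into Column expressions (the column names below are just illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Each helper turns plain Python values into a Column expression:
spark.range(1).select(
    F.lit(0.45).alias("a_literal"),                    # single literal value
    F.array(F.lit(1.0), F.lit(2.0)).alias("an_array"), # array of columns/literals
    F.struct(F.lit(1).alias("x")).alias("a_struct"),   # struct of named fields
    F.create_map(F.lit("k"), F.lit(1)).alias("a_map"), # map of key/value pairs
).show()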
  • The lit function is missing from the first array function: df = df.withColumn("TEST", when(df["ID"] == 2, array([lit(None) for i in range(size)])).otherwise(array([lit(value_lit) for i in range(size)])))
  • Tried that before, but I get this nonsense: Traceback (most recent call last): df.show(truncate=False) py4j.protocol.Py4JJavaError: An error occurred while calling o128.showString. : scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.expressions.Cast.castToDouble(Cast.scala:531)
  • What version of Spark are you using? I use Spark 3.1.2 and it works fine.
  • I'm on version 2.4.3.
  • Why would you like to create an array of nulls? What are you trying to solve exactly?

2 Answers


Try this, it works for me (note lit(None) instead of a bare None inside the first array):

from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, array, lit

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(ID=1),
    Row(ID=2),
    Row(ID=2),
    Row(ID=1)
])

value_lit = 0.45
size = 10

df = df.withColumn("TEST",when(df["ID"] == 2,array([lit(None) for i in range(size)])).otherwise(array([lit(value_lit) for i in range(size)])))

df.show(truncate=False)

Output:

(Screenshot of the resulting DataFrame: rows with ID = 2 show an array of ten nulls, the other rows an array of ten 0.45 values.)


2 Comments

I copied your code exactly and I'm still getting the same error I mentioned in the other comment. How strange. Is it my version of PySpark or something that's causing the problem? py4j.protocol.Py4JJavaError: An error occurred while calling o128.showString. : scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) If I put speech marks around it, like lit("null"), then it works, but I don't want that as it'll give me a string and not actual missing values.
This also works if I remove the array: df = df.withColumn("TEST", when(df["ID"] == 2, None).otherwise(array([lit(value_lit) for i in range(size)]))) But this obviously only gives a single null value and not an array. The combination of array and None is causing me issues and I can't understand why, since they both work independently of each other.
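
One way to reconcile array and None is sketched below, under the assumption (suggested by the scala.MatchError: NullType trace, but not verified on 2.4 here) that the failure comes from the when branch producing an untyped array<null> that older Spark versions cannot cast to array<double>. Giving each null literal an explicit type lets both branches agree on a common element type. This reuses df, value_lit and size from the question.

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Cast each null literal to double so the when() branch yields
# array<double> (with null elements) rather than array<null>.
df = df.withColumn(
    "TEST",
    F.when(df["ID"] == 2,
           F.array([F.lit(None).cast(DoubleType()) for _ in range(size)]))
     .otherwise(F.array([F.lit(value_lit) for _ in range(size)]))
)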

The condition must be flipped: F.when(F.col('ID') != 2, value_lit)

If you do this, you don't need otherwise at all: when the when condition is not satisfied, the result is always null.

Also, just one list comprehension is enough.

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
 
df = spark.createDataFrame([(1,), (2,), (2,), (1,)], ['ID'])

value_lit = 0.45
size = 10

df = df.withColumn("TEST", F.array([F.when(F.col('ID') != 2, value_lit) for i in range(size)]))

df.show(truncate=False)
# +---+------------------------------------------------------------+
# |ID |TEST                                                        |
# +---+------------------------------------------------------------+
# |1  |[0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45]|
# |2  |[,,,,,,,,,]                                                 |
# |2  |[,,,,,,,,,]                                                 |
# |1  |[0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45]|
# +---+------------------------------------------------------------+

I've run this code on Spark 2.4.3.

3 Comments

Cool, thanks. So are these missing values the same as null, as per the results in the 1st answer (the same as explicitly setting the value to None)? Also, any idea why my lit(None) wasn't working like it was for everyone else?
Yes, those missing values represent null; this is just how they are shown in your Spark version. The other answer was produced on a newer version.
As for why your lit(None) was not working: I guess that in your setup you may have needed to specify the data type, lit(None).cast('double'). Sometimes you need this. I haven't really tested your version, as I immediately saw a way to do it more efficiently.
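
To make that last suggestion concrete, here is a minimal sketch (the column names are illustrative) showing that the cast gives the null literal a usable data type while the value itself stays null:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# An untyped null literal has NullType; casting gives it a concrete type.
spark.range(1).select(
    F.lit(None).alias("untyped"),
    F.lit(None).cast("double").alias("typed"),
).printSchema()
# 'untyped' is reported as null (or void, depending on the Spark version);
# 'typed' is reported as double.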
