
I want to create an array column that is conditionally populated based on an existing column, and sometimes I want it to contain None values. Here's some example code:

from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, array, lit
 
spark = SparkSession.builder.getOrCreate()
 
df = spark.createDataFrame([
    Row(ID=1),
    Row(ID=2),
    Row(ID=2),
    Row(ID=1)
])

value_lit = 0.45
size = 10

df = df.withColumn("TEST",when(df["ID"] == 2,array([None for i in range(size)])).otherwise(array([lit(value_lit) for i in range(size)])))

df.show(truncate=False)

And here's the error I'm getting:

TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

I know it isn't a string or a column, but I don't see why it has to be. As for the functions the error message suggests:

  • lit: doesn't work.
  • array: I'm not sure how to use array in this context.
  • struct: probably the way to go but I'm not sure how to use it here. Perhaps I have to set an option to allow the new column to contain None values?
  • create_map: I'm not creating a key:value map so I'm sure this is not the correct one to use.
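
For reference, here's a quick sketch of what each of those four helpers actually builds; they all wrap plain Python values into Column expressions (the column names below are just illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Each helper turns plain Python values into a Column expression:
spark.range(1).select(
    F.lit(0.45).alias("a_literal"),                    # single literal value
    F.array(F.lit(1.0), F.lit(2.0)).alias("an_array"), # array of columns/literals
    F.struct(F.lit(1).alias("x")).alias("a_struct"),   # struct of named fields
    F.create_map(F.lit("k"), F.lit(1)).alias("a_map"), # map of key/value pairs
).show()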
  • The lit function is missing from the first array function: df = df.withColumn("TEST", when(df["ID"] == 2, array([lit(None) for i in range(size)])).otherwise(array([lit(value_lit) for i in range(size)])))
  • Tried that before, but I get this nonsense: Traceback (most recent call last): df.show(truncate=False) py4j.protocol.Py4JJavaError: An error occurred while calling o128.showString. : scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.expressions.Cast.castToDouble(Cast.scala:531)
  • What version of Spark are you using? I use Spark 3.1.2 and it works fine.
  • I'm on version 2.4.3.
  • Why would you like to create an array of nulls? What are you trying to solve exactly?

2 Answers


Try this, it works for me (note lit(None) instead of a bare None inside the first array):

from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, array, lit

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    Row(ID=1),
    Row(ID=2),
    Row(ID=2),
    Row(ID=1)
])

value_lit = 0.45
size = 10

df = df.withColumn("TEST",when(df["ID"] == 2,array([lit(None) for i in range(size)])).otherwise(array([lit(value_lit) for i in range(size)])))

df.show(truncate=False)

Output:

(Screenshot of the resulting DataFrame: rows with ID = 2 show an array of ten nulls, the other rows an array of ten 0.45 values.)


2 Comments

I copied your code exactly and I'm still getting the same error I mentioned in the other comment. How strange. Is it my version of PySpark or something that's causing the problem? py4j.protocol.Py4JJavaError: An error occurred while calling o128.showString. : scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) If I put speech marks around it, like lit("null"), then it works, but I don't want that as it'll give me a string and not actual missing values.
This also works if I remove the array: df = df.withColumn("TEST", when(df["ID"] == 2, None).otherwise(array([lit(value_lit) for i in range(size)]))) But this obviously only gives a single null value and not an array. The combination of array and None is causing me issues and I can't understand why, since they both work independently of each other.
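
One way to reconcile array and None is sketched below, under the assumption (suggested by the scala.MatchError: NullType trace, but not verified on 2.4 here) that the failure comes from the when branch producing an untyped array<null> that older Spark versions cannot cast to array<double>. Giving each null literal an explicit type lets both branches agree on a common element type. This reuses df, value_lit and size from the question.

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Cast each null literal to double so the when() branch yields
# array<double> (with null elements) rather than array<null>.
df = df.withColumn(
    "TEST",
    F.when(df["ID"] == 2,
           F.array([F.lit(None).cast(DoubleType()) for _ in range(size)]))
     .otherwise(F.array([F.lit(value_lit) for _ in range(size)]))
)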

The condition must be flipped: F.when(F.col('ID') != 2, value_lit)

If you do this, you don't need otherwise at all: when the when condition is not satisfied, the result is always null.

Also, just one list comprehension is enough.

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
 
df = spark.createDataFrame([(1,), (2,), (2,), (1,)], ['ID'])

value_lit = 0.45
size = 10

df = df.withColumn("TEST", F.array([F.when(F.col('ID') != 2, value_lit) for i in range(size)]))

df.show(truncate=False)
# +---+------------------------------------------------------------+
# |ID |TEST                                                        |
# +---+------------------------------------------------------------+
# |1  |[0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45]|
# |2  |[,,,,,,,,,]                                                 |
# |2  |[,,,,,,,,,]                                                 |
# |1  |[0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45]|
# +---+------------------------------------------------------------+

I've run this code on Spark 2.4.3.

3 Comments

Cool, thanks. So are these missing values the same as null, as per the results in the 1st answer (the same as explicitly setting the value to None)? Also, any idea why my lit(None) wasn't working like it was for everyone else?
Yes, those missing values represent null; this is just how they are shown in your Spark version. The other answer was produced on a newer version.
As for why your lit(None) was not working: I guess that in your setup you may have needed to specify the data type, lit(None).cast('double'). Sometimes you need this. I haven't really tested your version, as I immediately saw a way to do it more efficiently.
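
To make that last suggestion concrete, here is a minimal sketch (the column names are illustrative) showing that the cast gives the null literal a usable data type while the value itself stays null:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# An untyped null literal has NullType; casting gives it a concrete type.
spark.range(1).select(
    F.lit(None).alias("untyped"),
    F.lit(None).cast("double").alias("typed"),
).printSchema()
# 'untyped' is reported as null (or void, depending on the Spark version);
# 'typed' is reported as double.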
