'NoneType' object is not iterable error using udf on ArrayType in PySpark DataFrame

Question

I have a DataFrame with the following schema:

hello.printSchema()
root
 |-- list_a: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- list_b: array (nullable = true)
 |    |-- element: integer (containsNull = true)

and the following sample data:

hello.take(2)
[Row(list_a=[7, 11, 1, 14, 13, 15,999], list_b=[15, 13, 7, 11, 1, 14]),
 Row(list_a=[7, 11, 1, 14, 13, 15], list_b=[11, 1, 7, 14, 15, 13, 12])]

Desired output

Sort list_a and list_b.
Create a new column list_diff such that list_diff = list(set(list_a) - set(list_b)). It should be an empty array if there is no difference.

The approach I have tried is a UDF. I am trying to use the following UDFs:

sort_udf=udf(lambda x: sorted(x), ArrayType(IntegerType()))
differencer=udf(lambda x,y: [elt for elt in x if elt not in y], ArrayType(IntegerType()))

It looks like Python list operations are not supported.

hello = hello.withColumn('sorted', sort_udf(hello.list_a))
hello = hello.withColumn('difference', differencer(hello.list_a, hello.list_b))

The above operations result in the following error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
[Redacted Stack Trace]
TypeError: 'NoneType' object is not iterable

Am I missing anything here?

My bad for the wrong paste. It should be ArrayType(IntegerType()) and not ArrayType(StringType()).
– malhar, Commented Aug 8, 2018 at 17:31
And for sorting the list, you don't need a udf; you can use pyspark.sql.functions.sort_array.
– pault, Commented Aug 8, 2018 at 17:37
Yup, the built-in function pyspark.sql.functions.sort_array works well. With a small change to the sort udf, sort_udf=udf(lambda x: sorted(x) if x else None, ArrayType(IntegerType())), it works too.
– malhar, Commented Aug 8, 2018 at 17:58

pault · Accepted Answer · 2018-08-08 19:30:58Z

The error message:

TypeError: 'NoneType' object is not iterable

is a Python exception (as opposed to a Spark error), which means your code is failing inside your udf. The cause is that some rows in your DataFrame contain null values, so the udf receives None and passes it to sorted:

>>> sorted(None)
TypeError                                 Traceback (most recent call last)
<ipython-input-72-edb1060f46c4> in <module>()
----> 1 sorted(None)

TypeError: 'NoneType' object is not iterable

The way around this is to make your udf robust to bad inputs. In your case, you can change your functions to handle null inputs like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# return None if input is None
sort_udf = udf(lambda x: sorted(x) if x is not None else None, ArrayType(IntegerType()))

# return None if either x or y is None
differencer = udf(
    lambda x, y: [e for e in x if e not in y] if x is not None and y is not None else None,
    ArrayType(IntegerType())
)
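To check the null handling without starting Spark, the same logic can be written as plain Python functions first. This is a minimal sketch: the names safe_sorted and safe_difference are hypothetical, and it assumes the arrays themselves may be null but their elements are not (sorted would still fail on a list that mixes None with integers).

```python
from typing import List, Optional


def safe_sorted(values: Optional[List[int]]) -> Optional[List[int]]:
    """Return a sorted copy of values, or None when the input array is null."""
    if values is None:
        return None
    return sorted(values)


def safe_difference(
    left: Optional[List[int]], right: Optional[List[int]]
) -> Optional[List[int]]:
    """Return the elements of left that are absent from right, in left's order.

    Returns None when either input array is null.
    """
    if left is None or right is None:
        return None
    exclude = set(right)  # set membership avoids rescanning right per element
    return [e for e in left if e not in exclude]


# Values taken from the sample rows in the question.
print(safe_sorted([7, 11, 1, 14, 13, 15, 999]))
print(safe_difference([7, 11, 1, 14, 13, 15, 999], [15, 13, 7, 11, 1, 14]))
print(safe_difference([7, 11, 1, 14, 13, 15], None))
```

These can then be registered exactly like the lambdas above, for example udf(safe_sorted, ArrayType(IntegerType())).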

However, the sort_udf function is not necessary, as you can use pyspark.sql.functions.sort_array() instead.
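A udf-free version might look like the sketch below. sort_array is the function named above; array_except is not part of the original answer and assumes Spark 2.4 or later, where it is available in pyspark.sql.functions. The helper name with_sorted_and_diff and the output column names are hypothetical.

```python
def with_sorted_and_diff(df):
    """Add sorted copies of list_a and list_b plus their difference, without a udf.

    Assumes Spark 2.4+ (for array_except) and a DataFrame with array columns
    named list_a and list_b, as in the question.
    """
    # Deferred import: defining this helper does not itself require Spark.
    from pyspark.sql import functions as F

    return (
        df.withColumn("sorted_a", F.sort_array("list_a"))
          .withColumn("sorted_b", F.sort_array("list_b"))
          # Elements of list_a that do not appear in list_b, without duplicates.
          .withColumn("list_diff", F.array_except("list_a", "list_b"))
    )
```

Calling hello = with_sorted_and_diff(hello) would add the three columns. Because these are built-in column expressions, a null array should yield a null result rather than raising a Python exception, so no explicit None check is needed.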
