
I have a DataFrame with the following schema:

hello.printSchema()
root
 |-- list_a: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- list_b: array (nullable = true)
 |    |-- element: integer (containsNull = true)

and the following sample data:

hello.take(2)
[Row(list_a=[7, 11, 1, 14, 13, 15, 999], list_b=[15, 13, 7, 11, 1, 14]),
 Row(list_a=[7, 11, 1, 14, 13, 15], list_b=[11, 1, 7, 14, 15, 13, 12])]

Desired output:

  1. Sort list_a and list_b.
  2. Create a new column list_diff such that list_diff = list(set(list_a) - set(list_b)), producing an empty array if there is no such difference.
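For concreteness, the set-difference semantics on the two sample rows can be sketched in plain Python, outside Spark:

```python
# Plain-Python sketch of the desired transformation on the two sample rows
row1_a, row1_b = [7, 11, 1, 14, 13, 15, 999], [15, 13, 7, 11, 1, 14]
row2_a, row2_b = [7, 11, 1, 14, 13, 15], [11, 1, 7, 14, 15, 13, 12]

print(sorted(row1_a))                   # [1, 7, 11, 13, 14, 15, 999]
print(list(set(row1_a) - set(row1_b)))  # [999]
print(list(set(row2_a) - set(row2_b)))  # []
```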

The approach I have tried uses the following UDFs:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

sort_udf = udf(lambda x: sorted(x), ArrayType(IntegerType()))
differencer = udf(lambda x, y: [elt for elt in x if elt not in y], ArrayType(IntegerType()))

It looks like Python list operations are not supported:

hello = hello.withColumn('sorted', sort_udf(hello.list_a))
hello = hello.withColumn('difference', differencer(hello.list_a, hello.list_b))

The above operations result in the following error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
[Redacted Stack Trace]
TypeError: 'NoneType' object is not iterable

Am I missing anything here?

  • My bad for the wrong paste. It should be ArrayType(IntegerType()), not ArrayType(StringType()). Commented Aug 8, 2018 at 17:31
  • And for sorting the list, you don't need to use a udf - you can use pyspark.sql.functions.sort_array. Commented Aug 8, 2018 at 17:37
  • Yup, the default function pyspark.sql.functions.sort_array works well. Just a small change to the sorted udf, sort_udf=udf(lambda x: sorted(x) if x else None, ArrayType(IntegerType())), and it works too. Commented Aug 8, 2018 at 17:58

1 Answer


The error message:

TypeError: 'NoneType' object is not iterable

is a Python exception (as opposed to a Spark error), which means your code is failing inside your udf. The issue is that you have some null values in your DataFrame, so when your udf is called you may be passing None to sorted:

>>> sorted(None)
TypeErrorTraceback (most recent call last)
<ipython-input-72-edb1060f46c4> in <module>()
----> 1 sorted(None)

TypeError: 'NoneType' object is not iterable

The way around this is to make your udf robust to bad inputs. In your case, you can change your functions to handle null inputs like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# return None if the input is None
sort_udf = udf(lambda x: sorted(x) if x is not None else None, ArrayType(IntegerType()))

# return None if either x or y is None
differencer = udf(
    lambda x, y: [e for e in x if e not in y] if x is not None and y is not None else None,
    ArrayType(IntegerType())
)

However, the sort_udf function is not necessary, as you can use pyspark.sql.functions.sort_array() instead.
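The None-guard logic in those lambdas can be sanity-checked outside Spark with plain Python, the udf wrapper stripped away:

```python
# The same guard logic as the UDF lambdas above, without the Spark wrapper
sort_fn = lambda x: sorted(x) if x is not None else None
diff_fn = lambda x, y: ([e for e in x if e not in y]
                        if x is not None and y is not None else None)

print(sort_fn([7, 11, 1]))             # [1, 7, 11]
print(sort_fn(None))                   # None
print(diff_fn([7, 11, 999], [7, 11]))  # [999]
print(diff_fn(None, [1, 2]))           # None
```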
