I have a DataFrame with the following schema:
hello.printSchema()
root
|-- list_a: array (nullable = true)
| |-- element: long (containsNull = true)
|-- list_b: array (nullable = true)
| |-- element: integer (containsNull = true)
and the following sample data:
hello.take(2)
[Row(list_a=[7, 11, 1, 14, 13, 15, 999], list_b=[15, 13, 7, 11, 1, 14]),
Row(list_a=[7, 11, 1, 14, 13, 15], list_b=[11, 1, 7, 14, 15, 13, 12])]
Desired output:
- Sort list_a and list_b
- Create a new column list_diff such that list_diff = list(set(list_a) - set(list_b)). It should be an empty ArrayType if no such difference is present.
The approach I have tried is a UDF.
As mentioned above, I am trying to use the following UDFs:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType
sort_udf = udf(lambda x: sorted(x), ArrayType(IntegerType()))
differencer = udf(lambda x, y: [elt for elt in x if elt not in y], ArrayType(IntegerType()))
It looks like Python list operations are not supported:
hello = hello.withColumn('sorted', sort_udf(hello.list_a))
hello = hello.withColumn('difference', differencer(hello.list_a, hello.list_b))
The above operations result in the following error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
[Redacted Stack Trace]
TypeError: 'NoneType' object is not iterable
Am I missing anything here?
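The TypeError in the traceback is the message plain Python gives when sorted() or a list comprehension is handed None, which is what a Python UDF receives for a null array (both columns are nullable = true in the schema). The following is a minimal sketch that reproduces this outside Spark; raises_type_error is a hypothetical helper used only for illustration, and whether the real data contains null rows is an assumption.

```python
# The same lambdas that the UDFs wrap, run outside Spark.
sort_fn = lambda x: sorted(x)
diff_fn = lambda x, y: [elt for elt in x if elt not in y]


def raises_type_error(fn, *args):
    """Return the TypeError message if fn(*args) raises one, else None."""
    try:
        fn(*args)
    except TypeError as exc:
        return str(exc)
    return None


# Ordinary lists work, so Python list operations themselves are fine.
print(sort_fn([7, 11, 1]))               # [1, 7, 11]
print(diff_fn([7, 11, 999], [7, 11]))    # [999]

# A null array arrives in the UDF as None and triggers the error.
print(raises_type_error(sort_fn, None))  # 'NoneType' object is not iterable
print(raises_type_error(diff_fn, None, [1]))
```

If this is the cause, the UDFs need to handle None explicitly rather than avoid list operations.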
- Use ArrayType(IntegerType()) and not ArrayType(StringType()) in the udf.
- You can use pyspark.sql.functions.sort_array.
- pyspark.sql.functions.sort_array works well. With just a small change in the sorted UDF, sort_udf = udf(lambda x: sorted(x) if x else None, ArrayType(IntegerType())), it works too.
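The suggestions above could be put together roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: the function names (sort_list, list_difference, with_udf_columns, with_builtin_columns) are hypothetical, array elements are assumed to be non-null, and the built-in variant assumes a Spark version that provides array_except (2.4 or later). The helpers test for None with "is not None" rather than "if x", so an empty array stays an empty array instead of becoming null.

```python
def sort_list(xs):
    """Sort a list, passing None (a null array) through unchanged."""
    return sorted(xs) if xs is not None else None


def list_difference(xs, ys):
    """Sorted elements of xs that are not in ys.

    Returns an empty list when there is no difference and None when xs is null.
    A null ys is treated as an empty list (an assumption of this sketch).
    """
    if xs is None:
        return None
    return sorted(set(xs) - set(ys or []))


def with_udf_columns(df):
    """Add 'sorted' and 'difference' columns using null-safe Python UDFs."""
    # Imported lazily so the helpers above can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, LongType

    # list_a holds longs, so LongType is used for the returned elements.
    sort_udf = udf(sort_list, ArrayType(LongType()))
    differencer = udf(list_difference, ArrayType(LongType()))
    return (df
            .withColumn('sorted', sort_udf(df.list_a))
            .withColumn('difference', differencer(df.list_a, df.list_b)))


def with_builtin_columns(df):
    """Same columns using built-in functions instead of UDFs."""
    from pyspark.sql.functions import array_except, col, sort_array

    # Cast list_b to the element type of list_a as a precaution,
    # so both arguments of array_except have the same array type.
    list_b_long = col('list_b').cast('array<bigint>')
    return (df
            .withColumn('sorted', sort_array(col('list_a')))
            .withColumn('difference',
                        sort_array(array_except(col('list_a'), list_b_long))))
```

With the sample rows, list_difference returns [999] for the first row and [] for the second. Calling with_udf_columns(hello) or with_builtin_columns(hello) would add the two columns; unlike the UDF helpers, the built-in functions return null when an input array is null.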