I am trying to rewrite a UDF as a pandas UDF.
However, I am struggling to find the right approach when the column involved is of ArrayType.
I have a dataframe as below:
+-----------+--------------------+
| genre| ids|
+-----------+--------------------+
| Crime|[6, 22, 42, 47, 5...|
| Romance|[3, 7, 11, 15, 17...|
| Thriller|[6, 10, 16, 18, 2...|
| Adventure|[2, 8, 10, 15, 29...|
| Children|[1, 2, 8, 13, 34,...|
| Drama|[4, 11, 14, 16, 1...|
| War|[41, 110, 151, 15...|
|Documentary|[37, 77, 99, 108,...|
| Fantasy|[2, 56, 60, 126, ...|
| Mystery|[59, 113, 123, 16...|
+-----------+--------------------+
The following UDF works well:
import itertools
from pyspark.sql.functions import udf

pairs_udf = udf(lambda x: itertools.combinations(x, 2), transformer.schema)
df = df.select("genre", pairs_udf("ids").alias("ids"))
The output is like:
+-----------+--------------------+
| genre| ids|
+-----------+--------------------+
| Crime|[[6, 22], [6, 42]...|
| Romance|[[3, 7], [3, 11],...|
| Thriller|[[6, 10], [6, 16]...|
| Adventure|[[2, 8], [2, 10],...|
| Children|[[1, 2], [1, 8], ...|
| Drama|[[4, 11], [4, 14]...|
| War|[[41, 110], [41, ...|
|Documentary|[[37, 77], [37, 9...|
| Fantasy|[[2, 56], [2, 60]...|
| Mystery|[[59, 113], [59, ...|
+-----------+--------------------+
However, what would be the equivalent when writing the function as a pandas UDF?
PS: I understand that, alternatively, I could use a cross join to achieve the same result.
But I am more curious about how pandas UDFs handle columns of ArrayType.
I did try the following inside a pandas UDF:

lambda row: row.apply(lambda x: itertools.combinations(x, 2))

but it fails with:

java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available

After a few Google searches, this seems to be related to Java 11 and Spark's Arrow support, which may belong in a separate question.
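For reference, a fuller sketch of what I am attempting is below. The decorator form (Spark 3.x type hints), the assumed return schema ArrayType(ArrayType(IntegerType())), and the explicit conversion of each combination to a list are my guesses at what the pandas UDF equivalent should look like; I have not been able to confirm it runs because of the Arrow error above.

import itertools
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, IntegerType

# Assumed return type: an array of [id, id] pairs per row
# (IntegerType or LongType, depending on how ids is typed).
@pandas_udf(ArrayType(ArrayType(IntegerType())))
def pairs_pandas_udf(ids: pd.Series) -> pd.Series:
    # Each element of `ids` arrives as one genre's list/array of ints,
    # so Series.apply processes one array at a time.
    return ids.apply(lambda x: [list(p) for p in itertools.combinations(x, 2)])

df = df.select("genre", pairs_pandas_udf("ids").alias("ids"))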