I have the following DataFrame:
+----------+
|col |
+----------+
|[1, 4, 3] |
|[1, 5, 11]|
|[1, 3, 3] |
|[1, 4, 3] |
|[1, 6, 3] |
|[1, 1, 3] |
+----------+
What I want is:
+----------+
|col_new |
+----------+
|[3, -1] |
|[4, 6] |
|[2, 0] |
|[3, -1] |
|[5, -3] |
|[0, 2] |
+----------+
=> Diff operator arr[n+1] - arr[n]
And I don't know how I should do it.
I thought I could do it with a udf? I'm not really familiar with udfs, but here is what I tried:
from pyspark.sql.functions import col, udf

def diff(a):
    return [a[i + 1] - a[i] for i in range(len(a) - 1)]

function = udf(lambda c: diff(c))
df.withColumn("col_new", function(col("col"))).show(20, False)
But of course that didn't work, since I need a list back... and I'd like to use the power of the DataFrame API. Does someone have a hint for me?
Best Boendal
function = udf(lambda c: diff(c), ArrayType(IntegerType())), which will cause "col_new" to be null. Also: df.withColumn("col_new", function("col")).show(20, False) (remove the extra col).