
I would like to create a new column on a DataFrame by applying a function to an ArrayType column.

Something like this:

df = df.withColumn(s"max_$colname", max(col(colname)))

where each row of that column holds an array of values.

The functions in spark.sql.functions appear to operate on whole columns only, not on the elements of an array within a row.

  • What kind of function do you want to apply? Commented Mar 24, 2018 at 17:11
  • Any of the standard summary statistics: min, max, count, mean, variance, etc. Commented Mar 24, 2018 at 18:17

1 Answer


You can apply a user-defined function (UDF) to the array column.

1. DataFrame

+------------------+
|               arr|
+------------------+
|   [1, 2, 3, 4, 5]|
|[4, 5, 6, 7, 8, 9]|
+------------------+

2. Creating the UDF

import org.apache.spark.sql.functions.udf

// Spark passes ArrayType columns to Scala UDFs as a Seq
def max(arr: Seq[Int]): Int = arr.max
val maxUDF = udf(max(_: Seq[Int]))

3. Applying the UDF in a query

df.withColumn("arrMax", maxUDF(df("arr"))).show()

4. Result

+------------------+------+
|               arr|arrMax|
+------------------+------+
|   [1, 2, 3, 4, 5]|     5|
|[4, 5, 6, 7, 8, 9]|     9|
+------------------+------+
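
If you are on Spark 2.4 or later, this particular case no longer needs a UDF: the built-in array_max (and its sibling array_min) operate on array columns directly. A minimal sketch:

import org.apache.spark.sql.functions.array_max

// Spark 2.4+ built-in: computes the maximum element of each array
df.withColumn("arrMax", array_max(df("arr"))).show()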

1 Comment

I wrote code to find the max of an array; in the same way you can write the logic for any other operation on the array.
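
For example, the other summary statistics mentioned in the question comments follow the same UDF pattern. A minimal sketch (meanUDF and countUDF are illustrative names, not part of the original answer; the mean assumes non-empty arrays):

import org.apache.spark.sql.functions.udf

// Mean of the array elements (assumes the array is non-empty)
val meanUDF = udf((arr: Seq[Int]) => arr.sum.toDouble / arr.size)
// Number of elements in each array
val countUDF = udf((arr: Seq[Int]) => arr.size)

df.withColumn("arrMean", meanUDF(df("arr")))
  .withColumn("arrCount", countUDF(df("arr")))
  .show()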
