Aggregate array type in Spark Dataframe

Question

I have a DataFrame orders:

+-----------------+-----------+--------------+
|               Id|    Order  |        Gender|
+-----------------+-----------+--------------+
|             1622|[101330001]|          Male|
|             1622|   [147678]|          Male|
|             3837|  [1710544]|          Male|
+-----------------+-----------+--------------+

I want to group it by Id and Gender and then aggregate the orders. I am using the org.apache.spark.sql.functions package, and my code looks like this:

DataFrame group = orders.withColumn("orders", col("order"))
                .groupBy(col("Id"), col("Gender"))
                .agg(collect_list("orders"));

However, since the Order column is of type array, I get this exception, because a primitive type is expected:

User class threw exception: org.apache.spark.sql.AnalysisException: No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but array<string> was passed as parameter 1

I have looked through the package, and there are sort functions for arrays but no aggregate functions. Any idea how to do this? Thanks.

Shivansh · Accepted Answer · 2016-06-30 08:25:41Z

In this case you can define your own function and register it as a UDF:

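// Placeholder: a function from the column's type T to the desired type U
val userDefinedFunction = ???
// Type parameters are udf[ReturnType, InputType]
val udfFunctionName = udf[U, T](userDefinedFunction)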

Then, instead of aggregating the array column directly, pass it through that function inside withColumn so that it is converted to a primitive type, and aggregate the resulting column.

Something like this:

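// Spark hands array columns to Scala UDFs as Seq, and the error shows array<string>
val dataF: Seq[String] => String = _.head

val dataUDF = udf[String, Seq[String]](dataF)

// Replace each array with its first element, then collect per (Id, Gender)
val group = orders.withColumn("orders", dataUDF(col("order")))
                .groupBy(col("Id"), col("Gender"))
                .agg(collect_list("orders"))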

I hope it works!
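Note that the UDF approach keeps only the first element of each array. As a possible alternative, here is a minimal sketch that assumes Spark 2.4 or later, where collect_list is a native aggregate that accepts array columns and flatten merges the nested arrays. The error in the question suggests an older, Hive-backed version, so this may not apply to the original setup. The object name, the local master setting, and the sample rows (copied from the table in the question) are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, flatten}

// Hypothetical driver object; assumes Spark 2.4+ on the classpath
object AggregateArrays {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local run for illustration
      .appName("aggregate-arrays")
      .getOrCreate()
    import spark.implicits._

    // Sample rows mirroring the table in the question (Order is array<string>)
    val orders = Seq(
      (1622, Seq("101330001"), "Male"),
      (1622, Seq("147678"), "Male"),
      (3837, Seq("1710544"), "Male")
    ).toDF("Id", "Order", "Gender")

    // collect_list gathers the arrays of each group into array<array<string>>;
    // flatten then merges them into a single array<string> per (Id, Gender)
    val grouped = orders
      .groupBy(col("Id"), col("Gender"))
      .agg(flatten(collect_list(col("Order"))).as("Orders"))

    grouped.show(false)
    spark.stop()
  }
}
```

The order of elements inside each collected array is not guaranteed, since it depends on how rows arrive after the shuffle. If the order matters, sort the result explicitly.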
