Let's say I have the following DataFrame:

[Row(user='bob', values=[0.5, 0.3, 0.2]),
Row(user='bob', values=[0.1, 0.3, 0.6]),
Row(user='bob', values=[0.8, 0.1, 0.1])]
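
For reference, this DataFrame can be built like so (a minimal sketch, assuming an active SparkSession named spark):

from pyspark.sql import Row

df = spark.createDataFrame([
    Row(user='bob', values=[0.5, 0.3, 0.2]),
    Row(user='bob', values=[0.1, 0.3, 0.6]),
    Row(user='bob', values=[0.8, 0.1, 0.1]),
])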

I would like to groupBy user and do something like avg(values), where the average is taken over each index of the values array, like this:

[Row(user='bob', averages=[0.466667, 0.233333, 0.3])]

How can I do this in PySpark?

1 Answer

You can expand the array and compute the average for each index.

Python

from pyspark.sql.functions import array, avg, col

# Number of elements in each array (assumes all rows have the same length)
n = len(df.select("values").first()[0])

# Average each index separately, then reassemble the results into an array
df.groupBy("user").agg(
    array(*[avg(col("values")[i]) for i in range(n)]).alias("averages")
)
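
Applied to the three rows from the question, this should return the expected per-index averages (a quick check; exact floating-point formatting will differ):

result = df.groupBy("user").agg(
    array(*[avg(col("values")[i]) for i in range(n)]).alias("averages")
)
result.first()  # roughly Row(user='bob', averages=[0.4667, 0.2333, 0.3])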

Scala

import spark.implicits._
import org.apache.spark.sql.functions.{avg, size}

val df = Seq(
  ("bob", Seq(0.5, 0.3, 0.2)),
  ("bob", Seq(0.1, 0.3, 0.6))
).toDF("user", "values")

// Number of elements in each array (assumes all rows have the same length)
val n = df.select(size($"values")).as[Int].first
// One column per array index
val values = (0 until n).map(i => $"values"(i))

df.select($"user" +: values: _*).groupBy($"user").avg()

10 Comments

What does the * do in this case? Also, is there a way, as in Pandas, to pass each group to a user-defined function and do the operation there? Thanks.
* is standard Python argument unpacking. No, PySpark doesn't support UDAFs; you can use RDDs directly or define a JVM one.
Thanks! I think RDD makes sense here.
If you want to give RDDs a try, you can use a subset (compute_stats without collect) of this answer; a rough sketch of the idea follows these comments.
@Gevorg Here you are. You may also find stackoverflow.com/q/41731865/1560062 interesting.
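
As discussed in the comments above, the same result can also be computed at the RDD level. This is a rough sketch, not the compute_stats approach from the linked answer: it sums the arrays element-wise per user together with a row count, then divides.

averages = (
    df.rdd
    .map(lambda row: (row["user"], (row["values"], 1)))
    # element-wise sum of the arrays, plus a running row count
    .reduceByKey(lambda a, b: ([x + y for x, y in zip(a[0], b[0])], a[1] + b[1]))
    # divide each summed element by the row count
    .mapValues(lambda acc: [x / acc[1] for x in acc[0]])
)
averages.collect()  # roughly [('bob', [0.4667, 0.2333, 0.3])]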