I am working on PySpark code. My dataframe is:

+-------+--------+--------+--------+--------+
|element|collect1|collect2|collect3|collect4|
+-------+--------+--------+--------+--------+
|A1     |   1.02 |    2.6 |   5.21 |    3.6 |
|A2     |   1.61 |   2.42 |   4.88 |   6.08 |
|B1     |   1.66 |   2.01 |    5.0 |    4.3 |
|C2     |   2.01 |   1.85 |   3.42 |   4.44 |
+-------+--------+--------+--------+--------+

I need to find the mean and stddev for each element across all the collectX columns. The final result should be as below.

+-------+--------+--------+
|element|mean    |stddev  |
+-------+--------+--------+
|A1     |   3.11 |   1.76 |
|A2     |   3.75 |   2.09 |
|B1     |   3.24 |   1.66 |
|C2     |   2.93 |   1.23 |
+-------+--------+--------+

The code below, df.groupBy("element").mean().show(), breaks the mean down by individual column. Instead of doing this for each column, is it possible to roll up across all the columns?

+-------+-------------+-------------+-------------+-------------+
|element|avg(collect1)|avg(collect2)|avg(collect3)|avg(collect4)|
+-------+-------------+-------------+-------------+-------------+
|A1     |   1.02      |   2.6       |   5.21      |    3.6      |
|A2     |   1.61      |   2.42      |   4.88      |   6.08      |
|B1     |   1.66      |   2.01      |    5.0      |    4.3      |
|C2     |   2.01      |   1.85      |   3.42      |   4.44      |
+-------+-------------+-------------+-------------+-------------+

I tried to make use of the describe function, since it has the complete set of aggregation functions, but the result is still shown per individual column: df.groupBy("element").mean().describe().show()

thanks

2 Answers


Spark allows you to gather all sorts of stats per column, but here you are trying to calculate stats per row. In this case you can hack something together with a udf. Here is an example :D

$ pyspark
>>> from pyspark.sql.types import DoubleType
>>> from pyspark.sql.functions import array, udf
>>>
>>> mean = udf(lambda v: sum(v) / len(v), DoubleType())  # arithmetic mean of an array column
>>> df = sc.parallelize([['A1', 1.02, 2.6, 5.21, 3.6], ['A2', 1.61, 2.42, 4.88, 6.08]]).toDF(['element', 'collect1', 'collect2', 'collect3', 'collect4'])
>>> df.show()
+-------+--------+--------+--------+--------+
|element|collect1|collect2|collect3|collect4|
+-------+--------+--------+--------+--------+
|     A1|    1.02|     2.6|    5.21|     3.6|
|     A2|    1.61|    2.42|    4.88|    6.08|
+-------+--------+--------+--------+--------+
>>> df.select('element', mean(array(df.columns[1:])).alias('mean')).show()
+-------+------+
|element|  mean|
+-------+------+
|     A1|3.1075|
|     A2|3.7475|
+-------+------+
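
The same trick extends to stddev with a second udf; a minimal sketch continuing the session above, assuming the sample (n - 1) definition, which matches the expected output in the question:

>>> import math
>>> from pyspark.sql.functions import round as sround
>>>
>>> def sample_std(v):
...     m = sum(v) / len(v)  # row mean, same as the mean udf above
...     return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))
...
>>> stddev = udf(sample_std, DoubleType())
>>> cols = array(df.columns[1:])  # all collectX columns packed into one array column
>>> df.select('element',
...           sround(mean(cols), 2).alias('mean'),
...           sround(stddev(cols), 2).alias('stddev')).show()
+-------+----+------+
|element|mean|stddev|
+-------+----+------+
|     A1|3.11|  1.76|
|     A2|3.75|  2.09|
+-------+----+------+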



Did you try just adding the columns together and possibly dividing by 4?

SELECT element,
       avg((collect1 + collect2 + collect3 + collect4) / 4),
       stddev((collect1 + collect2 + collect3 + collect4) / 4)
FROM df GROUP BY element

That's not going to do exactly what you want, but you get the idea.
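
A runnable version of that idea in PySpark (a sketch; "df" as the temp view name is my choice, and since each element is a single row here, the per-row mean needs no aggregate at all):

df.createOrReplaceTempView("df")
spark.sql("""
    SELECT element,
           (collect1 + collect2 + collect3 + collect4) / 4 AS mean
    FROM df
""").show()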

Not sure which language you're using, but you can always build the query on the fly if you aren't happy with hardcoding it:

val collectColumns = df.columns.filter(_.startsWith("collect"))
val stmnt = "SELECT avg((" + collectColumns.mkString(" + ") + ") / " + collectColumns.length + ")"

You get the idea.
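
In PySpark you can build the same expression dynamically with column arithmetic instead of string concatenation; a sketch assuming every per-row column starts with "collect" and that the sample (n - 1) stddev is wanted:

from functools import reduce
from operator import add
from pyspark.sql import functions as F

collect_cols = [c for c in df.columns if c.startswith('collect')]
n = len(collect_cols)

# Per-row mean: sum the collect columns and divide by their count.
row_mean = reduce(add, [F.col(c) for c in collect_cols]) / n
# Per-row sample stddev: sqrt of the mean squared deviation (n - 1 denominator).
row_std = F.sqrt(
    reduce(add, [(F.col(c) - row_mean) ** 2 for c in collect_cols]) / (n - 1))

df.select('element',
          F.round(row_mean, 2).alias('mean'),
          F.round(row_std, 2).alias('stddev')).show()

This works for any number of collectX columns, which addresses the comment below.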

1 Comment

Actually, the dataframe may have more or fewer collectX columns, so hardcoding is not the preferred choice.
