
I have a DataFrame with the following schema loaded in Spark:

email, first_name, last_name, order_id

How can I group it by email, count the records in each group, and return a DataFrame with this schema:

email, first_name, last_name, order_count

  • You'll need to perform a traditional SQL group by on email with count(*), then join on the same table to fetch the columns you need. Commented Jan 25, 2016 at 15:46

1 Answer


This is a way to do it in Scala:

val df = sc.parallelize(Seq(
  ("a","b","c",1),
  ("a","b","c",2),
  ("x","xb","xc",3),
  ("y","yb","yc",4),
  ("x","xb","xc",5)
)).toDF("email","first_name","last_name","order_id")

df.registerTempTable("df")

sqlContext.sql("select * from (select email, count(*) as order_count from df group by email ) d1 join df d2 on d1.email = d2.email")

In Java, assuming you have already created your DataFrame and registered the temp table, it's actually the same code:

DataFrame results = sqlContext.sql("select * from (select email, count(*) as order_count from df group by email ) d1 join df d2 on d1.email = d2.email");

Nevertheless, even though this solution is straightforward, I consider it bad practice because your code will be hard to maintain and evolve. A cleaner solution would be:

// groupBy().count() names the aggregate column "count", so rename it to match the asked-for schema
DataFrame email_count = df.groupBy("email").count().withColumnRenamed("count", "order_count");
DataFrame results2 = email_count.join(df, email_count.col("email").equalTo(df.col("email"))).drop(df.col("email"));
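
The same clean approach in Scala, trimmed to exactly the requested schema; a sketch, assuming Spark 1.4+ (for join(df, "email") and dropDuplicates) and that first_name and last_name are constant per email:

// rename the generated "count" column to match the requested schema
val emailCount = df.groupBy("email").count().withColumnRenamed("count", "order_count")

// join on the shared email column, keep only the requested columns,
// and collapse the duplicate rows the join produces (one per original order)
val results = emailCount.join(df, "email")
  .select("email", "first_name", "last_name", "order_count")
  .dropDuplicates()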

1 Comment

This seems like the correct answer. One question, though: using groupBy on an RDD is allegedly slow due to data shuffling. Will I incur the same performance penalty with groupBy on a DataFrame?
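
For what it's worth: groupByKey on an RDD is slow because every value is shuffled across the network. DataFrame groupBy().count() behaves differently: Catalyst plans it as a partial (map-side) aggregation followed by a final aggregation after the shuffle, so only per-partition counts move over the network. You can inspect the physical plan yourself; a sketch, and the exact plan text varies by Spark version:

// expect two aggregate stages (partial, then final) around the Exchange (shuffle)
df.groupBy("email").count().explain()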
