
I have a DataFrame with the following schema loaded in Spark:

email, first_name, last_name, order_id

How can I group it by email, count the records in each group, and return a DataFrame with this schema:

email, first_name, last_name, order_count

  • You'll need to perform a traditional SQL group by on email with count(*), then join on the same table to fetch the columns you need. Commented Jan 25, 2016 at 15:46

1 Answer


This is a way to do it in Scala:

val df = sc.parallelize(Seq(
  ("a","b","c",1),
  ("a","b","c",2),
  ("x","xb","xc",3),
  ("y","yb","yc",4),
  ("x","xb","xc",5)
)).toDF("email","first_name","last_name","order_id")

df.registerTempTable("df")

sqlContext.sql("select * from (select email, count(*) as order_count from df group by email ) d1 join df d2 on d1.email = d2.email")

In Java, assuming you have already created your DataFrame and registered the temp table, it's actually the same code:

DataFrame results = sqlContext.sql("select * from (select email, count(*) as order_count from df group by email ) d1 join df d2 on d1.email = d2.email");

Nevertheless, even though this solution is straightforward, I consider it bad practice because your code will be hard to maintain and evolve. A cleaner solution would be:

// groupBy().count() names the aggregate column "count", so rename it to match the asked-for schema
DataFrame email_count = df.groupBy("email").count().withColumnRenamed("count", "order_count");
DataFrame results2 = email_count.join(df, email_count.col("email").equalTo(df.col("email"))).drop(df.col("email"));
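
The same clean approach in Scala, trimmed to exactly the requested schema; a sketch, assuming Spark 1.4+ (for join(df, "email") and dropDuplicates) and that first_name and last_name are constant per email:

// rename the generated "count" column to match the requested schema
val emailCount = df.groupBy("email").count().withColumnRenamed("count", "order_count")

// join on the shared email column, keep only the requested columns,
// and collapse the duplicate rows the join produces (one per original order)
val results = emailCount.join(df, "email")
  .select("email", "first_name", "last_name", "order_count")
  .dropDuplicates()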

1 Comment

This seems like the correct answer. One question, though: using groupBy on an RDD is allegedly slow due to data shuffling. Will I incur the same performance penalty with groupBy on a DataFrame?
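
For what it's worth: groupByKey on an RDD is slow because every value is shuffled across the network. DataFrame groupBy().count() behaves differently: Catalyst plans it as a partial (map-side) aggregation followed by a final aggregation after the shuffle, so only per-partition counts move over the network. You can inspect the physical plan yourself; a sketch, and the exact plan text varies by Spark version:

// expect two aggregate stages (partial, then final) around the Exchange (shuffle)
df.groupBy("email").count().explain()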
