
I am having trouble solving the following problem. Basically, I want to find on which date a particular item (item_code) was sold at maximum and minimum volume.

Input DataFrame

item_code, sold_date, price, volume
101,      10-12-2017, 20,    500
101,      11-12-2017, 20,    400
201,      10-12-2017, 50,    200
201,      13-12-2017, 51,    300

Expected output

Find the max and min volume along with the sold date. I want this solution without using any lambda operations.


df.groupBy("item_code")agg(min("volume"),max("volume"))

The above gets me the max and min of volume, but I want them along with their respective dates.

I tried my best with a UDF but could not crack it. Any help is highly appreciated.

  • Please try to post text samples instead of images. Thanks. Commented Sep 4, 2017 at 7:08
  • Thanks. Updated my post @philantrovert Commented Sep 4, 2017 at 7:13
  • It didn't help me. I want to know on which sold_date the volume was max/min for a given item_code. first() returns the same date for all my results. Commented Sep 4, 2017 at 7:24
  • In the groupBy clause, after grouping there will be a list of dates, so you must choose between them with an aggregate function. Which aggregate function do you want to use? Take for example another row for id 101: what date should be chosen? Commented Sep 4, 2017 at 7:26
  • What does "along with respective date" mean? What should the output be if you add the following rows: 101, 9-12-2017, 20, 500 and 101, 6-12-2017, 20, 500? Commented Sep 4, 2017 at 7:54

2 Answers


The final output you desire needs a fairly involved process. You can use the following approach.

Given the input dataframe as

+---------+----------+-----+------+
|item_code|sold_date |price|volume|
+---------+----------+-----+------+
|101      |10-12-2017|20   |500   |
|101      |11-12-2017|20   |400   |
|201      |10-12-2017|50   |200   |
|201      |13-12-2017|51   |300   |
+---------+----------+-----+------+

You can use the following code

import org.apache.spark.sql.functions._
import spark.implicits._   // needed for the $"colName" syntax outside the shell

// Per-item min and max volume, without the dates yet
val tempDF = df.groupBy("item_code").agg(min("volume").as("min"), max("volume").as("max"))

// Join back to the original rows twice to recover the dates of the min and max volumes
tempDF.as("t2").join(df.as("t1"), col("t1.item_code") === col("t2.item_code") && col("t1.volume") === col("t2.min"), "left")
  .select($"t2.item_code", $"t2.max", concat_ws(",", $"t2.item_code", $"t2.min", $"t1.sold_date").as("min"))
  .join(df.as("t3"), col("t3.item_code") === col("t2.item_code") && col("t3.volume") === col("t2.max"), "left")
  .select($"min", concat_ws(",", $"t3.item_code", $"t2.max", $"t3.sold_date").as("max"))
  .show(false)

which gives you the dataframe you want:

+------------------+------------------+
|min               |max               |
+------------------+------------------+
|101,400,11-12-2017|101,500,10-12-2017|
|201,200,10-12-2017|201,300,13-12-2017|
+------------------+------------------+

2 Comments

Seems to be quite an expensive process... it takes a lot of time even on a 200 MB dataset.
Joins are always expensive :)

The best approach here is to create a new index (i.e. a column) on the DataFrame by concatenating the columns required for sorting. Implement a smart sorting on the String-based index so that the results still sort numerically while you carry along the date, and whatever else you need to retrieve, as part of the query (see the sketch below).

That way there is no need for JOINs.
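
A minimal sketch of that idea, assuming volumes are non-negative integers that fit in ten digits; the vol_key column name and the padding width are illustrative, not from the original answer:

import org.apache.spark.sql.functions._

// Sortable string key: zero-padded volume plus the date we want to carry along.
// Padding to 10 digits (an assumption) makes lexicographic order match numeric order.
val keyed = df.withColumn(
  "vol_key",
  concat_ws(",", lpad(col("volume").cast("string"), 10, "0"), col("sold_date"))
)

// min/max on the key pick the whole "volume,date" pair per item -- no join needed.
keyed.groupBy("item_code")
  .agg(min("vol_key").as("min"), max("vol_key").as("max"))
  .show(false)

Because the volume is zero-padded, min and max on the string key select the same rows a numeric min/max on volume would, while the date rides along inside the key; split can recover the individual values afterwards if needed.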
