
I have a JSON file which looks like this:

{"name":"john", "food":"tomato", "weight": 1}
{"name":"john", "food":"carrot", "weight": 4}
{"name":"bill", "food":"apple", "weight": 1}
{"name":"john", "food":"tomato", "weight": 2}
{"name":"bill", "food":"taco", "weight": 2}
{"name":"bill", "food":"taco", "weight": 4}

I need to create a new JSON like this:

   [
     {"name":"john",
      "buy": [{"tomato": 3},{"carrot": 4}]
     },
     {"name":"bill",
      "buy": [{"apple": 1},{"taco": 6}]
     }
   ]

This is my DataFrame:

val df = Seq(
  ("john", "tomato", 1),
  ("john", "carrot", 4),
  ("bill", "apple", 1),
  ("john", "tomato", 2),
  ("bill", "taco", 2),
  ("bill", "taco", 4)            
).toDF("name", "food", "weight")

How can I get a DataFrame with the final structure? groupBy and agg give me the wrong structure:

import org.apache.spark.sql.functions._
df.groupBy("name", "food").agg(sum("weight").as("weight"))
  .groupBy("name").agg(collect_list(struct("food", "weight")).as("acc"))

+----+------------------------+
|name|acc                     |
+----+------------------------+
|john|[[carrot,4], [tomato,3]]|
|bill|[[taco,6], [apple,1]]   |
+----+------------------------+

{"name":"john","acc":[{"food":"carrot","weight":4},{"food":"tomato","weight":3}]}
{"name":"bill","acc":[{"food":"taco","weight":6},{"food":"apple","weight":1}]}

Please point me in the right direction to solve this.

2 Answers


You can always convert the values manually: iterate over the Rows, assemble the food-weight pairs, and then convert them to a Map:

import org.apache.spark.sql.Row
import spark.implicits._   // needed for .map on a Dataset and .toDF

val step1 = df.groupBy("name", "food").agg(sum("weight").as("weight")).
    groupBy("name").agg(collect_list(struct("food", "weight")).as("buy"))
// Convert each row's array of (food, weight) structs into a Scala Map
val result = step1.map(row =>
    (row.getAs[String]("name"), row.getAs[Seq[Row]]("buy").map(pair =>
        pair.getAs[String]("food") -> pair.getAs[Long]("weight")).toMap)
    ).toDF("name", "buy")
result.toJSON.show(false)

+---------------------------------------------+
|value                                        |
+---------------------------------------------+
|{"name":"john","buy":{"carrot":4,"tomato":3}}|
|{"name":"bill","buy":{"taco":6,"apple":1}}   |
+---------------------------------------------+
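The per-row conversion above can be sketched with plain Scala collections (hypothetical sample data, no Spark session needed) to show the grouping and summing logic that the map over Rows performs:

```scala
// Hypothetical sample rows mirroring the DataFrame: (name, food, weight)
val rows = Seq(
  ("john", "tomato", 1), ("john", "carrot", 4), ("bill", "apple", 1),
  ("john", "tomato", 2), ("bill", "taco", 2), ("bill", "taco", 4))

// Group by name, then by food, summing weights: the same shape the
// Spark code produces, i.e. Map(name -> Map(food -> totalWeight))
val buy: Map[String, Map[String, Int]] = rows.groupBy(_._1).map {
  case (name, rs) =>
    name -> rs.groupBy(_._2).map { case (food, fs) => food -> fs.map(_._3).sum }
}
```

This is only a sketch of the aggregation; in the answer above, Spark performs the first level of grouping (`groupBy("name")`) and the `.map` closure performs the struct-to-Map conversion per row.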

1 Comment

Looks great. I thought about map, but didn't really understand how exactly. I will check and update. Thanks!

You can achieve the required JSON format by using string-replace techniques.

udf way

A udf function works on primitive data types, so String.replace can be used to strip the "food" and "weight" keys from the final DataFrame's JSON:

import org.apache.spark.sql.functions._
def replaceUdf = udf((json: String) => json.replace("\"food\":", "").replace("\"weight\":", ""))

val temp = df.groupBy("name", "food").agg(sum("weight").as("weight"))
  .groupBy("name").agg(collect_list(struct(col("food"), col("weight"))).as("buy"))
  .toJSON.withColumn("value", replaceUdf(col("value")))

You should have the output DataFrame as

+-------------------------------------------------+
|value                                            |
+-------------------------------------------------+
|{"name":"john","buy":[{"carrot",4},{"tomato",3}]}|
|{"name":"bill","buy":[{"taco",6},{"apple",1}]}   |
+-------------------------------------------------+

regexp_replace function

The built-in regexp_replace function can be used to get the same output as well:

val temp = df.groupBy("name", "food").agg(sum("weight").as("weight"))
  .groupBy("name").agg(collect_list(struct(col("food"), col("weight"))).as("buy"))
  .toJSON.withColumn("value", regexp_replace(regexp_replace(col("value"), "\"food\":", ""), "\"weight\":", ""))
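The effect of the two replacements can be sketched on a single JSON line with plain Scala (the sample string is assumed to match what `toJSON` produces; `String.replaceAll` mirrors what `regexp_replace` does per row):

```scala
// One JSON line as produced by toJSON (assumed sample)
val json = """{"name":"john","buy":[{"food":"carrot","weight":4},{"food":"tomato","weight":3}]}"""
// Strip the "food": and "weight": keys, exactly as the two regexp_replace calls do
val out = json.replaceAll("\"food\":", "").replaceAll("\"weight\":", "")
```

Note that the result matches the requested format (e.g. `{"carrot",4}`), but that format is not strictly valid JSON, so downstream parsers may reject it.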

