
I have a JSON file which looks like this:

{"name":"john", "food":"tomato", "weight": 1}
{"name":"john", "food":"carrot", "weight": 4}
{"name":"bill", "food":"apple", "weight": 1}
{"name":"john", "food":"tomato", "weight": 2}
{"name":"bill", "food":"taco", "weight": 2}
{"name":"bill", "food":"taco", "weight": 4}

I need to create a new JSON like this:

   [
     {"name":"john",
      "buy": [{"tomato": 3},{"carrot": 4}]
     },
     {"name":"bill",
      "buy": [{"apple": 1},{"taco": 6}]
     }
   ]

This is my DataFrame:

val df = Seq(
  ("john", "tomato", 1),
  ("john", "carrot", 4),
  ("bill", "apple", 1),
  ("john", "tomato", 2),
  ("bill", "taco", 2),
  ("bill", "taco", 4)            
).toDF("name", "food", "weight")

How can I get a DataFrame with the final structure? groupBy and agg give me the wrong structure:

import org.apache.spark.sql.functions._
df.groupBy("name", "food").agg(sum("weight").as("weight"))
  .groupBy("name").agg(collect_list(struct("food", "weight")).as("acc"))

+----+------------------------+
|name|acc                     |
+----+------------------------+
|john|[[carrot,4], [tomato,3]]|
|bill|[[taco,6], [apple,1]]   |
+----+------------------------+

{"name":"john","acc":[{"food":"carrot","weight":4},{"food":"tomato","weight":3}]}
{"name":"bill","acc":[{"food":"taco","weight":6},{"food":"apple","weight":1}]}

Please point me in the right direction to solve this.

2 Answers


You can always convert the values manually: iterate over the Rows, assemble the food-weight pairs, and then convert them to a Map:

import org.apache.spark.sql.Row
import spark.implicits._   // needed for .map on a Dataset and .toDF

val step1 = df.groupBy("name", "food").agg(sum("weight").as("weight")).
    groupBy("name").agg(collect_list(struct("food", "weight")).as("buy"))
// Convert each row's array of (food, weight) structs into a Scala Map
val result = step1.map(row =>
    (row.getAs[String]("name"), row.getAs[Seq[Row]]("buy").map(pair =>
        pair.getAs[String]("food") -> pair.getAs[Long]("weight")).toMap)
    ).toDF("name", "buy")
result.toJSON.show(false)

+---------------------------------------------+
|value                                        |
+---------------------------------------------+
|{"name":"john","buy":{"carrot":4,"tomato":3}}|
|{"name":"bill","buy":{"taco":6,"apple":1}}   |
+---------------------------------------------+
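The per-row conversion above can be sketched with plain Scala collections (hypothetical sample data, no Spark session needed) to show the grouping and summing logic that the map over Rows performs:

```scala
// Hypothetical sample rows mirroring the DataFrame: (name, food, weight)
val rows = Seq(
  ("john", "tomato", 1), ("john", "carrot", 4), ("bill", "apple", 1),
  ("john", "tomato", 2), ("bill", "taco", 2), ("bill", "taco", 4))

// Group by name, then by food, summing weights: the same shape the
// Spark code produces, i.e. Map(name -> Map(food -> totalWeight))
val buy: Map[String, Map[String, Int]] = rows.groupBy(_._1).map {
  case (name, rs) =>
    name -> rs.groupBy(_._2).map { case (food, fs) => food -> fs.map(_._3).sum }
}
```

This is only a sketch of the aggregation; in the answer above, Spark performs the first level of grouping (`groupBy("name")`) and the `.map` closure performs the struct-to-Map conversion per row.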

1 Comment

Looks great. I thought about map, but didn't really understand how exactly. I will check and update. Thanks!

You can achieve the required JSON format by using string-replace techniques.

udf way

A udf function works on primitive data types, so String.replace can be used to strip the "food" and "weight" keys from the final DataFrame's JSON:

import org.apache.spark.sql.functions._
def replaceUdf = udf((json: String) => json.replace("\"food\":", "").replace("\"weight\":", ""))

val temp = df.groupBy("name", "food").agg(sum("weight").as("weight"))
  .groupBy("name").agg(collect_list(struct(col("food"), col("weight"))).as("buy"))
  .toJSON.withColumn("value", replaceUdf(col("value")))

You should have the output DataFrame as

+-------------------------------------------------+
|value                                            |
+-------------------------------------------------+
|{"name":"john","buy":[{"carrot",4},{"tomato",3}]}|
|{"name":"bill","buy":[{"taco",6},{"apple",1}]}   |
+-------------------------------------------------+

regexp_replace function

The built-in regexp_replace function can be used to get the same output as well:

val temp = df.groupBy("name", "food").agg(sum("weight").as("weight"))
  .groupBy("name").agg(collect_list(struct(col("food"), col("weight"))).as("buy"))
  .toJSON.withColumn("value", regexp_replace(regexp_replace(col("value"), "\"food\":", ""), "\"weight\":", ""))
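The effect of the two replacements can be sketched on a single JSON line with plain Scala (the sample string is assumed to match what `toJSON` produces; `String.replaceAll` mirrors what `regexp_replace` does per row):

```scala
// One JSON line as produced by toJSON (assumed sample)
val json = """{"name":"john","buy":[{"food":"carrot","weight":4},{"food":"tomato","weight":3}]}"""
// Strip the "food": and "weight": keys, exactly as the two regexp_replace calls do
val out = json.replaceAll("\"food\":", "").replaceAll("\"weight\":", "")
```

Note that the result matches the requested format (e.g. `{"carrot",4}`), but that format is not strictly valid JSON, so downstream parsers may reject it.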

