
I am trying to create a nested JSON document from my Spark DataFrame, which has data in the following structure.

Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4

The required JSON output, to be produced with Apache Spark and Scala, is in the following format.

[{
    "vendor_name": "Vendor1",
    "count": 10,
    "categories": [{
        "name": "Category 1",
        "count": 4,
        "subCategories": [{
                "name": "Sub Category 1",
                "count": 1
            },
            {
                "name": "Sub Category 2",
                "count": 2
            },
            {
                "name": "Sub Category 3",
                "count": 3
            },
            {
                "name": "Sub Category 4",
                "count": 4
            }
        ]
    }]
}]
• Can you add the code for the dataframe creation, the schema, etc.? Commented Sep 25, 2019 at 10:57

1 Answer

//read the CSV file into a DataFrame
scala> val df = spark.read.format("csv").option("header", "true").load(<input CSV path>)
df: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 4 more fields]

scala> df.show(false)
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor_Name|count|Categories|Category_Count|Subcategory   |Subcategory_Count|
+-----------+-----+----------+--------------+--------------+-----------------+
|Vendor1    |10   |Category 1|4             |Sub Category 1|1                |
|Vendor1    |10   |Category 1|4             |Sub Category 2|2                |
|Vendor1    |10   |Category 1|4             |Sub Category 3|3                |
|Vendor1    |10   |Category 1|4             |Sub Category 4|4                |
+-----------+-----+----------+--------------+--------------+-----------------+
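Note that, read this way (without inferSchema), every column comes back as a string, so the written JSON would carry quoted counts ("count": "10") rather than the numbers shown in the desired output. A minimal optional cast step, assuming the column names from the header above (the later steps would then start from typed instead of df):

//optional (sketch): cast the three count columns to integers
scala> val typed = df.withColumn("count", col("count").cast("int")).withColumn("Category_Count", col("Category_Count").cast("int")).withColumn("Subcategory_Count", col("Subcategory_Count").cast("int"))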

//convert into the desired JSON shape in two roll-ups: first collect the
//sub-categories per category, then collect the categories per vendor
//(spark-shell already imports org.apache.spark.sql.functions._; in a compiled
//application add: import org.apache.spark.sql.functions.{col, collect_list, struct})
scala> val subCats = df.groupBy("Vendor_Name", "count", "Categories", "Category_Count").agg(collect_list(struct(col("Subcategory").alias("name"), col("Subcategory_Count").alias("count"))).alias("subCategories"))
subCats: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 3 more fields]

scala> val df1 = subCats.groupBy("Vendor_Name", "count").agg(collect_list(struct(col("Categories").alias("name"), col("Category_Count").alias("count"), col("subCategories"))).alias("categories"))
df1: org.apache.spark.sql.DataFrame = [Vendor_Name: string, count: string ... 1 more field]

scala> df1.printSchema
root
 |-- Vendor_Name: string (nullable = true)
 |-- count: string (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- count: string (nullable = true)
 |    |    |-- subCategories: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- count: string (nullable = true)
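To sanity-check the nested structure before writing it out, toJSON renders each row as a JSON string:

scala> df1.toJSON.show(false)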

//write df1 out in JSON format
scala> df1.write.format("json").mode("append").save(<output Path>)
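One caveat: Spark's JSON writer produces newline-delimited JSON (one object per line, usually split across several part files), not a single bracketed array like the desired output. If the result is small enough to collect to the driver, one way to build the array form is a sketch like:

scala> val jsonArray = df1.toJSON.collect().mkString("[", ",", "]")

Alternatively, df1.coalesce(1) before the write at least yields a single part file of JSON lines.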

2 Comments

Is there any documentation or tutorial link where I can understand collect_list, struct(), col(), and agg() in more detail?
Thank you so much for providing this; it is working for me. Is it possible to do the same with window functions?
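On both comments: collect_list, struct, and col all live in org.apache.spark.sql.functions, and agg is a method on the grouped DataFrame; the Spark SQL functions scaladoc covers them. As for window functions, collect_list can also be used as a window aggregate, so the same roll-up is possible in principle. A rough sketch, untested and assuming the same column names as above:

scala> import org.apache.spark.sql.expressions.Window

scala> val wSub = Window.partitionBy("Vendor_Name", "count", "Categories", "Category_Count")

scala> val wCat = Window.partitionBy("Vendor_Name", "count")

//collect the sub-categories over each category window, then keep one row per category
scala> val withSubs = df.withColumn("subCategories", collect_list(struct(col("Subcategory").alias("name"), col("Subcategory_Count").alias("count"))).over(wSub)).dropDuplicates("Vendor_Name", "Categories")

//collect the categories over each vendor window, then keep one row per vendor
scala> val df2 = withSubs.withColumn("categories", collect_list(struct(col("Categories").alias("name"), col("Category_Count").alias("count"), col("subCategories"))).over(wCat)).dropDuplicates("Vendor_Name").select("Vendor_Name", "count", "categories")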
