I have a set of records like the following sample

+---------+-------------+----------+
|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014|    MH43AJ411|  20000000|
| 10003014|    MH43AJ411|  20000001|
| 10003015|   MH12GZ3392|  20000002|
+---------+-------------+----------+

I want to parse it into JSON, and it should look like this:

{
    "ACCOUNTNO":10003014,
    "VEHICLE": [
        { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000000},
        { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000001}
    ],
    "ACCOUNTNO":10003015,
    "VEHICLE": [
        { "VEHICLENUMBER":"MH12GZ3392", "CUSTOMERID":20000002}
    ]
}

I have written the following program but failed to achieve this output.

package com.report.pack1.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._


object sqltojson {

  def main(args:Array[String]) {
    System.setProperty("hadoop.home.dir", "C:/winutil/")
    val conf = new SparkConf().setAppName("SQLtoJSON").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._      
    val jdbcSqlConnStr = "jdbc:sqlserver://192.168.70.88;databaseName=ISSUER;user=bhaskar;password=welcome123;"      
    val jdbcDbTable = "[HISTORY].[TP_CUSTOMER_PREPAIDACCOUNTS]"
    val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> jdbcSqlConnStr,"dbtable" -> jdbcDbTable)).load()
    jdbcDF.registerTempTable("tp_customer_account")
    val res01 = sqlContext.sql("SELECT ACCOUNTNO, VEHICLENUMBER, CUSTOMERID FROM tp_customer_account GROUP BY ACCOUNTNO, VEHICLENUMBER, CUSTOMERID ORDER BY ACCOUNTNO ")
    res01.coalesce(1).write.json("D:/res01.json")      
  }
}

How can I serialize the data in the given format? Thanks in advance!

1 Answer

You can use struct and groupBy to get your desired result. Below is the code for the same; I have commented it where required.

val df = Seq((10003014,"MH43AJ411",20000000),
  (10003014,"MH43AJ411",20000001),
  (10003015,"MH12GZ3392",20000002)
).toDF("ACCOUNTNO","VEHICLENUMBER","CUSTOMERID")

df.show
//output
//+---------+-------------+----------+
//|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
//+---------+-------------+----------+
//| 10003014|    MH43AJ411|  20000000|
//| 10003014|    MH43AJ411|  20000001|
//| 10003015|   MH12GZ3392|  20000002|
//+---------+-------------+----------+

import org.apache.spark.sql.functions.{struct, collect_list}

//create a struct column, then group by the ACCOUNTNO column, and finally convert the DF to JSON
df.withColumn("VEHICLE", struct("VEHICLENUMBER", "CUSTOMERID")).
  select("VEHICLE", "ACCOUNTNO"). //only select required columns
  groupBy("ACCOUNTNO").
  agg(collect_list("VEHICLE").as("VEHICLE")). //for each group, collect the vehicles into a list
  toJSON. //convert to JSON
  show(false)

//output
//+------------------------------------------------------------------------------------------------------------------------------------------+
//|value                                                                                                                                     |
//+------------------------------------------------------------------------------------------------------------------------------------------+
//|{"ACCOUNTNO":10003014,"VEHICLE":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}]}|
//|{"ACCOUNTNO":10003015,"VEHICLE":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}]}                                                   |
//+------------------------------------------------------------------------------------------------------------------------------------------+

You can also write this DataFrame to a file using the same statement you used in your question.
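A caveat when writing to a file: `toJSON` produces a single string column named `value`, so writing that out encodes the JSON a second time and yields escaped quotes. A sketch that avoids this is to write the aggregated DataFrame directly, since `write.json` already serializes each row as a JSON object:

```scala
import org.apache.spark.sql.functions.{struct, collect_list}

// Write the aggregated DataFrame directly; write.json serializes each row
// as a JSON object, so there is no need to call toJSON first (which would
// wrap everything in an escaped "value" string column).
df.withColumn("VEHICLE", struct("VEHICLENUMBER", "CUSTOMERID"))
  .groupBy("ACCOUNTNO")
  .agg(collect_list("VEHICLE").as("VEHICLE"))
  .coalesce(1)
  .write.json("D:/res01.json")
```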


6 Comments

Ok. Thanks. But the output is coming out like this: {"value":"{\"ACCOUNTNO\":10003200,\"VEHICLE\":[{\"VEHICLENUMBER\":\"MH04FP4254\",\"CUSTOMERID\":20000287}]}"}
Why the \ characters? The second thing is that many VEHICLENUMBER entries inside the list have not been combined, because many VEHICLENUMBER values are duplicated.
The data comes from a table in a remote SQL Server; of course that one table contains over 3 million records, and yes it has multiple rows like this. That's why I asked whether I should add the fields after GROUP BY. If yes, I am still getting one VEHICLENUMBER multiple times. I actually want the output you have shown in your answer. Since the table contains duplicate data, should I use DISTINCT or something similar? Please help me, my dear friend.
The data/table you are using is not actually my input. I think you have overlooked my Scala code. The table I showed in my question is the result of a SQL query over a table of more than 3 million records in a remote SQL Server; I included it for better understanding.
Are you there? I need to use groupBy inside the fields of the list.
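Regarding the duplicate-VEHICLENUMBER issue raised in these comments, a minimal sketch (assuming exact duplicate rows should simply be dropped) is to deduplicate before grouping, or to use collect_set instead of collect_list so each distinct vehicle struct appears only once per account; the output path here is hypothetical:

```scala
import org.apache.spark.sql.functions.{struct, collect_set}

// collect_set keeps only distinct (VEHICLENUMBER, CUSTOMERID) structs per account;
// the dropDuplicates on all three columns beforehand has the same effect for exact duplicates.
df.dropDuplicates("ACCOUNTNO", "VEHICLENUMBER", "CUSTOMERID")
  .withColumn("VEHICLE", struct("VEHICLENUMBER", "CUSTOMERID"))
  .groupBy("ACCOUNTNO")
  .agg(collect_set("VEHICLE").as("VEHICLE"))
  .coalesce(1)
  .write.json("D:/res01_dedup.json") // hypothetical path
```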
