Generic way to Parse Spark DataFrame to JSON Object/Array Using Spray JSON

Question

I'm trying to find a generic way (without using a concrete Scala case class) to convert a Spark DataFrame to a JSON object/array using spray-json or any other library.

I have tried to approach this with spray-json, and my current code looks like this:

import spray.json._
import spray.json.DefaultJsonProtocol._

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

list.show
+---+---+---+---+                                                               
| _1| _2| _3| _4|
+---+---+---+---+
| a1| b1| c1| d1|
| a2| b2| c2| d2|
+---+---+---+---+

val json = list.toJSON.collect.toJson.prettyPrint

println(json)

Current Output:

["{\"_1\":\"a1\",\"_2\":\"b1\",\"_3\":\"c1\",\"_4\":\"d1\"}", "{\"_1\":\"a2\",\"_2\":\"b2\",\"_3\":\"c2\",\"_4\":\"d2\"}"]

Expected Output:

[{
    "_1": "a1",
    "_2": "b1",
    "_3": "c1",
    "_4": "d1"
}, {
    "_1": "a2",
    "_2": "b2",
    "_3": "c2",
    "_4": "d2"
}]

Please suggest how to get the expected output without using a concrete Scala case class, using either spray-json or any other library.

Your current implementation is a List of tuples; maybe you created it just as an example. In your final implementation, will you have a List[List[String]], or a List[(String, String, String, String)] as in your example? The format will make a difference to the implementation.
– Aman Sehgal, Commented Oct 11, 2019 at 0:53
Yes, I created this as an example. In the final implementation, DF.collect will return something like "Array([a1,b1,c1,d1], [a2,b2,c2,d2])".
– Manoj - GT, Commented Oct 11, 2019 at 1:45

Aman Sehgal · Accepted Answer · 2019-10-11 06:27:09Z


This builds on an earlier post, which already contains most of the answer.

You're halfway there. By adding custom formatting code, you should be able to get the output in the desired format.

import scala.util.parsing.json.JSON
import scala.util.parsing.json.JSONArray
import scala.util.parsing.json.JSONFormat
import scala.util.parsing.json.JSONObject
import scala.util.parsing.json.JSONType

// Thanks to Senia for providing this in her solution
def format(t: Any, i: Int = 0): String = t match {
  case o: JSONObject =>
    o.obj.map{ case (k, v) =>
      "  "*(i+1) + JSONFormat.defaultFormatter(k) + ": " + format(v, i+1)
    }.mkString("{\n", ",\n", "\n" + "  "*i + "}")

  case a: JSONArray =>
    a.list.map{
      e => "  "*(i+1) + format(e, i+1)
    }.mkString("[\n", ",\n", "\n" + "  "*i + "]")

  case _ => JSONFormat defaultFormatter t
}

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

// Create array
val jsonArray = list.toJSON.collect()

val jsonFormattedArray = jsonArray.map(j => format(JSON.parseRaw(j).get))

jsonFormattedArray: Array[String] =
Array({
  "_1": "a1",
  "_2": "b1",
  "_3": "c1",
  "_4": "d1"
}, {
  "_1": "a2",
  "_2": "b2",
  "_3": "c2",
  "_4": "d2"
})

Convert the formatted JSON to a single string:

scala> jsonFormattedArray.toList.mkString(",")

res2: String =
{
  "_1": "a1",
  "_2": "b1",
  "_3": "c1",
  "_4": "d1"
},{
  "_1": "a2",
  "_2": "b2",
  "_3": "c2",
  "_4": "d2"
}
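
Since the question asked about spray-json specifically, here is a minimal sketch of the same idea with that library. The original attempt called .toJson on an Array[String], so every row was encoded as a JSON string literal, which is where the escaped quotes came from. Parsing each row with parseJson first and wrapping the results in a JsArray avoids that. The sample `rows` array below is an assumption that stands in for list.toJSON.collect so the snippet runs without Spark; it also sidesteps scala.util.parsing.json, which has been deprecated since Scala 2.11.

```scala
import spray.json._

// Stand-in for list.toJSON.collect (assumed sample data, no Spark needed here)
val rows: Array[String] = Array(
  """{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"}""",
  """{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}"""
)

// parseJson turns each row into a JsValue tree rather than a JsString,
// so the resulting array holds JSON objects, not escaped strings.
val jsonArray: JsArray = JsArray(rows.map(_.parseJson).toVector)

// prettyPrint renders the array of objects in indented form
println(jsonArray.prettyPrint)
```

This re-parses every row on the driver, so it is only worth doing when pretty-printed output or further manipulation of the JSON tree is actually needed.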



2 Comments

Manoj - GT Over a year ago

The performance of .toList.mkString(",") is very bad. A more optimized approach should be tried if possible.

Manoj - GT Over a year ago

In the end, jsonArray.mkString("[", ",", "]") did the trick. Thanks for your detailed answer.

Manoj - GT · Accepted Answer · 2019-10-13 11:38:33Z

After trying various approaches with various libraries, I finally settled on the simple approach below.

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

val jsonArray = list.toJSON.collect
/*jsonArray: Array[String] = Array({"_1":"a1","_2":"b1","_3":"c1","_4":"d1"}, {"_1":"a2","_2":"b2","_3":"c2","_4":"d2"})*/

val finalOutput = jsonArray.mkString("[", ",", "]")

/*finalOutput: String = [{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"},{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}]*/

With this approach, there is no need to use spray-json or any other library.

Special thanks to @Aman Sehgal. His answer helped me to come up with this optimal solution.

Note: I have yet to analyze the performance of this approach on a large DF, but in some basic performance testing it looks as fast as the ".toJson.prettyPrint" of "spray-json".
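
The joining step can be tried without a Spark session. The sketch below is a minimal illustration, not a benchmark: `rows` is assumed sample data standing in for list.toJSON.collect, and `toJsonArray` is a hypothetical helper name. It works because every element returned by toJSON is already a complete JSON object, so placing commas between the elements and brackets around them yields a JSON array without any parsing.

```scala
// Stand-in for list.toJSON.collect (assumed sample data, no Spark needed here)
val rows: Array[String] = Array(
  """{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"}""",
  """{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}"""
)

// Hypothetical helper: joins already-serialized JSON objects into one array string.
// It accepts an Iterator so the rows do not have to sit in a second collection.
def toJsonArray(rows: Iterator[String]): String =
  rows.mkString("[", ",", "]")

val finalOutput = toJsonArray(rows.iterator)
println(finalOutput)
// prints [{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"},{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}]
```

Keep in mind that collect brings every row to the driver, so for a large DF the driver's memory, rather than the string join itself, is likely to be the limiting factor.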
