I'm trying to find a generic way (without using a concrete case class in Scala) to convert a Spark DataFrame to a JSON object/array using spray-json or any other library.

I have tried to approach this using spray-json, and my current code looks like this:

import spray.json._
import spray.json.DefaultJsonProtocol._

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

list.show
+---+---+---+---+                                                               
| _1| _2| _3| _4|
+---+---+---+---+
| a1| b1| c1| d1|
| a2| b2| c2| d2|
+---+---+---+---+

val json = list.toJSON.collect.toJson.prettyPrint

println(json)

Current Output:

["{\"_1\":\"a1\",\"_2\":\"b1\",\"_3\":\"c1\",\"_4\":\"d1\"}", "{\"_1\":\"a2\",\"_2\":\"b2\",\"_3\":\"c2\",\"_4\":\"d2\"}"]

Expected Output:

[{
    "_1": "a1",
    "_2": "b1",
    "_3": "c1",
    "_4": "d1"
}, {
    "_1": "a2",
    "_2": "b2",
    "_3": "c2",
    "_4": "d2"
}]

Please suggest how to get the expected output without using a concrete Scala case class, using either spray-json or any other library.

  • Your current implementation is a List of tuples. Maybe you created it just as an example. In your final implementation, are you going to have List[List[String]], or List[(String, String, String, String)] as in your example? The format will make a difference in the implementation. Commented Oct 11, 2019 at 0:53
  • Yes, I created this as an example. In the final implementation, DF.collect will return something like "Array([a1,b1,c1,d1], [a2,b2,c2,d2])". Commented Oct 11, 2019 at 1:45

2 Answers

I took help from an earlier post, which already covers most of what you need.

You're halfway there. By adding custom formatting code, you should be able to get your output in the desired format.

import scala.util.parsing.json.JSON
import scala.util.parsing.json.JSONArray   
import scala.util.parsing.json.JSONFormat   
import scala.util.parsing.json.JSONObject   
import scala.util.parsing.json.JSONType

// Thanks to Senia for providing this in her solution
def format(t: Any, i: Int = 0): String = t match {
  case o: JSONObject =>
    o.obj.map{ case (k, v) =>
      "  "*(i+1) + JSONFormat.defaultFormatter(k) + ": " + format(v, i+1)
    }.mkString("{\n", ",\n", "\n" + "  "*i + "}")

  case a: JSONArray =>
    a.list.map{
      e => "  "*(i+1) + format(e, i+1)
    }.mkString("[\n", ",\n", "\n" + "  "*i + "]")

  case _ => JSONFormat defaultFormatter t
}

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

// Create array
val jsonArray = list.toJSON.collect()

val jsonFormattedArray = jsonArray.map(j => format(JSON.parseRaw(j).get))

res1: Array[String] =
Array({
  "_1": "a1",
  "_2": "b1",
  "_3": "c1",
  "_4": "d1"
}, {
  "_1": "a2",
  "_2": "b2",
  "_3": "c2",
  "_4": "d2"
})

Convert the formatted JSON to a single string:

scala> jsonFormattedArray.toList.mkString(",")

res2: String =
{
  "_1": "a1",
  "_2": "b1",
  "_3": "c1",
  "_4": "d1"
},{
  "_1": "a2",
  "_2": "b2",
  "_3": "c2",
  "_4": "d2"
}
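
If spray-json is already on the classpath, another option might be to parse each row string back into a JsValue before building the array. The escaped strings in the question appear because an Array[String] is serialized as an array of JSON strings. The following is only a minimal sketch: PrettyRows and toPrettyJsonArray are hypothetical names, and the hard-coded Seq stands in for the result of list.toJSON.collect.

```scala
import spray.json._

object PrettyRows {
  // Parse each serialized row into a JsValue so that spray-json treats it as
  // a JSON object instead of an opaque string, then wrap the rows in a JsArray.
  def toPrettyJsonArray(rows: Seq[String]): String =
    JsArray(rows.map(_.parseJson).toVector).prettyPrint

  def main(args: Array[String]): Unit = {
    // Assumption: stand-in for `list.toJSON.collect` from the question.
    val rows = Seq(
      """{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"}""",
      """{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}"""
    )
    println(toPrettyJsonArray(rows))
  }
}
```

This still needs no case class, because the rows are only ever handled as generic JsValue instances.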

2 Comments

The performance of .toList.mkString(",") is very bad. A more optimized approach should be tried if possible.
Finally, jsonArray.mkString("[", ",", "]") did the trick. Thanks for your detailed answer.

After trying various approaches with various libraries, I finally settled on the simple approach below.

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

val jsonArray = list.toJSON.collect
/*jsonArray: Array[String] = Array({"_1":"a1","_2":"b1","_3":"c1","_4":"d1"}, {"_1":"a2","_2":"b2","_3":"c2","_4":"d2"})*/

val finalOutput = jsonArray.mkString("[", ",", "]")

/*finalOutput: String = [{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"},{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}]*/

With this approach, there is no need to use spray-json or any other library.

Special thanks to @Aman Sehgal. His answer helped me to come up with this optimal solution.

Note: I have yet to analyze the performance of this approach on a large DataFrame, but in some basic performance testing it looks about as fast as spray-json's ".toJson.prettyPrint".
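
To try the joining step without a Spark session, it can be isolated in a small helper. This is just a sketch under stated assumptions: JsonArrayJoin and joinAsJsonArray are hypothetical names, and the input is assumed to be strings that are each already a complete JSON document, as toJSON produces.

```scala
object JsonArrayJoin {
  // Each element must already be a complete JSON document; this function only
  // adds the enclosing brackets and the separating commas, with no validation.
  def joinAsJsonArray(rows: Iterable[String]): String =
    rows.mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    // Assumption: stand-in for `list.toJSON.collect`.
    val rows = Seq(
      """{"_1":"a1","_2":"b1"}""",
      """{"_1":"a2","_2":"b2"}"""
    )
    println(joinAsJsonArray(rows)) // prints [{"_1":"a1","_2":"b1"},{"_1":"a2","_2":"b2"}]
  }
}
```

An empty input gives "[]", which is still a valid JSON array.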
