
I followed the Spark Streaming guide and was able to get a DataFrame of my JSON data using sqlContext.read.json(rdd). The problem is that one of the JSON fields is a JSON string itself that I would like parsed.

Is there a way to accomplish this within Spark SQL, or would it be easier to use ObjectMapper to parse the string and join it to the rest of the data?

To clarify, one of the values in the JSON is a string containing JSON data with the inner quotes escaped. I'm looking for a way to tell the parser to treat that value as stringified JSON.

Example JSON

{ 
  "key": "val",
  "jsonString": "{ \"too\": \"bad\" }",
  "jsonObj": { "ok": "great" }
}

How SQLContext Parses it

root
 |-- key: string (nullable = true)
 |-- jsonString: string (nullable = true)
 |-- jsonObj: struct (nullable = true)
 |    |-- ok: string (nullable = true)

How I would like it

root
 |-- key: string (nullable = true)
 |-- jsonString: struct (nullable = true)
 |    |-- too: string (nullable = true)
 |-- jsonObj: struct (nullable = true)
 |    |-- ok: string (nullable = true)
1 Comment

How about 2 steps? 1. parse to get jsonString as a String; 2. parse jsonString to get the object? Commented Jan 5, 2016 at 7:18

4 Answers


You can use the from_json function to parse a string column of a Dataset:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // needed for the $"..." column syntax and the String encoder

// A Dataset[String] whose single column is named "value"
val stringified = spark.createDataset(Seq("{ \"too\": \"bad\" }", "{ \"too\": \"sad\" }"))
stringified.printSchema()

// Parse the stringified JSON into a struct using the known inner schema
val structified = stringified.withColumn("value", from_json($"value", StructType(Seq(StructField("too", StringType, false)))))
structified.printSchema()

Which converts the value column from a string to a struct:

root
 |-- value: string (nullable = true)

root
 |-- value: struct (nullable = true)
 |    |-- too: string (nullable = false)
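Applied to the question's own DataFrame it would look roughly like the sketch below. This is my own addition rather than part of the original answer; it assumes the inner schema of jsonString ({ "too": ... }) is known up front and that rdd is the same RDD[String] from the question.

// Sketch: convert the question's jsonString column from a plain string to a struct
val df = spark.read.json(rdd)
val fixed = df.withColumn(
  "jsonString",
  from_json($"jsonString", StructType(Seq(StructField("too", StringType, true))))
)
fixed.printSchema()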

1 Comment

Spark added JSON helpers to spark.sql.functions, so this is now the best approach when using the DataFrame API.

Older RDD API Approach (see accepted answer for DataFrame API)

I ended up using Jackson to parse the json envelope, then again to parse the inner escaped string.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper // package may differ in newer jackson-module-scala versions

val parsedRDD = rdd.map { x =>

      // Get a Jackson mapper (built per record here; mapPartitions would let you reuse one per partition)
      val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)

      // parse the envelope
      val envelopeMap = mapper.readValue[Map[String, Any]](x)

      // parse the inner jsonString value
      val event = mapper.readValue[Map[String, Any]](envelopeMap.getOrElse("jsonString", "").asInstanceOf[String])

      // build a Map that includes the parsed jsonString
      val parsed = envelopeMap.updated("jsonString", event)

      // write the entire map back out as a json string
      mapper.writeValueAsString(parsed)
}

val df = sqlContext.read.json(parsedRDD)

Now parsedRDD contains valid JSON and the DataFrame properly infers the entire schema.

I think there must be a way to avoid having to serialize to JSON and parse it again, but so far I don't see any SQLContext APIs that operate on RDD[Map[String, Any]].
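If only a couple of fields from the inner string are needed, a lighter alternative (my own sketch, not part of this answer) is get_json_object from spark.sql.functions, which extracts values from a JSON string column by path and avoids the Jackson round trip entirely. The column name jsonString and the path $.too follow the example JSON in the question; adjust them to your data.

import org.apache.spark.sql.functions.get_json_object

// Read the envelope once, then pull the inner field out of the stringified JSON by path
val envelope = sqlContext.read.json(rdd)
val withInner = envelope.withColumn("too", get_json_object(envelope("jsonString"), "$.too"))
withInner.printSchema()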

Comments


The JSON you have provided is wrong, so I fixed it and am giving you an example.

Let's take the JSON below: {"key": "val","jsonString": {"too": "bad"},"jsonObj": {"ok": "great"}}

Spark SQL's JSON parser allows you to read nested JSON as well; frankly, if that were not provided it would be incomplete, because almost all real-world JSON is nested.

Coming to how to access it, you need to select using dot notation. Here it is: jsonString.too or jsonObj.ok.

Below is an example to illustrate:

scala> val df1 = sqlContext.read.json("/Users/srini/workspace/splunk_spark/file3.json").toDF
df1: org.apache.spark.sql.DataFrame = [jsonObj: struct<ok:string>, jsonString: struct<too:string>, key: string]

scala> df1.show
+-------+----------+---+
|jsonObj|jsonString|key|
+-------+----------+---+
|[great]|     [bad]|val|
+-------+----------+---+


scala> df1.select("jsonString.too");
res12: org.apache.spark.sql.DataFrame = [too: string]

scala> df1.select("jsonString.too").show
+---+
|too|
+---+
|bad|
+---+


scala> df1.select("jsonObj.ok").show
+-----+
|   ok|
+-----+
|great|
+-----+
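The same dot-notation access also works from plain SQL. This is a sketch of my own, not from the original answer; the temp table name events is arbitrary.

// Register the DataFrame as a temp table and query the nested fields with dot notation
df1.registerTempTable("events")
sqlContext.sql("SELECT jsonString.too, jsonObj.ok FROM events").show()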

Hope that makes sense. Reply back if you need any more info. It's just parentNode.childNode, that's it.

1 Comment

If the structure is fixed, it is also fine to go ahead with ObjectMapper as you mentioned. But there is a way to access the nested objects. You can refer to the link below as well: databricks.com/blog/2015/02/02/…

Obviously

"jsonString": "{ \"too\": \"bad\" }"

is not valid JSON data; fix it and make sure the entire string is a valid JSON structure.

2 Comments

Should have been "jsonString": "{\"too\": \"bad\"}", which is a valid JSON string property. I am wondering if there is any way to hint that it should be parsed as JSON.
It seems there is no such hint; Spark SQL will try to parse the JSON syntax, and if that does not succeed, the value degenerates to the string data type. That is the expected behaviour.
