
I followed the Spark Streaming guide and was able to get a DataFrame of my JSON data using sqlContext.read.json(rdd). The problem is that one of the JSON fields is a JSON string itself that I would like parsed.

Is there a way to accomplish this within Spark SQL, or would it be easier to use ObjectMapper to parse the string and join it to the rest of the data?

To clarify, one of the values in the JSON is a string containing JSON data with the inner quotes escaped. I'm looking for a way to tell the parser to treat that value as stringified JSON.

Example JSON

{ 
  "key": "val",
  "jsonString": "{ \"too\": \"bad\" }",
  "jsonObj": { "ok": "great" }
}

How SQLContext Parses it

root
 |-- key: string (nullable = true)
 |-- jsonString: string (nullable = true)
 |-- jsonObj: struct (nullable = true)
 |    |-- ok: string (nullable = true)

How I would like it

root
 |-- key: string (nullable = true)
 |-- jsonString: struct (nullable = true)
 |    |-- too: string (nullable = true)
 |-- jsonObj: struct (nullable = true)
 |    |-- ok: string (nullable = true)
1 Comment

How about 2 steps? 1. parse to get jsonString as a String; 2. parse jsonString to get the object? Commented Jan 5, 2016 at 7:18

4 Answers


You can use the from_json function to parse a string column of a Dataset:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // needed for the $"..." column syntax and the String encoder

// A Dataset[String] whose single column is named "value"
val stringified = spark.createDataset(Seq("{ \"too\": \"bad\" }", "{ \"too\": \"sad\" }"))
stringified.printSchema()

// Parse the stringified JSON into a struct using the known inner schema
val structified = stringified.withColumn("value", from_json($"value", StructType(Seq(StructField("too", StringType, false)))))
structified.printSchema()

Which converts the value column from a string to a struct:

root
 |-- value: string (nullable = true)

root
 |-- value: struct (nullable = true)
 |    |-- too: string (nullable = false)
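Applied to the question's own DataFrame it would look roughly like the sketch below. This is my own addition rather than part of the original answer; it assumes the inner schema of jsonString ({ "too": ... }) is known up front and that rdd is the same RDD[String] from the question.

// Sketch: convert the question's jsonString column from a plain string to a struct
val df = spark.read.json(rdd)
val fixed = df.withColumn(
  "jsonString",
  from_json($"jsonString", StructType(Seq(StructField("too", StringType, true))))
)
fixed.printSchema()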

1 Comment

Spark added JSON helpers to spark.sql.functions, so this is now the best approach when using the DataFrame API.

Older RDD API Approach (see accepted answer for DataFrame API)

I ended up using Jackson to parse the json envelope, then again to parse the inner escaped string.

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper // package may differ in newer jackson-module-scala versions

val parsedRDD = rdd.map { x =>

      // Get a Jackson mapper (built per record here; mapPartitions would let you reuse one per partition)
      val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)

      // parse the envelope
      val envelopeMap = mapper.readValue[Map[String, Any]](x)

      // parse the inner jsonString value
      val event = mapper.readValue[Map[String, Any]](envelopeMap.getOrElse("jsonString", "").asInstanceOf[String])

      // build a Map that includes the parsed jsonString
      val parsed = envelopeMap.updated("jsonString", event)

      // write the entire map back out as a json string
      mapper.writeValueAsString(parsed)
}

val df = sqlContext.read.json(parsedRDD)

Now parsedRDD contains valid JSON and the DataFrame properly infers the entire schema.

I think there must be a way to avoid having to serialize to JSON and parse it again, but so far I don't see any SQLContext APIs that operate on RDD[Map[String, Any]].
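If only a couple of fields from the inner string are needed, a lighter alternative (my own sketch, not part of this answer) is get_json_object from spark.sql.functions, which extracts values from a JSON string column by path and avoids the Jackson round trip entirely. The column name jsonString and the path $.too follow the example JSON in the question; adjust them to your data.

import org.apache.spark.sql.functions.get_json_object

// Read the envelope once, then pull the inner field out of the stringified JSON by path
val envelope = sqlContext.read.json(rdd)
val withInner = envelope.withColumn("too", get_json_object(envelope("jsonString"), "$.too"))
withInner.printSchema()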

Comments


The JSON you have provided is wrong, so I fixed it and am giving you an example.

Let's take the JSON below: {"key": "val","jsonString": {"too": "bad"},"jsonObj": {"ok": "great"}}

Spark SQL's JSON parser allows you to read nested JSON as well; frankly, if that were not provided it would be incomplete, because almost all real-world JSON is nested.

Coming to how to access it, you need to select using dot notation. Here it is: jsonString.too or jsonObj.ok.

Below is an example to illustrate:

scala> val df1 = sqlContext.read.json("/Users/srini/workspace/splunk_spark/file3.json").toDF
df1: org.apache.spark.sql.DataFrame = [jsonObj: struct<ok:string>, jsonString: struct<too:string>, key: string]

scala> df1.show
+-------+----------+---+
|jsonObj|jsonString|key|
+-------+----------+---+
|[great]|     [bad]|val|
+-------+----------+---+


scala> df1.select("jsonString.too");
res12: org.apache.spark.sql.DataFrame = [too: string]

scala> df1.select("jsonString.too").show
+---+
|too|
+---+
|bad|
+---+


scala> df1.select("jsonObj.ok").show
+-----+
|   ok|
+-----+
|great|
+-----+
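The same dot-notation access also works from plain SQL. This is a sketch of my own, not from the original answer; the temp table name events is arbitrary.

// Register the DataFrame as a temp table and query the nested fields with dot notation
df1.registerTempTable("events")
sqlContext.sql("SELECT jsonString.too, jsonObj.ok FROM events").show()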

Hope that makes sense. Reply back if you need any more info. It's just parentNode.childNode, that's it.

1 Comment

If the structure is fixed, it is also fine to go ahead with ObjectMapper as you mentioned. But there is a way to access the nested objects. You can refer to the link below as well: databricks.com/blog/2015/02/02/…

Obviously

"jsonString": "{ \"too\": \"bad\" }"

is not valid JSON data; fix it and make sure the entire string is a valid JSON structure.

2 Comments

Should have been "jsonString": "{\"too\": \"bad\"}", which is a valid JSON string property. I am wondering if there is any way to hint that it should be parsed as JSON.
It seems there is no such hint; Spark SQL will try to parse the JSON syntax, and if that does not succeed, the value degenerates to the string data type. That is the expected behaviour.
