I have a JSON string such as:

{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955},1],
{"sequence":155,"id":8697389381205360,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637858607},1],
{"sequence":136,"id":8697374208897843,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637405129},1],
{"sequence":189,"id":8697413135394406,"trackingInfo":{"row":0,"trackId":14272744,"requestId":"284929d9-6147-4924-a19f-4a308730354c-3348447","rank":0,"videoId":80075830,"location":"PostPlay\/Next"},"type":["Play","Action","Session"],"time":527638558756},1],
{"sequence":130,"id":8697373887446384,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527637394083}]

What would be the best approach here? I've tried:

val rdd = sc.parallelize(Seq(jsonString)).flatMap(_.split("}"))
val trackingRdd = rdd.filter(_.contains("trackingInfo"))

An example record produced by this attempt is:

,{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"

As you can see, I nearly have all the data I want. The only part missing is "type":["Play","Action","Session"],"time":527636408955},1], which is cut off because I split on }.

Any help is appreciated.

1 Answer

Rather than splitting the string by hand, we can let Spark parse it as JSON, for example:

scala> val df=spark.read.json(sc.parallelize(Seq(jsonString))).select(explode(col("reverseDeltas"))).select(explode(col("col"))).map(_.getString(0)).filter(_.indexOf('{')>=0)
warning: there was one deprecation warning; re-run with -deprecation for details
df: org.apache.spark.sql.Dataset[String] = [value: string]

scala> spark.read.json(df).filter(col("trackingInfo").isNotNull).select("trackingInfo").toJSON.show(false)
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                            |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"trackingInfo":{"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","location":"Browse","rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171","row":0,"trackId":14170286,"videoId":80000778}}|
|{"trackingInfo":{"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","location":"Browse","rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171","row":0,"trackId":14170286,"videoId":80000778}}|
|{"trackingInfo":{"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","location":"Browse","rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171","row":0,"trackId":14170286,"videoId":80000778}}|
|{"trackingInfo":{"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","location":"Browse","rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171","row":0,"trackId":14170286,"videoId":80000778}}|
|{"trackingInfo":{"location":"PostPlay/Next","rank":0,"requestId":"284929d9-6147-4924-a19f-4a308730354c-3348447","row":0,"trackId":14272744,"videoId":80075830}}                                                                  |
|{"trackingInfo":{"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","location":"Browse","rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171","row":0,"trackId":14170286,"videoId":80000778}}|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


scala> 
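
The first line of the transcript chains several steps, so it may help to see them one at a time. The sketch below is an illustration under assumptions, not the answerer's exact code: the sample string is a hypothetical, shortened input with a top-level "reverseDeltas" key (the key name comes from the answer, the full original string is not shown in the question), and the names ExplainFirstLine, sampleJson and extractRecords are made up for this example. It uses the Dataset[String] overload of read.json (Spark 2.2+), which is likely what the deprecation warning above points to.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplainFirstLine {
  // Hypothetical, shortened input: an array of [record, number] pairs.
  val sampleJson: String =
    """{"reverseDeltas":[[{"sequence":89,"trackingInfo":{"location":"Browse","trackId":14170286},"time":527636408955},1],[{"sequence":90,"time":527636408999},1]]}"""

  def extractRecords(spark: SparkSession, jsonString: String): Dataset[String] = {
    import spark.implicits._

    // 1. Parse the whole string as one JSON document. Each inner array mixes
    //    an object with a number, so Spark is expected to fall back to strings
    //    for the elements (array<array<string>>).
    val wrapped = spark.read.json(Seq(jsonString).toDS())

    // 2. One row per [record, number] pair; explode names its output column "col".
    val pairs = wrapped.select(explode(col("reverseDeltas")))

    // 3. One row per element of each pair (the record text, then the number).
    val elements = pairs.select(explode(col("col")))

    // 4. Keep only the elements that look like JSON objects, dropping the bare numbers.
    elements.map(_.getString(0)).filter(_.indexOf('{') >= 0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("explain-first-line").getOrCreate()
    extractRecords(spark, sampleJson).show(false)
    spark.stop()
  }
}
```

The result is a Dataset[String] with one JSON record per row, which is why the second line of the transcript can pass it to spark.read.json again to get typed columns.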

4 Comments

Thanks for your answer. Would it be possible to get the outer values around each record that has trackingInfo? I need sequence, id, type (just the first element) and time.
Actually, I can get those with a simple filter after running your first line and keeping the result as an RDD, then doing df.filter(_.contains("trackingInfo")). An example result record looks like: {"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"listId":"cd7c2c7a-00f6-4035-867f-d1dd7d89972d_6625365X3XX1505943605585","videoId":80000778,"rank":0,"requestId":"ac12f4e1-5644-46af-87d1-ec3b92ce4896-4071171"},"type":["Play","Action","Session"],"time":527636408955}
Would you mind explaining the first line of your code? sqlContext.read.json(sc.parallelize(Seq(jsonString))) .select(explode(col("reverseDeltas"))) .select(explode(col("col"))) .map(_.getString(0)) .filter(_.indexOf('{') >= 0)
Once the JSON data is loaded as a Dataset, we can query it with Spark SQL functions. FYI: spark.apache.org/docs/latest/api/java/org/apache/spark/sql/…
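
As an alternative to the string filter discussed in the comments, the outer fields can be selected as columns once the records are parsed. This is a minimal sketch, assuming each element of the input Dataset[String] is one record like the example above. The names OuterFields, outerFields and firstType are hypothetical, the first sample record is shortened from the question, and the second one is invented to show the filter.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

object OuterFields {
  // Parse one-record-per-row JSON strings and keep the outer fields
  // of the records that carry a trackingInfo object.
  def outerFields(spark: SparkSession, records: Dataset[String]): DataFrame =
    spark.read.json(records)
      .filter(col("trackingInfo").isNotNull)
      .select(
        col("sequence"),
        col("id"),
        col("type").getItem(0).as("firstType"), // first element of the type array
        col("time"),
        col("trackingInfo"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("outer-fields").getOrCreate()
    import spark.implicits._

    val records = Seq(
      // Shortened from the record shown in the question.
      """{"sequence":89,"id":8697344444103393,"trackingInfo":{"location":"Browse","row":0,"trackId":14170286,"videoId":80000778,"rank":0},"type":["Play","Action","Session"],"time":527636408955}""",
      // Hypothetical record without trackingInfo, dropped by the filter.
      """{"sequence":90,"id":1,"type":["Play"],"time":1}"""
    ).toDS()

    outerFields(spark, records).show(false)
    spark.stop()
  }
}
```

Filtering on the parsed column rather than on the raw text avoids matching records where the word trackingInfo only appears inside some other string value.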
