
I have JSON data like this:

{
   "parent":[
      {
         "prop1":1.0,
         "prop2":"C",
         "children":[
            {
               "child_prop1":[
                  "3026"
               ]
            }
         ]
      }
   ]
}

After reading the data with Spark I get the following schema:

val df = spark.read.json("test.json")
df.printSchema
root
 |-- parent: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- children: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- child_prop1: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |-- prop1: double (nullable = true)
 |    |    |-- prop2: string (nullable = true)

Now I want to select child_prop1 from df, but when I try I get an org.apache.spark.sql.AnalysisException:

df.select("parent.children.child_prop1")
org.apache.spark.sql.AnalysisException: cannot resolve '`parent`.`children`['child_prop1']' due to data type mismatch: argument 2 requires integral type, however, ''child_prop1'' is of string type.;;
'Project [parent#60.children[child_prop1] AS child_prop1#63]
+- Relation[parent#60] json

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
  ... 48 elided

However, selecting only children from df works fine:

df.select("parent.children").show(false)
+------------------------------------+
|children                            |
+------------------------------------+
|[WrappedArray([WrappedArray(3026)])]|
+------------------------------------+

I cannot understand why this throws an exception even though the column is present in the DataFrame.

Any help is appreciated!


2 Answers


Your JSON is valid, and I don't think you need to change your input data.

Use explode to flatten the nested arrays:

import org.apache.spark.sql.functions.explode

// Explode the outer array (one row per element of `parent`).
val data = spark.read.json("src/test/java/data.json")
val child = data.select(explode(data("parent.children"))).toDF("children")

// Explode again to reach child_prop1 inside the inner array of structs.
child.select(explode(child("children.child_prop1"))).toDF("child_prop1").show()
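
With the sample JSON above this should print something like:

+-----------+
|child_prop1|
+-----------+
|[3026]     |
+-----------+

Each explode unwraps exactly one array level, so a third explode would flatten child_prop1 all the way down to plain string rows.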

If you can change the input data, you can follow @ramesh's suggestion below.




If you look at the schema, child_prop1 sits inside a nested array within the root array parent. So you need to define a position for each array level to reach child_prop1, and that is exactly what the error message is asking you to supply (see the sketch below).
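
For example, a minimal sketch against the original schema; the 0 indexes are assumptions that simply pick the first element at each array level:

import org.apache.spark.sql.functions.col

// Supply an integral position for each array level, as the error suggests:
// parent(0) -> struct, children(0) -> struct, then extract child_prop1.
df.select(col("parent")(0)("children")(0)("child_prop1")).show(false)

With the sample data this should print [3026].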
Alternatively, restructuring your JSON should do the trick.
Changing the JSON to

{"parent":{"prop1":1.0,"prop2":"C","children":{"child_prop1":["3026"]}}}

and applying

df.select("parent.children.child_prop1").show(false)

gives this output:

+-----------+
|child_prop1|
+-----------+
|[3026]     |
+-----------+

And changing the JSON to

{"parent":{"prop1":1.0,"prop2":"C","children":[{"child_prop1":["3026"]}]}}

and applying

df.select("parent.children.child_prop1").show(false)

results in

+--------------------+
|child_prop1         |
+--------------------+
|[WrappedArray(3026)]|
+--------------------+
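
Note that in this second form the values are still wrapped in the outer array. If you need fully flat rows, the explode approach from the other answer applies here as well; a minimal sketch:

import org.apache.spark.sql.functions.{col, explode}

// First explode the array of child_prop1 arrays, then the inner string arrays.
df.select(explode(col("parent.children.child_prop1")).as("arr"))
  .select(explode(col("arr")).as("child_prop1"))
  .show(false)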

I hope this answer helps.

