I'm trying to modify a DataFrame generated by an external library. I receive a DataFrame with this schema:

root
 |-- child: struct (nullable = true)
 |    |-- child_id: long (nullable = true)

I would like to wrap the child struct above in an array, as shown below.

root
 |-- child: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child_id: long (nullable = true)

I tried to define a UDF:

// The two lines below are just an example; in reality the DataFrame comes from an external library.
val seq = sc.parallelize(Seq("""{ "child": { "child_id": 1}}"""))
val df = sqlContext.read.json(seq)

val myUDF = udf((x: Row) => Array(x))
val df2 = df.withColumn("children",myUDF($"child"))

But I get an exception: "Schema for type org.apache.spark.sql.Row is not supported"

I'm working with Spark 2.1.1.

The real DataFrame is very complex. Is there a solution that allows modifying the schema without listing the names or positions of the fields in the child struct? For the same reason, I would also rather not map to explicit case classes.

Thank you in advance for any help!

1 Answer

You can use the built-in array function to get your desired result:

import org.apache.spark.sql.functions._
val df2 = df.withColumn("child", array("child"))

This overwrites the same column. If you want the result in a separate column instead, do

import org.apache.spark.sql.functions._
val df2 = df.withColumn("children", array("child"))
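As a quick check, here is a minimal, self-contained sketch (assuming a local SparkSession for illustration; in the question the DataFrame comes from an external library) showing that array wraps the struct without ever naming its nested fields:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array

// Local session for illustration only; the real DataFrame
// arrives already built from an external library.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext

val seq = sc.parallelize(Seq("""{ "child": { "child_id": 1}}"""))
val df = spark.sqlContext.read.json(seq)

// array("child") wraps the existing struct as the single element of an
// array column, leaving the nested schema (child_id, etc.) untouched.
val df2 = df.withColumn("child", array("child"))
df2.printSchema()
```

One small difference from the schema shown in the question: a column created with array(...) comes out as nullable = false, because the array itself is always constructed, even when the wrapped struct is null.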

