
I'm trying to create a dataframe to feed to a function as part of my unit tests. If I have the following:

val myDf = sparkSession.sqlContext.createDataFrame(
  sparkSession.sparkContext.parallelize(Seq(
    Row(Some(Seq(MyObject(1024, 100001D), MyObject(1, -1D)))))),
    StructType(List(
      StructField("myList", ArrayType[???], true)
    )))

MyObject is a case class.

I don't know what to put for the object type. Any suggestions? I've tried ArrayType of pretty much every combination I can think of.

I'm looking for a dataframe that looks something like:

+--------------------+
|   myList           |
+--------------------+
| [1024, 100001]     |
| [1, -1]            |
+--------------------+

2 Answers


Coming at it from the other direction: instead of building the schema by hand, build the data and let Spark infer the schema.

// assumes the session implicits are in scope (import spark.implicits._) if you're not in spark-shell
val s = Seq(Array(1024, 100001D), Array(1, -1D)).toDS().toDF("myList")
println(s.schema)
s.printSchema
s.show

The resulting schema is shown below. The element type is DoubleType because 100001D and -1D are doubles, so the whole array is widened to Double. (If you want the elements to stay MyObject structs instead, see the sketch after the output below.)

StructType(StructField(myList,ArrayType(DoubleType,false),true))

The output you were after:

root
 |-- myList: array (nullable = true)
 |    |-- element: double (containsNull = false)

+------------------+
|            myList|
+------------------+
|[1024.0, 100001.0]|
|       [1.0, -1.0]|
+------------------+
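Not part of the original answer, but if you want to match the question's code more literally (an array of MyObject structs rather than plain doubles), the same inferred-schema trick works by nesting the case class in a Seq. A minimal sketch, assuming the implicits are imported and assuming field names prop1/prop2, since the question doesn't show them:

case class MyObject(prop1: Int, prop2: Double)

// one row whose single column holds an array of two MyObject structs
val nested = Seq(
  Seq(MyObject(1024, 100001D), MyObject(1, -1D))
).toDS().toDF("myList")

nested.printSchema
// roughly:
// root
//  |-- myList: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- prop1: integer (nullable = false)
//  |    |    |-- prop2: double (nullable = false)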

Alternatively, you can do it this way and keep the case class:

// as above, this assumes the session implicits are in scope, plus
// import org.apache.spark.sql.functions.struct outside of spark-shell
case class MyObject(a: Int, b: Double)

val s = Seq(MyObject(1024, 100001D), MyObject(1, -1D)).toDS()
  .select(struct($"a", $"b").as[MyObject] as "myList")
println(s.schema)
s.printSchema
s.show

Result:

// schema:
StructType(StructField(myList,StructType(StructField(a,IntegerType,false), StructField(b,DoubleType,false)),false))

root
 |-- myList: struct (nullable = false)
 |    |-- a: integer (nullable = false)
 |    |-- b: double (nullable = false)

+----------------+
|          myList|
+----------------+
|[1024, 100001.0]|
|       [1, -1.0]|
+----------------+
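A small usage note (my addition, assuming the same implicits are in scope): if you later need the typed objects back out of that struct column, you can expand its fields and re-attach the encoder:

// expand the struct's fields into top-level columns, then bind the case-class encoder again
val roundTrip = s.select($"myList.*").as[MyObject]
// roundTrip is a Dataset[MyObject] again, holding the original two rows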

Try this

scala> case class MyObject(prop1:Int, prop2:Double)
defined class MyObject

scala> val df = Seq((1024, 100001D), (1, -1D)).toDF("prop1","prop2").select(struct($"prop1",$"prop2").as[MyObject] as "myList")
df: org.apache.spark.sql.DataFrame = [myList: struct<prop1: int, prop2: double>]

scala> df.show(false)
+----------------+
|myList          |
+----------------+
|[1024, 100001.0]|
|[1, -1.0]       |
+----------------+


scala> df.printSchema
root
 |-- myList: struct (nullable = false)
 |    |-- prop1: integer (nullable = false)
 |    |-- prop2: double (nullable = false)
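For completeness (this is my sketch, not part of either answer): if you want to stay on the explicit Row/StructType route the question started from, the thing that replaces ArrayType[???] is ArrayType(StructType(...)) describing the case class fields, and the data has to be supplied as nested Rows rather than case-class instances. Assuming sparkSession is in scope and reusing the prop1/prop2 names:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// ArrayType takes a DataType value, not a type parameter
val elementType = StructType(Seq(
  StructField("prop1", IntegerType, nullable = false),
  StructField("prop2", DoubleType, nullable = false)
))

val myDf = sparkSession.createDataFrame(
  sparkSession.sparkContext.parallelize(Seq(
    Row(Seq(Row(1024, 100001D), Row(1, -1D)))   // each array element is itself a Row
  )),
  StructType(Seq(StructField("myList", ArrayType(elementType), nullable = true)))
)

myDf.printSchema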


