4

How can I create Dataset using StructType?

We can create a Dataset as follows:

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()

Is there a way to create a Dataset without using a case class?

I'd like to create a DataFrame using a case class and using StructType.

3
  • Are you perhaps thinking of DataFrame? It's an alias for Dataset[Row] in Spark 2, and it can be created using a StructType to specify the schema. Commented Sep 18, 2017 at 17:44
  • DataFrame = Dataset[Row], so if you know how to create DataFrame, you know how to create a dataset :) Commented Sep 18, 2017 at 17:49
  • @T.Gaweda, if you look at the method "spark.createDataset", there is no option to pass a "StructType", and if you try to create a Dataset from a DataFrame you still need a case class. Commented Sep 18, 2017 at 18:08

3 Answers

7

If you know how to create a DataFrame, you already know how to create a Dataset :)

DataFrame = Dataset[Row].

What does that mean? Try:

import org.apache.spark.sql._ // brings DataFrame, Dataset and Row into scope
val df: DataFrame = spark.createDataFrame(...) // with StructType
val ds: Dataset[Row] = df // no error, as DataFrame is only a type alias of Dataset[Row]

3

That's an interesting question in the sense that I don't see a reason why one would want it.

How can I create Dataset using "StructType"

I'd then ask a very similar question...

Why would you like to "trade" a case class for a StructType? What would that give you that a case class could not?

The reason you use a case class is that it can offer you two things at once:

  1. Describe your schema quickly, nicely and type-safely

  2. Working with your data becomes type-safe

Regarding 1. as a Scala developer, you will define business objects that describe your data. You will have to do it anyway (unless you like tuples and _1 and such).

Type safety (in both 1. and 2.) is about leveraging the Scala compiler, which can help find places where you expect a String but have an Int. With StructType the check happens only at runtime (not at compile time).
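To make the compile-time versus runtime distinction concrete, here is a minimal sketch. It assumes a local SparkSession and the Person case class from the question; the object name TypeSafetySketch and the run helper are only illustrative.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import scala.util.Try

// Defined at the top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object TypeSafetySketch {
  // Returns the typed result and whether the untyped variant failed at runtime.
  def run(spark: SparkSession): (Seq[Int], Boolean) = {
    import spark.implicits._

    val typed: Dataset[Person] = Seq(Person("Max", 33), Person("Adam", 32)).toDS()
    val untyped: Dataset[Row] = typed.toDF()

    // Typed: the compiler knows that `age` is an Int.
    val agesNextYear = typed.map(p => p.age + 1).collect().toSeq.sorted
    // typed.map(p => p.age.toUpperCase) // rejected by the compiler: Int has no toUpperCase

    // Untyped: this compiles, but treating the Int column as a String
    // is expected to fail only once the job actually runs.
    val failedAtRuntime =
      Try(untyped.map(row => row.getAs[String]("age").toUpperCase).collect()).isFailure

    (agesNextYear, failedAtRuntime)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for illustration only
      .appName("type-safety-sketch")
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

The commented-out line is the interesting one: with a case class the mistake never gets past the compiler, while the Row-based version only surfaces it when the job executes.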

With all that said, the answer to your question is "Yes".

You can create a Dataset using StructType.

scala> val personDS = Seq(("Max", 33), ("Adam", 32), ("Muller", 62)).toDS
personDS: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> personDS.show
+------+---+
|    _1| _2|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

You may be wondering why you don't see the column names. That's exactly the reason for a case class: it gives you not only the types, but also the names of the columns.

There's one trick you can use however to avoid dealing with case classes if you don't like them.

scala> val withNames = personDS.toDF("name", "age").as[(String, Int)]
scala> withNames.show
+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

1 Comment

I agree with you, @jacek-laskowski, that a case class has benefits over StructType. My motivation for asking was that I was creating a DataFrame for arbitrary data by defining the schema for that data in a conf and building the StructType dynamically from that conf schema. I was wondering whether the same could be achieved with a Dataset.
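The use case described in this comment could be sketched roughly as follows. The conf format (a sequence of column name and type name pairs), the helper names toDataType and buildSchema, and the set of supported type names are all assumptions made for illustration; the result is a Dataset[Row], since no case class is known at compile time.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types._

object DynamicSchemaSketch {
  // Hypothetical mapping from type names in the conf to Spark SQL types.
  def toDataType(typeName: String): DataType = typeName.toLowerCase match {
    case "string"          => StringType
    case "int" | "integer" => IntegerType
    case "long"            => LongType
    case "double"          => DoubleType
    case "boolean"         => BooleanType
    case other =>
      throw new IllegalArgumentException(s"Unsupported type in conf: $other")
  }

  // Builds a StructType from (column name, type name) pairs read from the conf.
  def buildSchema(columns: Seq[(String, String)]): StructType =
    StructType(columns.map { case (name, tpe) =>
      StructField(name, toDataType(tpe), nullable = true)
    })

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for illustration only
      .appName("dynamic-schema-sketch")
      .getOrCreate()

    // Stand-in for whatever the real conf provides.
    val conf = Seq("name" -> "string", "age" -> "int")
    val schema = buildSchema(conf)

    val rows = Seq(Row("Max", 33), Row("Adam", 32), Row("Muller", 62))
    // createDataFrame returns a DataFrame, which is the same type as Dataset[Row].
    val ds: Dataset[Row] =
      spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

    ds.printSchema()
    ds.show()
    spark.stop()
  }
}
```

With a schema known only at runtime, Dataset[Row] is about as far as this approach goes: a typed Dataset[T] needs an encoder for T, and that in turn needs a type known at compile time.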
0

Here's how you can create the Dataset with a StructType:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)
))

val data = Seq(
  Row("Max", 33),
  Row("Adam", 32),
  Row("Muller", 62)
)

val personDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

val yourDS = personDF.as[(String, Int)]

yourDS.show()
+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

yourDS is an org.apache.spark.sql.Dataset[(String, Int)].

The personDS in your question is of type org.apache.spark.sql.Dataset[Person], so this doesn't quite give the same result.
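If a Dataset[Person] is what you ultimately need, one option is to build the DataFrame from the StructType as above and then convert it with as[Person]. This does bring the case class back in, so treat it as a sketch of how the two approaches combine; the object name AsPersonSketch and the load helper are illustrative only.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types._

// Defined at the top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object AsPersonSketch {
  // Builds a DataFrame from an explicit StructType, then views it as Dataset[Person].
  def load(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))
    val data = Seq(Row("Max", 33), Row("Adam", 32), Row("Muller", 62))
    val personDF = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

    // Column names and types in the schema must line up with the case class fields.
    personDF.as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for illustration only
      .appName("as-person-sketch")
      .getOrCreate()

    val personDS = load(spark)
    // Fields are now accessed by name and checked by the compiler.
    personDS.filter(p => p.age > 40).show()
    spark.stop()
  }
}
```

Note that age is declared nullable in the schema but is a plain Int in the case class, so a null in that column would only be reported when the rows are deserialized.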

See this post for more info on how to create Datasets.
