How to create Dataset (not DataFrame) without using case class but using StructType?

Question

How can I create Dataset using StructType?

We can create a Dataset as follows:

case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()

Is there a way to create a Dataset without using a case class?

I know a DataFrame can be created either from a case class or from a StructType.

Are you perhaps thinking of DataFrame? It's an alias for Dataset[Row] in Spark 2, and can be created using a StructType to specify a schema.
– Davis Broda, Commented Sep 18, 2017 at 17:44
DataFrame = Dataset[Row], so if you know how to create a DataFrame, you know how to create a Dataset :)
– T. Gawęda, Commented Sep 18, 2017 at 17:49
@T.Gaweda, if you look at the method "spark.createDataset", there is no option to pass a "StructType", and if you try to create a Dataset from a DataFrame you still need a case class.
– Narendra Parmar, Commented Sep 18, 2017 at 18:08

T. Gawęda (edited by Jacek Laskowski) · Accepted Answer · 2017-09-19 06:38:37Z

7

If you know how to create a DataFrame, you already know how to create a Dataset :)

DataFrame = Dataset[Row].

What does that mean? Try:

import org.apache.spark.sql._

val df: DataFrame = spark.createDataFrame(...) // with StructType
val ds: Dataset[Row] = df // no error, as DataFrame is only a type alias of Dataset[Row]

edited Sep 19, 2017 at 6:38

Jacek Laskowski


answered Sep 18, 2017 at 18:35

T. Gawęda


Jacek Laskowski · Accepted Answer · 2017-09-19 07:03:57Z

3

That's an interesting question in the sense that I don't see a reason why one would want it.

How can I create Dataset using "StructType"

I'd then ask a very similar question...

Why would you like to "trade" a case class with a StructType? What would that give you that a case class could not?

The reason you use a case class is that it can offer you two things at once:

1. Describe your schema quickly, nicely and type-safely
2. Working with your data becomes type-safe

Regarding 1., as a Scala developer you will define business objects that describe your data. You will have to do it anyway (unless you like tuples and _1 and such).

Type-safety (in both 1. and 2.) is about letting the Scala compiler help while you transform your data: it can find places where you expect a String but have an Int. With StructType the check happens only at runtime (not at compile time).
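
To make the compile-time versus runtime distinction concrete, here is a minimal sketch. It assumes a local SparkSession; the schema, the sample rows and the names untyped, typed and wrongType are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import scala.util.Try

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("type-safety-sketch").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))
val rows = Seq(Row("Max", 33), Row("Adam", 32))
val untyped = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Row access is only checked at runtime: this compiles, but "age" holds an Int,
// so treating it as a String fails with a ClassCastException when it is used.
val wrongType = Try(untyped.head().getAs[String]("age").length)

// A typed Dataset lets the compiler catch the same mistake.
val typed = untyped.as[(String, Int)]
// typed.map(_._2.length)  // would not compile: `length` is not a member of Int
val agesNextYear = typed.map(_._2 + 1).collect().sorted
```

Here wrongType ends up as a Failure, while the equivalent mistake on the typed Dataset is rejected before the job ever runs.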

With all that said, the answer to your question is "Yes".

You can create a Dataset without a case class, for example from a sequence of tuples:

scala> val personDS = Seq(("Max", 33), ("Adam", 32), ("Muller", 62)).toDS
personDS: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> personDS.show
+------+---+
|    _1| _2|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

You may be wondering why you don't see the column names. That's exactly the reason for a case class: it gives you not only the types, but also the names of the columns.

There's one trick you can use however to avoid dealing with case classes if you don't like them.

scala> val withNames = personDS.toDF("name", "age").as[(String, Int)]
scala> withNames.show
+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+
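
One caveat with the tuple trick, sketched below under the assumption of a local SparkSession: as far as I can tell, a typed filter keeps the existing column names, whereas map derives its result schema from the tuple encoder, so the names fall back to _1 and _2 and have to be reapplied. The names filtered, mapped and renamed are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("tuple-names-sketch").getOrCreate()
import spark.implicits._

val withNames = Seq(("Max", 33), ("Adam", 32), ("Muller", 62))
  .toDF("name", "age")
  .as[(String, Int)]

// A typed filter passes the input columns through unchanged.
val filtered = withNames.filter(_._2 > 32)

// map builds a new schema from the result type's encoder, which for tuples is _1, _2.
val mapped = withNames.map { case (name, age) => (name.toUpperCase, age) }

// Reapplying the names works the same way as before.
val renamed = mapped.toDF("name", "age").as[(String, Int)]

val filteredCols = filtered.columns.toSeq
val mappedCols = mapped.columns.toSeq
val renamedCols = renamed.columns.toSeq
```

If the names matter downstream, a case class avoids this bookkeeping, which is the point made above.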

answered Sep 19, 2017 at 7:03

Jacek Laskowski


I agree with you, @jacek-laskowski, that a case class has benefits over StructType. My motivation for asking was that I create DataFrames for arbitrary data by defining the schema for that data in a conf file and building the StructType dynamically from it. I was wondering whether the same could be achieved with a Dataset.
– Narendra Parmar, Commented over a year ago
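
The config-driven use case from this comment could be sketched roughly as follows. The conf format (a list of column name and type name pairs), the helper toDataType and the sample rows are all hypothetical; the result is a Dataset[Row], since no compile-time element type is known when the schema only exists at runtime.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types._

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("dynamic-schema-sketch").getOrCreate()

// Hypothetical schema description, as it might be read from a conf file.
val confSchema: Seq[(String, String)] = Seq("name" -> "string", "age" -> "int")

// Hypothetical helper mapping conf type names to Spark SQL data types.
def toDataType(typeName: String): DataType = typeName match {
  case "string" => StringType
  case "int"    => IntegerType
  case "double" => DoubleType
  case other    => throw new IllegalArgumentException(s"Unsupported type in conf: $other")
}

// Build the StructType dynamically from the conf entries.
val schema = StructType(confSchema.map { case (name, typeName) =>
  StructField(name, toDataType(typeName), nullable = true)
})

val rows = Seq(Row("Max", 33), Row("Adam", 32), Row("Muller", 62))

// DataFrame is an alias for Dataset[Row], so this assignment needs no conversion.
val ds: Dataset[Row] = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Field access by name is checked at runtime only.
val names = ds.collect().map(_.getAs[String]("name")).toSeq
```

A typed view such as .as[(String, Int)] can still be layered on top, but only where the element type is known when the code is compiled.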

Powers · Accepted Answer · 2021-01-27 15:29:23Z

Here's how you can create the Dataset with a StructType:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)
))

val data = Seq(
  Row("Max", 33),
  Row("Adam", 32),
  Row("Muller", 62)
)

val personDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

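// .as[...] needs the implicit encoders (spark-shell imports them automatically)
import spark.implicits._

val yourDS = personDF.as[(String, Int)]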

yourDS.show()

+------+---+
|  name|age|
+------+---+
|   Max| 33|
|  Adam| 32|
|Muller| 62|
+------+---+

yourDS is an org.apache.spark.sql.Dataset[(String, Int)].

The personDS in your question is of type org.apache.spark.sql.Dataset[Person], so this doesn't quite give the same result.

See this post for more info on how to create Datasets.
