How to create an empty DataFrame with a specified schema?

Question

I want to create a DataFrame with a specified schema in Scala. I have tried reading an empty JSON file, but I don't think that's best practice.

zero323 · Accepted Answer · 2018-05-09 12:43:10Z

159

Let's assume you want a data frame with the following schema:

root
 |-- k: string (nullable = true)
 |-- v: integer (nullable = false)

Simply define the schema for the data frame and use an empty RDD[Row]:

import org.apache.spark.sql.types.{
    StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row

val schema = StructType(
    StructField("k", StringType, true) ::
    StructField("v", IntegerType, false) :: Nil)

// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema) 
spark.createDataFrame(sc.emptyRDD[Row], schema)

The PySpark equivalent is almost identical:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("k", StringType(), True), StructField("v", IntegerType(), False)
])

# or df = sc.parallelize([]).toDF(schema)

# Spark < 2.0 
# sqlContext.createDataFrame([], schema)
df = spark.createDataFrame([], schema)

Using implicit encoders (Scala only) with Product types like Tuple:

import spark.implicits._

Seq.empty[(String, Int)].toDF("k", "v")

or case class:

case class KV(k: String, v: Int)

Seq.empty[KV].toDF

or

spark.emptyDataset[KV].toDF
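
To check that a variant really produces the intended schema, you can print it and count the rows. The following is only a sketch: it assumes Spark is on the classpath and builds a local session for illustration (in spark-shell, `spark` already exists), and the names `fromRdd` and `fromSeq` are made up for this example.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("empty-df-check").getOrCreate()
import spark.implicits._

val schema = StructType(
  StructField("k", StringType, nullable = true) ::
  StructField("v", IntegerType, nullable = false) :: Nil)

// Two of the variants shown above: explicit schema vs. implicit encoder.
val fromRdd = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
val fromSeq = Seq.empty[(String, Int)].toDF("k", "v")

// Both should list the columns k and v and contain no rows.
fromRdd.printSchema()
fromSeq.printSchema()
println(s"rows: ${fromRdd.count()}, ${fromSeq.count()}")
```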

edited May 9, 2018 at 12:43

answered Jul 17, 2015 at 14:54

zero323

2 Comments

Lucas Lima Over a year ago

This is the most appropriate answer: complete, and also useful if you want to quickly reproduce the schema of an existing dataset. I don't know why it is not the accepted one.

supernatural Over a year ago

How to create the df with a trait instead of a case class: stackoverflow.com/questions/64276952/…

zero323 · Accepted Answer · 2017-09-19 10:12:33Z

48

As of Spark 2.0.0, you can do the following.

Case Class

Let's define a Person case class:

scala> case class Person(id: Int, name: String)
defined class Person

Import the SparkSession's implicit Encoders:

scala> import spark.implicits._
import spark.implicits._

And use SparkSession to create an empty Dataset[Person]:

scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]

Schema DSL

You could also use a Schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).

scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)

scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)

scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType

scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> emptyDF.printSchema
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)

edited Sep 19, 2017 at 10:12

zero323

answered Aug 16, 2016 at 7:07

Jacek Laskowski

4 Comments

Peter Krauss Over a year ago

Hi, the compiler says that spark.emptyDataset does not exist in my module. How do I use it? Is there a correct form similar to the (incorrect) val df = apache.spark.emptyDataset[RawData]?

Jacek Laskowski Over a year ago

@PeterKrauss spark is the value you create using SparkSession.builder; it is not part of the org.apache.spark package. There are two things named spark in play. It's the spark you have available out of the box in spark-shell.

Peter Krauss Over a year ago

Thanks, Jacek. I fixed it: the session created with SparkSession.builder during the initial setup is now passed as a parameter (which seems the best solution), and the code runs.

supernatural Over a year ago

Is there a way to create the empty dataframe using a trait instead of a case class? stackoverflow.com/questions/64276952/…

Ramvignesh · Accepted Answer · 2020-04-18 23:40:19Z

Java version to create an empty Dataset:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Builds an empty Dataset<Row> with the schema returned by getSchema().
public Dataset<Row> emptyDataSet() {
    SparkSession spark = SparkSession.builder().appName("Simple Application")
            .config("spark.master", "local").getOrCreate();

    return spark.createDataFrame(new ArrayList<Row>(), getSchema());
}

// Schema: one nullable long column (column0) followed by five nullable string columns.
public StructType getSchema() {
    String schemaString = "column1 column2 column3 column4 column5";

    List<StructField> fields = new ArrayList<>();
    fields.add(DataTypes.createStructField("column0", DataTypes.LongType, true));

    for (String fieldName : schemaString.split(" ")) {
        fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
    }

    return DataTypes.createStructType(fields);
}

Murmel · Accepted Answer · 2017-10-31 11:29:08Z

You can define the schema with StructType in Scala and pass an empty RDD to create an empty table. The following code does this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object EmptyTable extends App {
  val conf = new SparkConf
  val sc = new SparkContext(conf)
  // Create the SparkSession object
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

  // Schema with three columns
  val schema = StructType(
    StructField("Emp_ID", LongType, true) ::
      StructField("Emp_Name", StringType, false) ::
      StructField("Emp_Salary", LongType, false) :: Nil)

  // Empty RDD
  val dataRDD = sc.emptyRDD[Row]

  // Pass the RDD and the schema to create the DataFrame
  val newDFSchema = sparkSession.createDataFrame(dataRDD, schema)

  newDFSchema.createOrReplaceTempView("tempSchema")

  sparkSession.sql("create table Finaltable AS select * from tempSchema")
}

dirceusemighini · Accepted Answer · 2016-11-17 13:57:34Z

3

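import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Assumes hiveContext and sc are already in scope; the schema is derived from T by reflection.
def createEmptyDataFrame[T: ru.TypeTag] =
  hiveContext.createDataFrame(sc.emptyRDD[Row],
    ScalaReflection.schemaFor(ru.typeTag[T].tpe).dataType.asInstanceOf[StructType]
  )

case class RawData(id: String, firstname: String, lastname: String, age: Int)
val sourceDF = createEmptyDataFrame[RawData]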

edited Nov 17, 2016 at 13:57

dirceusemighini

answered Sep 19, 2016 at 10:21

Ravindra

braj · Accepted Answer · 2016-12-05 09:22:48Z

3

Here is a solution that creates an empty DataFrame in PySpark 2.0.0 or later.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
df = spark.createDataFrame(sc.emptyRDD(), schema)

answered Dec 5, 2016 at 9:22

braj

ss301 · Accepted Answer · 2020-09-10 15:35:21Z

3

This is helpful for testing purposes.

Seq.empty[String].toDF()

answered Sep 10, 2020 at 15:35

ss301

1 Comment

supernatural Over a year ago

How to create an empty df from a trait instead: stackoverflow.com/questions/64276952/…

Krishna Tapse · Accepted Answer · 2024-09-27 06:22:10Z

2

# Create an empty DataFrame by passing an empty list and a schema to spark.createDataFrame

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('table_name', StringType()),
    StructField('row_cnt', StringType()),
])
df = spark.createDataFrame([], schema)
display(df)  # display() is available in Databricks notebooks; use df.show() elsewhere

answered Sep 27, 2024 at 6:22

Krishna Tapse

1 Comment

M.S.Visser Sep 10 at 15:26

In Scala you need to pass an empty RDD[Row]: val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

Kity Cartman · Accepted Answer · 2020-11-22 17:30:05Z

1

I had a special requirement: I already had a dataframe but, under a certain condition, had to return an empty dataframe, so I returned df.limit(0) instead.
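
A minimal sketch of that idea, assuming a local session and some sample data. The helper `rowsOrEmpty` and its `keep` flag are hypothetical names used only for illustration; the point is that `limit(0)` keeps the schema of the input while dropping every row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("limit-zero-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input data, only here to have a DataFrame with a schema.
val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("k", "v")

// Hypothetical helper: return the input as is, or an empty DataFrame with the same schema.
def rowsOrEmpty(input: DataFrame, keep: Boolean): DataFrame =
  if (keep) input else input.limit(0)

val empty = rowsOrEmpty(df, keep = false)
empty.printSchema()     // same columns and types as df
println(empty.count())  // 0
```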

answered Nov 22, 2020 at 17:30

Kity Cartman

ZygD · Accepted Answer · 2022-06-20 19:50:42Z

0

I'd like to add the following syntax which was not yet mentioned:

Seq[(String, Integer)]().toDF("k", "v")

It makes it clear that the () part is for values. It's empty, so the dataframe is empty.

This syntax is also beneficial for adding null values manually. It just works, while other options either don't or are overly verbose.
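
As a sketch of the null-value case, assuming a local session (in spark-shell, `spark` already exists): the element type uses `Integer` (java.lang.Integer), which can hold null, whereas scala.Int cannot. The names `emptyDf` and `withNull` are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("nullable-seq-sketch").getOrCreate()
import spark.implicits._

// Empty DataFrame, using the syntax from this answer.
val emptyDf = Seq[(String, Integer)]().toDF("k", "v")

// Same element type, now with one row whose v is null.
val withNull = Seq[(String, Integer)](("a", 1), ("b", null)).toDF("k", "v")

withNull.show()
println(withNull.filter($"v".isNull).count())  // 1
```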

answered Jun 20, 2022 at 19:50

ZygD

Zack · Accepted Answer · 2023-12-20 18:55:15Z

0

We were having issues with the emptyRDD method after upgrading to Databricks Runtime 13.3 and enabling Unity Catalog in Databricks. The solution below works as a replacement in both cases.

import org.apache.spark.sql.types.{StructType, StringType}
import org.apache.spark.sql.Row
import java.util.ArrayList

val schema = new StructType()
  .add("column1", StringType, true)
  .add("column2", StringType, true)

val df = spark.createDataFrame(
  new ArrayList[Row],
  schema
)
df.count()

answered Dec 20, 2023 at 18:55

Zack

Hsiao L · Accepted Answer · 2019-07-17 00:51:06Z

-3

As of Spark 2.4.3

val df = SparkSession.builder().getOrCreate().emptyDataFrame

answered Jul 17, 2019 at 0:51

Hsiao L

1 Comment

Andrew Sklyarevsky Over a year ago

This does not solve the schema part of the question.
