
I am trying to create an empty dataframe in Spark (Pyspark).

I am using an approach similar to one discussed in another answer, but it is not working.

This is my code

df = sqlContext.createDataFrame(sc.emptyRDD(), schema)

This is the error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
    first = rdd.first()
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty

12 Answers


Extending Joe Widen's answer, you can actually create the schema with no fields like so:

schema = StructType([])

So when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].

>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())

In Scala, if you choose to use sqlContext.emptyDataFrame and check out the schema, it will return StructType().

scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []

scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()    

At the time this answer was written, it looks like you need some sort of schema:

from pyspark.sql.types import *
field = [StructField("field1", StringType(), True)]
schema = StructType(field)

sc = spark.sparkContext
sqlContext.createDataFrame(sc.emptyRDD(), schema)

2 Comments

Could you provide a source for this claim?
Looks like it's not necessary, actually. The API docs for createDataFrame show that schema defaults to None, so there should be a way to create a DataFrame with no schema: spark.apache.org/docs/latest/api/python/pyspark.sql.html

This will work with Spark version 2.0.0 or later:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)

3 Comments

What part of this only works for 2.0 or more? It should work in 1.6.1, right @braj259?
The Spark initialization part. From 2.0 onwards there is just one SparkSession for everything, so initialization is syntactically a little different.
But if you change sc = spark.sparkContext to sc = SparkContext(), then I think it should be compatible with 1.6.x, right?

spark.range(0).drop("id")

This creates a DataFrame with an "id" column and no rows then drops the "id" column, leaving you with a truly empty DataFrame.
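
A minimal sketch of this approach, assuming an active SparkSession named spark (the app name here is just a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("empty-df-demo").getOrCreate()

    # range(0) yields a zero-row DataFrame with a single "id" column;
    # dropping "id" leaves zero rows and zero columns
    df = spark.range(0).drop("id")

    print(df.count())        # 0 rows
    print(len(df.columns))   # 0 columns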



If you want an empty DataFrame based on an existing one, simply limit the rows to 0. In PySpark:

emptyDf = existingDf.limit(0)
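
For example (a sketch assuming an active SparkSession; existingDf here is a hypothetical two-column DataFrame):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("limit-demo").getOrCreate()

    # Hypothetical existing DataFrame with some rows
    existingDf = spark.createDataFrame([("a", 1), ("b", 2)], ["name", "value"])

    # limit(0) drops all rows but keeps the schema intact
    emptyDf = existingDf.limit(0)

    print(emptyDf.count())                      # 0
    print(emptyDf.schema == existingDf.schema)  # True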



You can just use something like this (note that this creates a DataFrame with one row, so it is not actually empty):

pivot_table = sparkSession.createDataFrame([("99", "99")], ["col1", "col2"])


import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType

spark = SparkSession.builder.appName('SparkPractice').getOrCreate()

schema = StructType([
  StructField('firstname', StringType(), True),
  StructField('middlename', StringType(), True),
  StructField('lastname', StringType(), True)
  ])

df = spark.createDataFrame(spark.sparkContext.emptyRDD(),schema)
df.printSchema()



This is a roundabout but simple way to create an empty Spark DataFrame with an inferred schema:

from pyspark.sql.functions import col

# Initialize a Spark DataFrame using one row of data with the desired schema
init_sdf = spark.createDataFrame([('a_string', 0, 0)], ['name', 'index', 'seq_#'])
# Remove the row; the schema remains
empty_sdf = init_sdf.where(col('name') == 'not_match')
empty_sdf.printSchema()
# Output:
# root
#  |-- name: string (nullable = true)
#  |-- index: long (nullable = true)
#  |-- seq_#: long (nullable = true)


import spark.implicits._

Seq.empty[String].toDF()

This will create an empty DataFrame with a single value column. Helpful for testing purposes and the like. (Scala Spark)



In Spark 3.1.2, the spark.sparkContext.emptyRDD() function throws an error. Using the schema, passing an empty list will work:

df = spark.createDataFrame([], schema)
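
A fuller sketch of this, with an explicit schema (the column names here are just placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("empty-list-demo").getOrCreate()

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])

    # Passing an empty list of rows with an explicit schema avoids
    # schema inference, which fails on empty input
    df = spark.createDataFrame([], schema)

    df.printSchema()
    print(df.count())  # 0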



You can do it by loading an empty file (parquet, JSON, etc.) like this:

df = sqlContext.read.json("my_empty_file.json")

Then when you try to check the schema you'll see:

>>> df.printSchema()
root

In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch between Scala and Python, you can use this method to create one.



You can create an empty DataFrame by using the following syntax in PySpark:

df = spark.createDataFrame([], ["col1", "col2", ...])

where [] represents the empty list of rows for col1 and col2. Then you can register it as a temp view for your SQL queries:

df.createOrReplaceTempView("artist")

1 Comment

It says "Cannot infer schema from empty dataframe"
