
I am trying to create an empty dataframe in Spark (Pyspark).

I am using an approach similar to one discussed in another answer, but it is not working.

This is my code

df = sqlContext.createDataFrame(sc.emptyRDD(), schema)

This is the error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
    first = rdd.first()
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty

12 Answers


Extending Joe Widen's answer, you can actually create the schema with no fields like so:

schema = StructType([])

So when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].

>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())

In Scala, if you choose to use sqlContext.emptyDataFrame and check out the schema, it will return StructType().

scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []

scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()    

At the time this answer was written, it looks like you need some sort of schema:

from pyspark.sql.types import *
field = [StructField("field1", StringType(), True)]
schema = StructType(field)

sc = spark.sparkContext
sqlContext.createDataFrame(sc.emptyRDD(), schema)

2 Comments

Could you provide a source for this claim?
Looks like it's not necessary, actually. The API docs for createDataFrame show that schema defaults to None, so there should be a way to create a DataFrame with no schema: spark.apache.org/docs/latest/api/python/pyspark.sql.html

This will work with Spark version 2.0.0 or later:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)

3 Comments

What part of this only works for 2.0 or more? It should work in 1.6.1, right @braj259?
The Spark initialization part. From 2.0 onwards there is just one SparkSession for everything, so initialization is syntactically a little different.
But if you change sc = spark.sparkContext to sc = SparkContext(), then I think it should be compatible with 1.6.x, right?

spark.range(0).drop("id")

This creates a DataFrame with an "id" column and no rows then drops the "id" column, leaving you with a truly empty DataFrame.
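
A minimal sketch of this approach, assuming an active SparkSession named spark (the app name here is just a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("empty-df-demo").getOrCreate()

    # range(0) yields a zero-row DataFrame with a single "id" column;
    # dropping "id" leaves zero rows and zero columns
    df = spark.range(0).drop("id")

    print(df.count())        # 0 rows
    print(len(df.columns))   # 0 columns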



If you want an empty DataFrame based on an existing one, simply limit the rows to 0. In PySpark:

emptyDf = existingDf.limit(0)
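
For example (a sketch assuming an active SparkSession; existingDf here is a hypothetical two-column DataFrame):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("limit-demo").getOrCreate()

    # Hypothetical existing DataFrame with some rows
    existingDf = spark.createDataFrame([("a", 1), ("b", 2)], ["name", "value"])

    # limit(0) drops all rows but keeps the schema intact
    emptyDf = existingDf.limit(0)

    print(emptyDf.count())                      # 0
    print(emptyDf.schema == existingDf.schema)  # True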



You can just use something like this (note that this creates a DataFrame with one row, so it is not actually empty):

pivot_table = sparkSession.createDataFrame([("99", "99")], ["col1", "col2"])


import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType

spark = SparkSession.builder.appName('SparkPractice').getOrCreate()

schema = StructType([
  StructField('firstname', StringType(), True),
  StructField('middlename', StringType(), True),
  StructField('lastname', StringType(), True)
  ])

df = spark.createDataFrame(spark.sparkContext.emptyRDD(),schema)
df.printSchema()



This is a roundabout but simple way to create an empty Spark DataFrame with an inferred schema:

from pyspark.sql.functions import col

# Initialize a Spark DataFrame using one row of data with the desired schema
init_sdf = spark.createDataFrame([('a_string', 0, 0)], ['name', 'index', 'seq_#'])
# Remove the row; the schema remains
empty_sdf = init_sdf.where(col('name') == 'not_match')
empty_sdf.printSchema()
# Output:
# root
#  |-- name: string (nullable = true)
#  |-- index: long (nullable = true)
#  |-- seq_#: long (nullable = true)


import spark.implicits._

Seq.empty[String].toDF()

This will create an empty DataFrame with a single value column. Helpful for testing purposes and the like. (Scala Spark)



In Spark 3.1.2, the spark.sparkContext.emptyRDD() function throws an error. Using the schema, passing an empty list will work:

df = spark.createDataFrame([], schema)
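
A fuller sketch of this, with an explicit schema (the column names here are just placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("empty-list-demo").getOrCreate()

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])

    # Passing an empty list of rows with an explicit schema avoids
    # schema inference, which fails on empty input
    df = spark.createDataFrame([], schema)

    df.printSchema()
    print(df.count())  # 0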



You can do it by loading an empty file (parquet, JSON, etc.) like this:

df = sqlContext.read.json("my_empty_file.json")

Then when you try to check the schema you'll see:

>>> df.printSchema()
root

In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch between Scala and Python, you can use this method to create one.



You can create an empty DataFrame by using the following syntax in PySpark:

df = spark.createDataFrame([], ["col1", "col2", ...])

where [] represents the empty list of rows for col1 and col2. Then you can register it as a temp view for your SQL queries:

df.createOrReplaceTempView("artist")

1 Comment

It says "Cannot infer schema from empty dataframe"
