
I want to create a sample single-column DataFrame, but the following code is not working:

df = spark.createDataFrame(["10","11","13"], ("age"))

## ValueError
## ...
## ValueError: Could not parse datatype: age

The expected result:

age
10
11
13

7 Answers


the following code is not working

With single-element records you need to pass the schema as a type, e.g. a type string:

spark.createDataFrame(["10","11","13"], "string").toDF("age")

or a DataType instance:

from pyspark.sql.types import StringType

spark.createDataFrame(["10","11","13"], StringType()).toDF("age")

To set the column name at creation time, the elements should be tuples and the schema a sequence of column names:

spark.createDataFrame([("10", ), ("11", ), ("13",  )], ["age"])

2 Comments

"With single element you need a schema as type" This is exactly what I was missing, thank you
This helped me. My code was not working because this unusual trailing comma was absent.
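The trailing comma matters because in Python parentheses alone do not create a tuple; a one-element tuple needs the comma. A plain-Python illustration:

```python
# ("10") is just the string "10" — the parentheses only group the expression
assert ("10") == "10"

# ("10",) is a one-element tuple, which is what createDataFrame expects per row
assert ("10",) == tuple(["10"])
assert isinstance(("10",), tuple)
```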

Well, there is a pretty easy method for creating a sample DataFrame in PySpark:

>>> df = sc.parallelize([[1,2,3], [2,3,4]]).toDF()
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+

To create it with column names:

>>> df1 = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a", "b", "c"))
>>> df1.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+

This way there is no need to define a schema either. Hopefully this is the simplest way.


from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])

Output: (no need to define schema)

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  x|  y|  3|
+---+---+---+

1 Comment

This is deprecated in newer Spark versions. Rather use df = spark.createDataFrame([Row(a="x", b="y", c="3")]) (with from pyspark.sql import Row).

For pandas + pyspark users, if you've already installed pandas in the cluster, you can do this simply:

import pandas as pd

# create pandas dataframe
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# convert to spark dataframe
df = spark.createDataFrame(df)
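Spark infers the schema from the pandas dtypes, so it can help to check them before converting. A small plain-pandas sketch (the column names are just the example data above):

```python
import pandas as pd

pdf = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# int64 columns become LongType and object (string) columns become StringType
# once spark.createDataFrame(pdf) infers the schema
print(pdf.dtypes)
```

The reverse direction also exists: spark_df.toPandas() returns a pandas DataFrame.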

Local Spark Setup

import findspark
findspark.init()
import pyspark

spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master("local")
         .getOrCreate())



See my farsante lib for creating a DataFrame with fake data:

import farsante

df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()
+----------+---------+
|first_name|last_name|
+----------+---------+
|     Tommy|     Hess|
|    Arthur| Melendez|
|  Clemente|    Blair|
|    Wesley|   Conrad|
|    Willis|   Dunlap|
|     Bruna|  Sellers|
|     Tonda| Schwartz|
+----------+---------+

Here's how to explicitly specify the schema when creating the PySpark DataFrame:

from pyspark.sql.types import StructType, StructField, IntegerType

df = spark.createDataFrame(
  [(10,), (11,), (13,)],
  StructType([StructField("some_int", IntegerType(), True)]))

df.show()
+--------+
|some_int|
+--------+
|      10|
|      11|
|      13|
+--------+



You can also try something like this:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is the SparkContext
sample = sqlContext.createDataFrame(
    [
        ('qwe', 23),  # enter your data here
        ('rty', 34),
        ('yui', 56),
    ],
    ['abc', 'def']  # the column labels go here
)



There are several ways to create a DataFrame; creating one is among the first steps you learn when working with PySpark.

I assume you already have data, columns, and an RDD.

1) df = rdd.toDF()
2) df = rdd.toDF(columns)  # assigns column names
3) df = spark.createDataFrame(rdd).toDF(*columns)
4) df = spark.createDataFrame(data).toDF(*columns)
5) df = spark.createDataFrame(rowData,columns)

Besides these, you can find several more examples of creating a PySpark DataFrame in the documentation.
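A plain-Python note on the *columns unpacking used in options 3 and 4: the star expands the list into separate positional arguments, which is how toDF receives one name per column. A sketch with a hypothetical stand-in function (todf_stub is not a real PySpark API):

```python
columns = ["age", "name"]

def todf_stub(*names):
    # hypothetical stand-in: DataFrame.toDF likewise receives each column
    # name as a separate positional argument
    return list(names)

result = todf_stub(*columns)  # equivalent to todf_stub("age", "name")
print(result)
```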

