
I want to create a sample single-column DataFrame, but the following code is not working:

df = spark.createDataFrame(["10","11","13"], ("age"))

## ValueError
## ...
## ValueError: Could not parse datatype: age

The expected result:

age
10
11
13

7 Answers


the following code is not working

With single-element records you need to pass the schema as a type, e.g. a type string:

spark.createDataFrame(["10","11","13"], "string").toDF("age")

or a DataType instance:

from pyspark.sql.types import StringType

spark.createDataFrame(["10","11","13"], StringType()).toDF("age")

To set the column name at creation time, the elements should be tuples and the schema a sequence of column names:

spark.createDataFrame([("10", ), ("11", ), ("13",  )], ["age"])

2 Comments

"With single element you need a schema as type" This is exactly what I was missing, thank you
This helped me. My code was not working because this unusual trailing comma was absent.
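The trailing comma matters because in Python parentheses alone do not create a tuple; a one-element tuple needs the comma. A plain-Python illustration:

```python
# ("10") is just the string "10" — the parentheses only group the expression
assert ("10") == "10"

# ("10",) is a one-element tuple, which is what createDataFrame expects per row
assert ("10",) == tuple(["10"])
assert isinstance(("10",), tuple)
```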

Well, there is a pretty easy method for creating a sample DataFrame in PySpark:

>>> df = sc.parallelize([[1,2,3], [2,3,4]]).toDF()
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+

To create it with column names:

>>> df1 = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a", "b", "c"))
>>> df1.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+

This way there is no need to define a schema either. Hopefully this is the simplest way.


from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])

Output: (no need to define schema)

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  x|  y|  3|
+---+---+---+

1 Comment

This is deprecated in newer Spark versions. Rather use df = spark.createDataFrame([Row(a="x", b="y", c="3")]) (with from pyspark.sql import Row).

For pandas + pyspark users, if you've already installed pandas in the cluster, you can do this simply:

import pandas as pd

# create pandas dataframe
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# convert to spark dataframe
df = spark.createDataFrame(df)
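Spark infers the schema from the pandas dtypes, so it can help to check them before converting. A small plain-pandas sketch (the column names are just the example data above):

```python
import pandas as pd

pdf = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# int64 columns become LongType and object (string) columns become StringType
# once spark.createDataFrame(pdf) infers the schema
print(pdf.dtypes)
```

The reverse direction also exists: spark_df.toPandas() returns a pandas DataFrame.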

Local Spark Setup

import findspark
findspark.init()
import pyspark

spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master("local")
         .getOrCreate())



See my farsante lib for creating a DataFrame with fake data:

import farsante

df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()
+----------+---------+
|first_name|last_name|
+----------+---------+
|     Tommy|     Hess|
|    Arthur| Melendez|
|  Clemente|    Blair|
|    Wesley|   Conrad|
|    Willis|   Dunlap|
|     Bruna|  Sellers|
|     Tonda| Schwartz|
+----------+---------+

Here's how to explicitly specify the schema when creating the PySpark DataFrame:

from pyspark.sql.types import StructType, StructField, IntegerType

df = spark.createDataFrame(
  [(10,), (11,), (13,)],
  StructType([StructField("some_int", IntegerType(), True)]))

df.show()
+--------+
|some_int|
+--------+
|      10|
|      11|
|      13|
+--------+



You can also try something like this:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is the SparkContext
sample = sqlContext.createDataFrame(
    [
        ('qwe', 23),  # enter your data here
        ('rty', 34),
        ('yui', 56),
    ],
    ['abc', 'def']  # the column labels go here
)



There are several ways to create a DataFrame; creating one is among the first steps you learn when working with PySpark.

I assume you already have data, columns, and an RDD.

1) df = rdd.toDF()
2) df = rdd.toDF(columns)  # assigns column names
3) df = spark.createDataFrame(rdd).toDF(*columns)
4) df = spark.createDataFrame(data).toDF(*columns)
5) df = spark.createDataFrame(rowData,columns)

Besides these, you can find several more examples of creating a PySpark DataFrame in the documentation.
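A plain-Python note on the *columns unpacking used in options 3 and 4: the star expands the list into separate positional arguments, which is how toDF receives one name per column. A sketch with a hypothetical stand-in function (todf_stub is not a real PySpark API):

```python
columns = ["age", "name"]

def todf_stub(*names):
    # hypothetical stand-in: DataFrame.toDF likewise receives each column
    # name as a separate positional argument
    return list(names)

result = todf_stub(*columns)  # equivalent to todf_stub("age", "name")
print(result)
```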

