
To pass a schema when reading a JSON file, we do this:

from pyspark.sql.types import (StructField, StringType, StructType, IntegerType)

data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)
df = spark.read.json('people.json', schema=final_struc)

The above code works as expected. However, I now have data in a table, which I display by:

df = sqlContext.sql("SELECT * FROM people_json")               

But if I try to pass a new schema to it using the following command, it does not work:

df2 = spark.sql("SELECT * FROM people_json", schema=final_struc)

It gives the following error:

sql() got an unexpected keyword argument 'schema'

NOTE: I am using Databricks Community Edition.

  • What am I missing?
  • How do I pass the new schema if I have data in the table instead of some JSON file?
  • Doesn't sql() take only one parameter, the query string? Commented Feb 12, 2018 at 5:28
  • @ShankarKoirala Yes. That is the issue I'm trying to figure a way out of. My question is how do I pass the new schema if I have data in a table instead of some JSON file? Commented Feb 12, 2018 at 5:31

2 Answers


You cannot apply a new schema to an already created DataFrame. However, you can change the type of each column by casting it to another data type, as below.

df = df.withColumn("column_name", df["column_name"].cast("new_datatype"))

If you need to apply a new schema, you need to convert the DataFrame to an RDD and create a new DataFrame from it, as below:

df = sqlContext.sql("SELECT * FROM people_json")
newDF = spark.createDataFrame(df.rdd, schema=schema)

Hope this helps!


3 Comments

Thank you for your answer. But doing newDF = spark.createDataFrame(df.rdd, data_schema) gives AttributeError: 'StructField' object has no attribute 'encode'.
I passed final_struc instead of data_schema, as in spark.createDataFrame(df.rdd, schema=final_struc), and it worked.
Thank you! I was using createDataFrame(df.collect(), schema=schema) and it was very slow and memory inefficient.

There is already one answer available, but I still want to add something.

  1. Create a DataFrame from an RDD
  • using toDF (which accepts either a schema or a list of column names)

    newDf = rdd.toDF(schema)
    newDf = rdd.toDF(column_name_list)

  • using createDataFrame

    newDF = spark.createDataFrame(rdd, schema)
    newDF = spark.createDataFrame(rdd, list_of_column_names)

  2. Create a DataFrame from another DataFrame

Suppose I have a DataFrame with columns name (string), marks (string), and gender (string), and I want marks as an integer:

newDF = oldDF.select("marks")
newDF_with_int = newDF.withColumn("marks", newDF["marks"].cast("integer"))

This will convert marks to integer.

