
To pass a schema when reading a JSON file, we do this:

from pyspark.sql.types import (StructField, StringType, StructType, IntegerType)

data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)
df = spark.read.json('people.json', schema=final_struc)

The above code works as expected. However, I now have data in a table, which I display by:

df = sqlContext.sql("SELECT * FROM people_json")               

But if I try to pass a new schema to it using the following command, it does not work:

df2 = spark.sql("SELECT * FROM people_json", schema=final_struc)

It gives the following error:

sql() got an unexpected keyword argument 'schema'

NOTE: I am using Databricks Community Edition.

  • What am I missing?
  • How do I pass the new schema if I have data in the table instead of some JSON file?
  • Doesn't sql() take only one parameter, the query string? Commented Feb 12, 2018 at 5:28
  • @ShankarKoirala Yes. That is the issue I'm trying to figure a way out of. My question is how do I pass the new schema if I have data in a table instead of some JSON file? Commented Feb 12, 2018 at 5:31

2 Answers


You cannot apply a new schema to an already created DataFrame. However, you can change the type of each column by casting it to another data type, as below.

df = df.withColumn("column_name", df["column_name"].cast("new_datatype"))

If you need to apply a new schema, you need to convert the DataFrame to an RDD and create a new DataFrame from it, as below:

df = sqlContext.sql("SELECT * FROM people_json")
newDF = spark.createDataFrame(df.rdd, schema=schema)

Hope this helps!


3 Comments

Thank you for your answer. But doing newDF = spark.createDataFrame(df.rdd, data_schema) gives AttributeError: 'StructField' object has no attribute 'encode'.
I passed final_struc instead of data_schema, as in spark.createDataFrame(df.rdd, schema=final_struc), and it worked.
Thank you! I was using createDataFrame(df.collect(), schema=schema) and it was very slow and memory inefficient.

There is already one answer available, but I still want to add something.

  1. Create a DataFrame from an RDD
  • using toDF (which accepts either a schema or a list of column names)

    newDf = rdd.toDF(schema)
    newDf = rdd.toDF(column_name_list)

  • using createDataFrame

    newDF = spark.createDataFrame(rdd, schema)
    newDF = spark.createDataFrame(rdd, list_of_column_names)

  2. Create a DataFrame from another DataFrame

Suppose I have a DataFrame with columns name (string), marks (string), and gender (string), and I want marks as an integer:

newDF = oldDF.select("marks")
newDF_with_int = newDF.withColumn("marks", newDF["marks"].cast("integer"))

This will convert marks to integer.

