
I am new to Spark and was playing around with pyspark.sql. According to the pyspark.sql documentation, one can set up a Spark DataFrame and its schema like this:

from pyspark.sql import SparkSession
from pyspark.sql.types import (StringType, IntegerType, TimestampType,
                               StructType, StructField)

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField('Name', StringType(), True),
                     StructField('DateTime', TimestampType(), True),
                     StructField('Age', IntegerType(), True)])

# create dataframe: parse the CSV file and apply the schema to its columns
df3 = spark.read.csv('./some csv_to_play_around.csv', schema=schema)

My question is: what does the True stand for in each StructField of the schema above? I can't seem to find it in the documentation. Thanks in advance.

2 Answers


It indicates whether the column allows null values: True means nullable, False means not nullable.

StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable indicates whether values of this field can be null.

Refer to the Spark SQL and DataFrame Guide for more information.


2 Comments

Be aware that this "feature" is known to be unreliable. Test it before using it; if I were you, I would not rely on it.
Maybe this is an old post, but thank you so much @yhshen for the resolution. I literally spent 5-6 hours trying to find the issue.

You can also use a datatype string:

schema = 'Name STRING, DateTime TIMESTAMP, Age INTEGER'

There isn't much documentation on datatype strings, though the docs do mention them. They are much more compact and readable than a StructType definition.
