I'm trying to define a schema manually for a DataFrame. The data I am passing in is an RDD of JSON strings. Here is my initial data:
json2 = sc.parallelize(['{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'])
Here is how I specify the schema:
from pyspark.sql.types import (ArrayType, BooleanType, DoubleType,
                               StringType, StructField, StructType)

schema = StructType(fields=[
    StructField(
        name='name',
        dataType=StringType(),
        nullable=True
    ),
    StructField(
        name='pandas',
        dataType=ArrayType(
            StructType(
                fields=[
                    StructField(
                        name='id',
                        dataType=StringType(),
                        nullable=False
                    ),
                    StructField(
                        name='zip',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='pt',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='happy',
                        dataType=BooleanType(),
                        nullable=False
                    ),
                    StructField(
                        name='attributes',
                        dataType=ArrayType(
                            elementType=DoubleType(),
                            containsNull=False
                        ),
                        nullable=True
                    )
                ]
            ),
            containsNull=True
        ),
        nullable=True
    )
])
When I call sqlContext.createDataFrame(json2, schema) and then call show() on the resulting DataFrame, I get the following error:
ValueError: Unexpected tuple '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}' with StructType
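In case it helps narrow this down, here is a quick check outside Spark of what one element of the RDD looks like, using only Python's standard json module. It is just a sketch for inspecting the data, not a fix: the variable names are made up for illustration, and I'm only assuming that Spark sees each element exactly as it is passed to sc.parallelize.

```python
import json

# The same record that is passed to sc.parallelize above.
raw = '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'

# Parse the string the way a JSON reader would.
record = json.loads(raw)

# Each RDD element is a plain str, not a tuple, list, or dict.
raw_type = type(raw).__name__

# "pandas" parses to a single JSON object (dict), not a list of objects.
pandas_type = type(record["pandas"]).__name__

# "attributes" and "happy" are JSON strings, not an array or a boolean.
attributes_type = type(record["pandas"]["attributes"]).__name__
happy_type = type(record["pandas"]["happy"]).__name__

print(raw_type, pandas_type, attributes_type, happy_type)
```

So each element is an unparsed string, pandas is a single object rather than a list, and attributes and happy are quoted strings. The schema above declares pandas as an ArrayType, attributes as an array of doubles, and happy as a boolean, so any of these differences may be related to the error, though I'm not sure which one.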