
I'm trying to create a manual schema for a DataFrame. The data I am passing in is an RDD created from JSON. Here is my initial data:

json2 = sc.parallelize(['{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'])

Then here is how the schema is specified:

from pyspark.sql.types import (ArrayType, BooleanType, DoubleType,
                               StringType, StructField, StructType)

schema = StructType(fields=[
    StructField(
        name='name',
        dataType=StringType(),
        nullable=True
    ),
    StructField(
        name='pandas',
        dataType=ArrayType(
            StructType(
                fields=[
                    StructField(
                        name='id',
                        dataType=StringType(),
                        nullable=False
                    ),
                    StructField(
                        name='zip',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='pt',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='happy',
                        dataType=BooleanType(),
                        nullable=False
                    ),
                    StructField(
                        name='attributes',
                        dataType=ArrayType(
                            elementType=DoubleType(),
                            containsNull=False
                        ),
                        nullable=True
                    )
                ]
            ),
            containsNull=True
        ),
        nullable=True
    )
])

When I call sqlContext.createDataFrame(json2, schema) and then try to show() the resulting DataFrame, I receive the following error:

ValueError: Unexpected tuple '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}' with StructType

1 Answer


First of all, json2 is just an RDD[String]. Spark has no special knowledge of the serialization format used to encode the data. Moreover, createDataFrame expects an RDD of Row, tuple, or some other product type, and that is clearly not the case here.
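To illustrate, createDataFrame is happy with an RDD of tuples (a product type) plus a matching schema. A minimal sketch, using a made-up one-column dataset:

from pyspark.sql.types import StringType, StructField, StructType

# an RDD of tuples, not raw JSON strings
names = sc.parallelize([("mission",)])

names_df = sqlContext.createDataFrame(
    names,
    StructType([StructField("name", StringType(), True)])
)
names_df.show()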

In Scala you could use

sqlContext.read.schema(schema).json(rdd) 

with RDD[String] but there are two problems:

  • this approach is not directly accessible in PySpark
  • even if it were, the schema you've created is simply invalid:

    • pandas is a struct, not an array
    • pandas.happy is a string in the data, not a boolean
    • pandas.attributes is a string in the data, not an array

The schema is used only to avoid type inference, not for type casting or any other transformations. If you want to transform the data you'll have to parse it first:

def parse(s: str) -> Row:
    return ...

rdd.map(parse).toDF(schema)

Assuming that you have JSON like this (with fixed types):

{"name": "mission", "pandas": {"attributes": [0.4, 0.5], "pt": "giant", "id": "1", "zip": "94110", "happy": true}} 

the correct schema would look as follows:

StructType([
    StructField("name", StringType(), True),
    StructField("pandas", StructType([
        StructField("attributes", ArrayType(DoubleType(), True), True),
        StructField("happy", BooleanType(), True),
        StructField("id", StringType(), True),
        StructField("pt", StringType(), True),
        StructField("zip", StringType(), True)
    ]), True)
])
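For completeness, here is a minimal sketch of what the parse function above could look like for the original string records, with schema bound to the corrected StructType; the per-field conversions are assumptions based on the sample data:

import json

from pyspark.sql import Row

def parse(s: str) -> Row:
    d = json.loads(s)
    p = d["pandas"]
    return Row(
        name=d["name"],
        pandas=Row(
            # attributes arrives as the string "[0.4, 0.5]"
            attributes=[float(x) for x in json.loads(p["attributes"])],
            # happy arrives as the string "True"
            happy=p["happy"] == "True",
            id=p["id"],
            pt=p["pt"],
            zip=p["zip"],
        ),
    )

json2.map(parse).toDF(schema).show()

Note that a Row built from keyword arguments sorts its fields alphabetically (in Spark versions before 3.0), which here happens to match the field order of the schema above.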

3 Comments

I made them all strings because of an example on the Databricks website, so I can easily change them back. How do you create an ArrayType (nested struct) in PySpark?
A nested struct is not ArrayType. A nested struct is just a struct, and a struct is simply a product type (tuple / Row). An array is an array (list); see the sketch after these comments.
Makes sense, I'm trying to convert over from Scala without knowing a whole lot of Scala.
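To make the distinction from the comments concrete, a minimal sketch (the field names are hypothetical):

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# a nested struct: one record with fixed, named fields
pandas_struct = StructField("pandas", StructType([
    StructField("pt", StringType(), True)
]), True)

# an array of structs: a variable-length list of such records
pandas_array = StructField("pandas", ArrayType(StructType([
    StructField("pt", StringType(), True)
])), True)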
