Pyspark: Create Schema from Json Schema involving Array columns

Question

I have defined the schema for my DataFrame in a JSON file as follows:

{
    "table1":{
        "fields":[
            {"metadata":{}, "name":"first_name", "type":"string", "nullable":false},
            {"metadata":{}, "name":"last_name", "type":"string", "nullable":false},
            {"metadata":{}, "name":"subjects", "type":"array","items":{"type":["string", "string"]}, "nullable":false},
            {"metadata":{}, "name":"marks", "type":"array","items":{"type":["integer", "integer"]}, "nullable":false},
            {"metadata":{}, "name":"dept", "type":"string", "nullable":false}       
        ]
    }

}

Example JSON data:

{
    "table1": [
        {
            "first_name":"john",
            "last_name":"doe",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        },
        {
            "first_name":"dan",
            "last_name":"steyn",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        },
        {
            "first_name":"rose",
            "last_name":"wayne",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"            
        },
        {
            "first_name":"nat",
            "last_name":"lee",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        },
        {
            "first_name":"jim",
            "last_name":"lim",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        }       
    ]
}

I want to create the equivalent Spark schema from this JSON file. Below is my code (reference: Create spark dataframe schema from json schema representation):

import json
from pyspark.sql.types import StructType

with open(schemaFile) as s:
    schema = json.load(s)["table1"]
    source_schema = StructType.fromJson(schema)

The above code works fine if I don't have any array columns, but it throws the error below if the schema contains array columns.

"Could not parse datatype: array" ("Could not parse datatype: %s" json_value)

Have you tried doing it backward? Create the schema as a Python object, including arrays, then convert it to JSON and see what the differences are.
– Steven, Commented May 28, 2019 at 10:19
The provided schema is not valid; there is a comma missing after "items":{"type":["string", "string"]}. I think it is better to post your actual data, or just load the JSON in Spark and then export the schema that Spark created.
– abiratsis, Commented May 29, 2019 at 18:08
@AlexandrosBiratsis: Schema updated. My actual data is a CSV file. I am trying to include this schema in a JSON file that holds multiple schemas; when reading the CSV file in Spark, I will refer to this JSON file to get the correct schema, which provides the correct column headers and data types.
– blackfury, Commented May 30, 2019 at 2:42
Yes, I see, @blackfury, although your schema is again invalid! "items":{"type":["string", "string"]} is not a valid definition; what exactly are you trying to say here? Can you post some actual JSON data?
– abiratsis, Commented May 30, 2019 at 8:56

abiratsis · Accepted Answer · 2019-05-30 10:51:03Z

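In your case there was an issue with the representation of the arrays. Spark expects the "type" of an array column to be a nested object with "type", "elementType", and "containsNull" keys, not the bare string "array" with a separate "items" entry. The correct syntax is: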

{ "metadata": {}, "name": "marks", "nullable": true, "type": {"containsNull": true, "elementType": "long", "type": "array" } }
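Applied to the whole "table1" entry from the question, the corrected file could be built along the following lines. This is a minimal sketch, not the only valid layout: the helper names simple_field and array_field are hypothetical, "integer" is kept as the element type because the question used it ("long" is what Spark infers when it reads JSON), and the PySpark import is made optional only so that the dictionary-building part can be checked without Spark installed.

```python
import json


def simple_field(name, type_name, nullable=False):
    """Build a Spark-style field entry for an atomic type such as "string"."""
    return {"metadata": {}, "name": name, "type": type_name, "nullable": nullable}


def array_field(name, element_type, nullable=False, contains_null=True):
    """Build a Spark-style field entry whose "type" is a nested array object."""
    return {
        "metadata": {},
        "name": name,
        "nullable": nullable,
        "type": {
            "type": "array",
            "elementType": element_type,
            "containsNull": contains_null,
        },
    }


# Corrected version of the "table1" entry from the question (hypothetical helpers).
table1_schema = {
    "type": "struct",
    "fields": [
        simple_field("first_name", "string"),
        simple_field("last_name", "string"),
        array_field("subjects", "string"),
        array_field("marks", "integer"),  # assumption: "integer", as in the question
        simple_field("dept", "string"),
    ],
}

# Text to store in the schema file, keyed by table name as in the question.
schema_file_text = json.dumps({"table1": table1_schema}, indent=4)

# PySpark is optional here; without it the sketch just prints the JSON text.
try:
    from pyspark.sql.types import StructType
except ImportError:
    StructType = None

if StructType is not None:
    source_schema = StructType.fromJson(json.loads(schema_file_text)["table1"])
    print(source_schema.simpleString())
else:
    print(schema_file_text)
```

With the arrays written this way, the StructType.fromJson call from the question should no longer hit the "Could not parse datatype: array" error, since each array type is now an object rather than a bare string.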

To retrieve the schema from JSON, you can use the following PySpark snippet:

jsonData = """{
    "table1": [{
            "first_name": "john",
            "last_name": "doe",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "dan",
            "last_name": "steyn",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "rose",
            "last_name": "wayne",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "nat",
            "last_name": "lee",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "jim",
            "last_name": "lim",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        }
    ]
}"""

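# spark is an existing SparkSession; the JSON string is read as a one-element RDD
df = spark.read.json(spark.sparkContext.parallelize([jsonData]))

# Export the inferred schema as a JSON string
df.schema.json()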

This should output:

{
    "fields": [{
        "metadata": {},
        "name": "table1",
        "nullable": true,
        "type": {
            "containsNull": true,
            "elementType": {
                "fields": [{
                    "metadata": {},
                    "name": "dept",
                    "nullable": true,
                    "type": "string"
                }, {
                    "metadata": {},
                    "name": "first_name",
                    "nullable": true,
                    "type": "string"
                }, {
                    "metadata": {},
                    "name": "last_name",
                    "nullable": true,
                    "type": "string"
                }, {
                    "metadata": {},
                    "name": "marks",
                    "nullable": true,
                    "type": {
                        "containsNull": true,
                        "elementType": "long",
                        "type": "array"
                    }
                }, {
                    "metadata": {},
                    "name": "subjects",
                    "nullable": true,
                    "type": {
                        "containsNull": true,
                        "elementType": "string",
                        "type": "array"
                    }
                }],
                "type": "struct"
            },
            "type": "array"
        }
    }],
    "type": "struct"
}

Alternatively, you could use df.schema.simpleString(), which returns a more compact schema format:

struct<table1:array<struct<dept:string,first_name:string,last_name:string,marks:array<bigint>,subjects:array<string>>>>

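Finally, you can store the schema above in a file and load it later with:

import json
from pyspark.sql.types import StructType
# schema_json is the JSON string read back from the stored schema file
new_schema = StructType.fromJson(json.loads(schema_json))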

This is the same call you already use. Note that you can apply the same process dynamically to any JSON data.
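The comments on the question mention keeping several schemas in one JSON file and picking one by table name. A minimal sketch of that lookup is shown here, using only the standard library so it runs without Spark. The function names save_schemas and load_schema_dict, the file name schemas.json, and the "table2" entry are all hypothetical; in practice each stored value would come from json.loads(df.schema.json()).

```python
import json
import os
import tempfile


def save_schemas(path, schemas):
    """Write a {table_name: schema_dict} mapping to a single JSON file."""
    with open(path, "w") as f:
        json.dump(schemas, f, indent=4)


def load_schema_dict(path, table_name):
    """Return the schema dict stored under table_name, or raise KeyError."""
    with open(path) as f:
        schemas = json.load(f)
    if table_name not in schemas:
        raise KeyError(f"no schema stored for table {table_name!r}")
    return schemas[table_name]


# Small hand-written schemas in Spark's JSON layout (illustrative only).
schemas = {
    "table1": {
        "type": "struct",
        "fields": [
            {"metadata": {}, "name": "dept", "nullable": True, "type": "string"},
            {
                "metadata": {},
                "name": "marks",
                "nullable": True,
                "type": {"type": "array", "elementType": "long", "containsNull": True},
            },
        ],
    },
    "table2": {
        "type": "struct",
        "fields": [
            {"metadata": {}, "name": "id", "nullable": False, "type": "long"},
        ],
    },
}

# Store every schema in one file, then fetch a single one by table name.
schema_path = os.path.join(tempfile.mkdtemp(), "schemas.json")
save_schemas(schema_path, schemas)
table1_dict = load_schema_dict(schema_path, "table1")

# With PySpark available, the dict can then be passed on:
#   StructType.fromJson(table1_dict)
```

Keeping the stored values in the exact layout that df.schema.json() produces means nothing has to be translated before the dictionary is handed to StructType.fromJson.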
