
I have defined the schema for my DataFrame in a JSON file as follows:

{
    "table1":{
        "fields":[
            {"metadata":{}, "name":"first_name", "type":"string", "nullable":false},
            {"metadata":{}, "name":"last_name", "type":"string", "nullable":false},
            {"metadata":{}, "name":"subjects", "type":"array","items":{"type":["string", "string"]}, "nullable":false},
            {"metadata":{}, "name":"marks", "type":"array","items":{"type":["integer", "integer"]}, "nullable":false},
            {"metadata":{}, "name":"dept", "type":"string", "nullable":false}       
        ]
    }

}

Example JSON data:

{
    "table1": [
        {
            "first_name":"john",
            "last_name":"doe",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        },
        {
            "first_name":"dan",
            "last_name":"steyn",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        },
        {
            "first_name":"rose",
            "last_name":"wayne",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"            
        },
        {
            "first_name":"nat",
            "last_name":"lee",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        },
        {
            "first_name":"jim",
            "last_name":"lim",
            "subjects":["maths","science"],
            "marks":[90,67],
            "dept":"abc"        
        }       
    ]
}

I want to create the equivalent Spark schema from this JSON file. Below is my code (reference: Create spark dataframe schema from json schema representation):

import json
from pyspark.sql.types import StructType

with open(schemaFile) as s:
    schema = json.load(s)["table1"]
source_schema = StructType.fromJson(schema)

The above code works fine if I don't have any array columns, but it throws the error below if the schema contains array columns.

"Could not parse datatype: array" ("Could not parse datatype: %s" json_value)

  • Have you tried doing it backward? Create the schema as a Python object, including arrays, then convert it to JSON and see what the differences are. Commented May 28, 2019 at 10:19
  • The provided schema is not valid: there is a comma missing after "items":{"type":["string", "string"]}. I think it is better to post your actual data, or just load the JSON in Spark and then export the schema that Spark created. Commented May 29, 2019 at 18:08
  • @AlexandrosBiratsis: Schema updated. My actual data is a CSV file. I am trying to include this schema in a JSON file that holds multiple schemas; while reading the CSV file in Spark, I will refer to this JSON file to get the correct schema, so that the column headers and data types are right. Commented May 30, 2019 at 2:42
  • Yes I see @blackfury although your schema is again invalid! "items":{"type":["string", "string"]} is not a valid definition, what exactly are you trying to say here? Can you post some actual json data? Commented May 30, 2019 at 8:56
  • @AlexandrosBiratsis: Added sample JSON data. Commented May 30, 2019 at 9:06

1 Answer


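In your case, the problem is the representation of the arrays. Spark expects the array type to be a nested object under "type", with "elementType" and "containsNull" keys. The correct syntax is: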

{ "metadata": {}, "name": "marks", "nullable": true, "type": { "containsNull": true, "elementType": "long", "type": "array" } }
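As a rough illustration of the difference, the sketch below rewrites fields from the layout used in the question into the layout Spark expects. The helper names (to_spark_field, to_spark_schema) and the SPARK_TYPE_NAMES mapping are hypothetical and not part of PySpark; the sketch also assumes that every "items" entry describes a single element type and that array elements may be null.

```python
import json

# Hypothetical mapping from the type names used in the question to Spark's JSON type names.
SPARK_TYPE_NAMES = {"string": "string", "integer": "integer", "long": "long"}


def to_spark_field(field):
    """Convert one field from the question's layout to Spark's JSON layout."""
    spark_field = {
        "metadata": field.get("metadata", {}),
        "name": field["name"],
        "nullable": field.get("nullable", True),
    }
    if field["type"] == "array":
        item_type = field["items"]["type"]
        # The question repeats the element type in a list; assume all entries agree.
        if isinstance(item_type, list):
            item_type = item_type[0]
        spark_field["type"] = {
            "type": "array",
            "elementType": SPARK_TYPE_NAMES[item_type],
            "containsNull": True,  # assumption: elements are allowed to be null
        }
    else:
        spark_field["type"] = SPARK_TYPE_NAMES[field["type"]]
    return spark_field


def to_spark_schema(table):
    """Wrap the converted fields in a struct, as df.schema.json() would emit."""
    return {"type": "struct", "fields": [to_spark_field(f) for f in table["fields"]]}


question_schema = {
    "table1": {
        "fields": [
            {"metadata": {}, "name": "first_name", "type": "string", "nullable": False},
            {"metadata": {}, "name": "last_name", "type": "string", "nullable": False},
            {"metadata": {}, "name": "subjects", "type": "array",
             "items": {"type": ["string", "string"]}, "nullable": False},
            {"metadata": {}, "name": "marks", "type": "array",
             "items": {"type": ["integer", "integer"]}, "nullable": False},
            {"metadata": {}, "name": "dept", "type": "string", "nullable": False},
        ]
    }
}

spark_schema = to_spark_schema(question_schema["table1"])
print(json.dumps(spark_schema, indent=2))
```

With PySpark available, the resulting dictionary should be acceptable to StructType.fromJson(spark_schema). Writing the schema file directly in Spark's layout is probably simpler than converting it on every run.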

To retrieve the schema from the JSON data, you can use the following PySpark snippet:

jsonData = """{
    "table1": [{
            "first_name": "john",
            "last_name": "doe",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "dan",
            "last_name": "steyn",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "rose",
            "last_name": "wayne",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "nat",
            "last_name": "lee",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        },
        {
            "first_name": "jim",
            "last_name": "lim",
            "subjects": ["maths", "science"],
            "marks": [90, 67],
            "dept": "abc"
        }
    ]
}"""

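# Let Spark infer the schema from the JSON string
df = spark.read.json(sc.parallelize([jsonData]))

# Export the inferred schema as a JSON string
print(df.schema.json())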

This should output the following (pretty-printed here for readability):

{
    "fields": [{
        "metadata": {},
        "name": "table1",
        "nullable": true,
        "type": {
            "containsNull": true,
            "elementType": {
                "fields": [{
                    "metadata": {},
                    "name": "dept",
                    "nullable": true,
                    "type": "string"
                }, {
                    "metadata": {},
                    "name": "first_name",
                    "nullable": true,
                    "type": "string"
                }, {
                    "metadata": {},
                    "name": "last_name",
                    "nullable": true,
                    "type": "string"
                }, {
                    "metadata": {},
                    "name": "marks",
                    "nullable": true,
                    "type": {
                        "containsNull": true,
                        "elementType": "long",
                        "type": "array"
                    }
                }, {
                    "metadata": {},
                    "name": "subjects",
                    "nullable": true,
                    "type": {
                        "containsNull": true,
                        "elementType": "string",
                        "type": "array"
                    }
                }],
                "type": "struct"
            },
            "type": "array"
        }
    }],
    "type": "struct"
}
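Note that the inferred schema describes the whole document, so the per-table struct sits inside the element type of the "table1" array. A minimal sketch of pulling it out of the exported JSON with plain dictionary access follows; the helper name element_struct is hypothetical, and the schema string is a trimmed copy of the output above.

```python
import json

# Trimmed copy of the exported schema (only two of the five inner fields kept).
inferred = json.loads("""
{
    "fields": [{
        "metadata": {},
        "name": "table1",
        "nullable": true,
        "type": {
            "containsNull": true,
            "elementType": {
                "fields": [
                    {"metadata": {}, "name": "dept", "nullable": true, "type": "string"},
                    {"metadata": {}, "name": "marks", "nullable": true,
                     "type": {"containsNull": true, "elementType": "long", "type": "array"}}
                ],
                "type": "struct"
            },
            "type": "array"
        }
    }],
    "type": "struct"
}
""")


def element_struct(schema, column):
    """Return the element type of the array column `column` in a struct schema dict."""
    for field in schema["fields"]:
        if field["name"] == column:
            col_type = field["type"]
            if isinstance(col_type, dict) and col_type.get("type") == "array":
                return col_type["elementType"]
            raise ValueError(f"column {column!r} is not an array")
    raise KeyError(column)


table1_struct = element_struct(inferred, "table1")
print([f["name"] for f in table1_struct["fields"]])
```

The extracted dictionary has the same shape as a top-level struct, so it can be stored under the "table1" key of a schema file.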

Alternatively, you could use df.schema.simpleString(), which returns a more compact schema format:

struct<table1:array<struct<dept:string,first_name:string,last_name:string,marks:array<bigint>,subjects:array<string>>>>

Finally, you can store the schema above in a file and load it later with:

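import json
from pyspark.sql.types import StructType
# schema_json is the JSON string produced by df.schema.json(), e.g. read back from a file
new_schema = StructType.fromJson(json.loads(schema_json))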

This is the same approach you already use. Note that the process described above can also be applied dynamically to any JSON data.
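Since the comments mention keeping several schemas in one JSON file, here is a minimal sketch of that round trip using only the standard library. The function names (save_schemas, load_schema_json), the file name schemas.json, and the shortened field list are assumptions made for illustration.

```python
import json
import os
import tempfile

# One entry per table, each stored in the layout produced by df.schema.json().
schemas = {
    "table1": {
        "type": "struct",
        "fields": [
            {"metadata": {}, "name": "first_name", "nullable": True, "type": "string"},
            {"metadata": {}, "name": "marks", "nullable": True,
             "type": {"type": "array", "elementType": "long", "containsNull": True}},
        ],
    }
}


def save_schemas(path, schemas):
    """Write all table schemas to a single JSON file."""
    with open(path, "w") as f:
        json.dump(schemas, f, indent=2)


def load_schema_json(path, table):
    """Read the schema file and return the dict for one table."""
    with open(path) as f:
        return json.load(f)[table]


# Hypothetical file location; a temporary directory keeps the example self-contained.
schema_path = os.path.join(tempfile.mkdtemp(), "schemas.json")
save_schemas(schema_path, schemas)
table1_json = load_schema_json(schema_path, "table1")

# With PySpark installed, the dict can then be turned into a schema object:
#   from pyspark.sql.types import StructType
#   source_schema = StructType.fromJson(table1_json)
```

Because json.load already returns a dictionary, there is no need to call json.loads again before passing the result to StructType.fromJson.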
