
My use case is to read an existing JSON Schema file, parse it, and build a Spark DataFrame schema from it. To start off, I followed the steps mentioned here.

Steps followed:

1. Imported the library from Maven
2. Restarted the cluster
3. Created a sample JSON Schema file
4. Used this code to read the sample schema file:
val schema = SchemaConverter.convert("/FileStore/tables/schemaFile.json")

When I run the above command I get the error: not found: value SchemaConverter

To ensure that the library is being loaded, I reattached the notebook to the cluster after restarting the cluster.

In addition to the method above, I tried the following as well, replacing jsonString with the actual JSON Schema.

import org.apache.spark.sql.types.{DataType, StructType}
val newSchema = DataType.fromJson(jsonString).asInstanceOf[StructType]
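As an aside, DataType.fromJson expects Spark's own schema-JSON representation (the format produced by StructType.json), not a json-schema draft-04 document, so feeding it the sample schema below will fail. A minimal sketch of the format it does accept (field names here are illustrative):

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// DataType.fromJson parses Spark's internal schema-JSON format
// (the output of StructType.json), not a json-schema document.
val sparkSchemaJson = """
{
  "type": "struct",
  "fields": [
    {"name": "id",   "type": "long",   "nullable": false, "metadata": {}},
    {"name": "name", "type": "string", "nullable": true,  "metadata": {}}
  ]
}
"""

val newSchema = DataType.fromJson(sparkSchemaJson).asInstanceOf[StructType]
// newSchema.fieldNames => Array(id, name)
```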

The sample schema I've been playing with has 300+ fields; for simplicity, I used the sample schema from here.

1 Answer

SchemaConverter works for me. I used spark-shell to test, installing the required package with spark-shell --packages "org.zalando:spark-json-schema_2.11:0.6.1".

Note that the class has to be imported first, with import org.zalando.spark.jsonschema.SchemaConverter; a missing import would produce exactly the not found: value SchemaConverter error you are seeing.

scala> val schema = SchemaConverter.convertContent("""
 | {
 |   "$schema": "http://json-schema.org/draft-04/schema#",
 |   "title": "Product",
 |   "description": "A product from Acme's catalog",
 |   "type": "object",
 |   "properties": {
 |     "id": {
 |       "description": "The unique identifier for a product",
 |       "type": "integer"
 |     },
 |     "name": {
 |       "description": "Name of the product",
 |       "type": "string"
 |     },
 |     "price": {
 |       "type": "number",
 |       "minimum": 0,
 |       "exclusiveMinimum": true
 |     }
 |   },
 |   "required": [
 |     "id",
 |     "name",
 |     "price"
 |   ]
 | }
 | """)

schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false), StructField(name,StringType,false), StructField(price,DoubleType,false))

scala> schema.toString
res1: String = StructType(StructField(id,LongType,false), StructField(name,StringType,false), StructField(price,DoubleType,false))

Do you want to explicitly specify the schema while reading JSON data? If you read JSON data using Spark, it automatically infers the schema from the data, e.g.

val df = spark.read.json("json-file")
df.printSchema() // Gives schema of json data
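If you do want to enforce a schema instead of relying on inference, DataFrameReader.schema accepts the StructType directly. A minimal sketch (the field names, including the string-or-array "tags" field, are illustrative, not taken from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("schema-demo")
  .getOrCreate()
import spark.implicits._

// Declare the schema up front instead of letting Spark infer it.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("tags", ArrayType(StringType), nullable = true)
))

// json() also accepts a Dataset[String], handy for a self-contained demo;
// in practice you would pass a path such as spark.read.schema(schema).json("json-file").
val data = Seq("""{"id":"1","tags":["a","b"]}""").toDS()
val df = spark.read.schema(schema).json(data)
df.printSchema() // tags is declared array<string> regardless of the data seen
```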

2 Comments

The reason for wanting to explicitly specify the schema is that I have one field that comes in as a string when only one value is present and as an array when more than one value is present. As a workaround, I was hoping that providing the schema might help read the field consistently. I tried this in Databricks; not sure if the Databricks setup is preventing the library from being called. Will try it in spark-shell and see if this works. Cheers.
Can someone point us to a similar example in PySpark as well? What is the way to convert such a JSON Schema to a Spark schema there?
