
I'm using Spark 2.4.3 and Scala 2.11.

Below is the JSON string currently held in a DataFrame column. I'm trying to store the schema of this JSON string in another column using the schema_of_json function, but it throws the error below. How can I resolve this?

{
  "company": {
    "companyId": "123",
    "companyName": "ABC"
  },
  "customer": {
    "customerDetails": {
      "customerId": "CUST-100",
      "customerName": "CUST-AAA",
      "status": "ACTIVE",
      "phone": {
        "phoneDetails": {
          "home": {
            "phoneno": "666-777-9999"
          },
          "mobile": {
            "phoneno": "333-444-5555"
          }
        }
      }
    },
    "address": {
      "loc": "NORTH",
      "adressDetails": [
        {
          "street": "BBB",
          "city": "YYYYY",
          "province": "AB",
          "country": "US"
        },
        {
          "street": "UUU",
          "city": "GGGGG",
          "province": "NB",
          "country": "US"
        }
      ]
    }
  }
}

Code:

val df = spark.read.textFile("./src/main/resources/json/company.txt")
df.printSchema()
df.show()

root
 |-- value: string (nullable = true)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                                                                                                                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"company":{"companyId":"123","companyName":"ABC"},"customer":{"customerDetails":{"customerId":"CUST-100","customerName":"CUST-AAA","status":"ACTIVE","phone":{"phoneDetails":{"home":{"phoneno":"666-777-9999"},"mobile":{"phoneno":"333-444-5555"}}}},"address":{"loc":"NORTH","adressDetails":[{"street":"BBB","city":"YYYYY","province":"AB","country":"US"},{"street":"UUU","city":"GGGGG","province":"NB","country":"US"}]}}}|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


df.withColumn("jsonSchema",schema_of_json(col("value")))

Error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'schemaofjson(`value`)' due to data type mismatch: The input json should be a string literal and not null; however, got `value`.;;
'Project [value#0, schemaofjson(value#0) AS jsonSchema#10]
+- Project [value#0]
   +- Relation[value#0] text

3 Answers


The workaround I found was to pass the column's value to the schema_of_json function as a string literal, as shown below.

df.withColumn("jsonSchema",schema_of_json(df.select(col("value")).first.getString(0)))

Courtesy:

Implicit schema discovery on a JSON-formatted Spark DataFrame column
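
Note that .first materializes a single row on the driver, so the derived schema reflects only that first row's JSON. A minimal end-to-end sketch of the same workaround, assuming the df from the question:

import org.apache.spark.sql.functions.{col, schema_of_json}

// schema_of_json needs a literal, not a per-row column reference, so pull
// one sample JSON string onto the driver first.
val sample: String = df.select(col("value")).first.getString(0)

// Every row gets the same schema string, derived from that single sample.
val withSchema = df.withColumn("jsonSchema", schema_of_json(sample))
withSchema.select("jsonSchema").show(1, truncate = false)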


1 Comment

For PySpark it would be df.withColumn("jsonSchema", schema_of_json(df.select(col("value")).first()[0]))
1

Since SPARK-24709 was introduced, schema_of_json accepts only literal strings. You can extract the schema of a string column in DDL format by calling:

import spark.implicits._  // required for .as[String]

spark.read
  .json(df.select("value").as[String])
  .schema
  .toDDL
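
If the end goal is to parse the column rather than just print its schema, the inferred StructType can also be passed straight to from_json. A sketch under the same assumptions (toDDL is only needed when you want the textual form):

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

// Infer one schema across all rows, then parse each row's JSON with it.
val inferredSchema = spark.read.json(df.select("value").as[String]).schema

val parsed = df.withColumn("parsed", from_json(col("value"), inferredSchema))
parsed.select("parsed.customer.customerDetails.customerId").show(false)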

1 Comment

Thanks. In my case, NOT every row has a JSON string with the same schema. How do I handle that with this approach?
0

If one is looking for a PySpark answer:

import json

import pyspark.sql.functions as F
import pyspark.sql.types as T

def process(json_content):
    # Return the top-level keys of the JSON string, or [] for null input.
    if json_content is None:
        return []
    try:
        # Parse the JSON and extract the top-level keys only
        keys = json.loads(json_content).keys()
        return list(keys)
    except Exception as e:
        # Stringify the error so it fits the ArrayType(StringType()) result
        return [str(e)]

udf_function = F.udf(process, T.ArrayType(T.StringType()))
my_df = my_df.withColumn("schema", udf_function(F.col("json_raw")))
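
A hypothetical usage sketch (the data contents, my_df, and json_raw names here are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical single-column DataFrame of raw JSON strings.
data = [('{"company": {"companyId": "123"}, "customer": {}}',), (None,)]
my_df = spark.createDataFrame(data, ["json_raw"])

my_df.withColumn("schema", udf_function(F.col("json_raw"))).show(truncate=False)
# The schema column holds the top-level keys, e.g. [company, customer].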

