I'm looking for a generic solution to extract all the JSON fields as columns from a JSON string column.
df = spark.read.load(path)
df.show()
The files in 'path' are in Parquet format.
Sample data:
|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF","state":"CA"}}
Expected output:
|id | name | depts | sal | address_city | address_state
| 1 | "abc" | ["dep01", "dep02"] | null| null | null
| 2 | "xyz" | ["dep03"] | 100 | null | null
| 3 | "pqr" | ["dep02"] | null| "SF" | "CA"
I know that I can extract the columns by defining a StructType schema and passing it to the from_json function, but this approach requires defining the schema manually.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Manually defined schema for the JSON string column
val myStruct = StructType(
  Seq(
    StructField("name", StringType),
    StructField("depts", ArrayType(StringType)),
    StructField("sal", IntegerType)
  ))

// Parse the JSON string column into a struct using the predefined schema
val newDf = df.withColumn("json_data", from_json(col("json_data"), myStruct))
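For reference, once the JSON string is parsed into a struct, its top-level fields can be expanded into columns (a minimal usage sketch, assuming the parsed struct replaces the json_data column as above):

newDf.select(col("id"), col("json_data.*")).show()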
Is there a better way to flatten the JSON column without manually defining the schema? In the example above the JSON fields are visible, but in practice I can't traverse every row to discover all the fields.
So I'm looking for a solution that splits all the fields into columns without specifying their names or types.
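One partial direction I've considered is letting Spark infer the schema from the JSON strings themselves (a rough sketch, assuming Spark 2.2+; it adds an extra scan over the data, and nested objects such as address still come back as struct columns rather than flattened address_city / address_state columns):

import spark.implicits._
import org.apache.spark.sql.functions.{col, from_json}

// Infer a schema by reading the JSON strings as a JSON dataset (extra pass over the data)
val inferredSchema = spark.read.json(df.select("json_data").as[String]).schema

// Parse with the inferred schema and expand the top-level fields into columns
val flattened = df
  .withColumn("json_data", from_json(col("json_data"), inferredSchema))
  .select(col("id"), col("json_data.*"))

flattened.show()

But this still feels like a workaround, so a cleaner or more idiomatic approach would be appreciated.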