parquet_path = '/tmp/test-parquet'

The content of t2.json is:

{
   "id": "OK_good2", 
   "some-array": [
      {"array-field-1":"f1a","array-field-2":"f2a"},
      {"array-field-1":"f1b","array-field-2":"f2b"}
   ]
}

Creating a DataFrame from t2.json:

from pyspark.sql.functions import col

df = spark.read.json('t2.json')
df = df.withColumn('some-array', col('some-array').cast('string'))
df.write.mode('overwrite').parquet(parquet_path)

The schema string was formed from:

schema = dict(df.dtypes)['some-array']  # output: array<struct<array-field-1:string,array-field-2:string>>

Reading back from parquet_path:

final_df = spark.read.parquet(parquet_path)
final_df.select('some-array').show(3, False)
+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+

While trying to parse the same column back using from_json with that schema, it's failing. I'm not able to figure out why. Please provide some help.

from pyspark.sql.functions import from_json, col

final_df.select(from_json(col('some-array'),
                          'array<struct<array-field-1:string,array-field-2:string>>',
                          {'allowUnquotedFieldNames': True}).alias('json1')).show(2, False)

This throws the error:

AnalysisException: Cannot parse the schema in JSON format: Unrecognized token 'array': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"array<struct<array-field-1:string,array-field-2:string>>"; line: 1, column: 6]
Failed fallback parsing: Cannot parse the data type: 

In case anyone is interested, I'm trying to follow this post.

1 Answer

There are a few issues in your code.

  1. You are losing the key data from the JSON in this line:

df = df.withColumn('some-array', col('some-array').cast('string'))

So, when you read back the parquet file that was saved, you only see this:

+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+

and NOT this (which is what you expect to have):

+-------------------------------------------------------+
|some-array                                             |
+-------------------------------------------------------+
|[{"array-field-1": "f1a", "array-field-2": "f2a"}, ...]|
+-------------------------------------------------------+

The proper way to convert the array-of-struct column to a JSON string is to use to_json:

df = df.withColumn('some-array', to_json('some-array'))

Please check df.show() or df.take(1) to see the difference.

  2. Your DDL string needs the column names wrapped in backticks (`) because they contain hyphens. (ref: https://vincent.doba.fr/posts/20211004_spark_data_description_language_for_defining_spark_schema/)

'array<struct<`array-field-1`:string,`array-field-2`:string>>'

4 Comments

Thanks @Emma for pointing out the details. This works for me.
@Emma It works, thanks for giving all the details. :-)
@Emma any pointers as to why we don't get the JSON keys even without casting the some-array column to string on df.show()?

df = spark.read.json('t2.json')
df.show(5, False)

+--------+------------------------+
|id      |some-array              |
+--------+------------------------+
|OK_good2|[{f1a, f2a}, {f1b, f2b}]|
+--------+------------------------+
Try df.printSchema(). If some-array is an array of structs, that is how most tools display it (values without key names). printSchema() should show the key names in this case.
