parquet_path = '/tmp/test-parquet'

The content of t2.json is:

{
   "id": "OK_good2", 
   "some-array": [
      {"array-field-1":"f1a","array-field-2":"f2a"},
      {"array-field-1":"f1b","array-field-2":"f2b"}
   ]
}

Creating a DataFrame from t2.json:

from pyspark.sql.functions import col

df = spark.read.json('t2.json')
df = df.withColumn('some-array', col('some-array').cast('string'))
df.write.mode('overwrite').parquet(parquet_path)

The schema string was formed from:

schema = dict(df.dtypes)['some-array']  # output: array<struct<array-field-1:string,array-field-2:string>>

Reading back from parquet_path:

final_df = spark.read.parquet(parquet_path)
final_df.select('some-array').show(3, False)
+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+

While trying to parse the same column back using from_json with that schema, it's failing. I'm not able to figure out why. Please provide some help.

from pyspark.sql.functions import from_json, col

final_df.select(from_json(col('some-array'),
                          'array<struct<array-field-1:string,array-field-2:string>>',
                          {'allowUnquotedFieldNames': True}).alias('json1')).show(2, False)

This throws the error:

AnalysisException: Cannot parse the schema in JSON format: Unrecognized token 'array': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"array<struct<array-field-1:string,array-field-2:string>>"; line: 1, column: 6]
Failed fallback parsing: Cannot parse the data type: 

In case anyone is interested, I'm trying to follow this post.

1 Answer

There are a few issues in your code.

  1. You are losing the key data from the JSON in this line:

df = df.withColumn('some-array', col('some-array').cast('string'))

So, when you read back the parquet file that was saved, you only see this:

+------------------------+
|some-array              |
+------------------------+
|[{f1a, f2a}, {f1b, f2b}]|
+------------------------+

and NOT this (which is what you expect to have):

+-------------------------------------------------------+
|some-array                                             |
+-------------------------------------------------------+
|[{"array-field-1": "f1a", "array-field-2": "f2a"}, ...]|
+-------------------------------------------------------+

The proper way to convert the array-of-struct column to a JSON string is to use to_json:

df = df.withColumn('some-array', to_json('some-array'))

Please check df.show() or df.take(1) to see the difference.

  2. Your DDL string needs the column names wrapped in backticks (`) because they contain hyphens. (ref: https://vincent.doba.fr/posts/20211004_spark_data_description_language_for_defining_spark_schema/)

'array<struct<`array-field-1`:string,`array-field-2`:string>>'

4 Comments

Thanks @Emma for pointing out the details. This works for me.
@Emma It works, thanks for giving all the details. :-)
@Emma any pointers as to why we don't get the JSON keys even without casting the some-array column to string on df.show()?

df = spark.read.json('t2.json')
df.show(5, False)

+--------+------------------------+
|id      |some-array              |
+--------+------------------------+
|OK_good2|[{f1a, f2a}, {f1b, f2b}]|
+--------+------------------------+
Try df.printSchema(). If some-array is an array of structs, that is how most tools display it (values without key names). printSchema() should show the key names in this case.
