
This is a follow-up question to an earlier post. @abiratis, thanks for your answer; we are trying to implement the same in our Glue jobs. The only change is that we don't have a static schema defined, so we have created a new column colSchema to hold the schema of each entry of the some-array attribute. It looks like this:

+------------------------+-----------------------------------------------------------------------------------------------------------------------+
|some-array              |colSchema                                                                                                              |
+------------------------+-----------------------------------------------------------------------------------------------------------------------+
|[{f1a, f2a}, {f1b, f2b}]|ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)|
+------------------------+-----------------------------------------------------------------------------------------------------------------------+

But while parsing the column with from_json, I'm getting an error.

The conversion is done like this:

final_df.select(from_json(col('some-array'), 'ArrayType(StructType(List(StructField(array-field-1,StringType,true),
StructField(array-field-2,StringType,true))),true)', {'allowUnquotedFieldNames': True}).alias('json1')).show(3, False)

The error is:

AnalysisException: Cannot parse the schema in JSON format: Unrecognized token 'ArrayType': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (String)"ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)"; line: 1, column: 10]
Failed fallback parsing: Cannot parse the data type: 
mismatched input 'StructType' expecting INTEGER_VALUE(line 1, pos 10)

== SQL ==
ArrayType(StructType(List(StructField(array-field-1,StringType,true),StructField(array-field-2,StringType,true))),true)
----------^^^

Any help would be highly appreciated.

1 Answer

From the from_json documentation:

schema: DataType or str a StructType or ArrayType of StructType to use when parsing the json column.

Changed in version 2.3: the DDL-formatted string is also supported for schema.

The first parameter should be a JSON-like column, which you have correct. The second parameter is either a DataType or a str formatted as a DDL string. This you got wrong: you are passing the textual representation of a DataType as a plain string, which isn't valid.

For your example, I think the correct definition would be something like the following:

'ARRAY<STRUCT<`array-field-1`: STRING, `array-field-2`: STRING>>'

Therefore,

final_df.select(from_json(col('some-array'), 'ARRAY<STRUCT<`array-field-1`: STRING, `array-field-2`: STRING>>'))

should work.


3 Comments

Thanks @Ismor, I have tested this and it works, but I have a subsequent issue: how do we get the DDL for a single column in PySpark, or how can we leverage the schema obtained from df.schema['some-array'].dataType? I even tried getting the schema from the schema_of_json() API, but even that isn't working. Please let me know if you come across any such solution. The issue is that getting the DDL for a single column is not working; I may have to do some hacks to get just the DDL for some-array.
@user1119283 um... I think you can't do it. Spark needs to know the schema before execution. There are some workarounds like inferSchema, but those work only when reading files. One reason this can't work is that if two rows have different schemas, the resulting dataframe would be inconsistent. Maybe using the RDD API and reading everything as a bytestring could help. The workflow would be: read as a bytestring RDD >> separate into multiple RDDs, each with the same schema >> convert each RDD into a different dataframe. In any case, I think you can't avoid some non-trivial data manipulation.
Thanks @Ismor for this flow, will try it out.
