
I have JSON files with null arrays (array (nullable = true) and element: (containsNull = true)) that I'd like to convert to Parquet files. These null fields are dropped automatically, while all other columns are converted as expected. Is there a way to replace the null arrays with something else (e.g. ["-"])? I'm running my code in AWS Glue, but the replacement would take place in pure PySpark on DataFrames.

The data type must still be an array in the Parquet file (just not with null values).

json:

{
 "a":[],
 "b":[1,2],
 "c": "a string"
}

Expected output:

   a  |   b   |     c
["-"] | [1,2] | "a string"

Current script:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

inputDF = glueContext.create_dynamic_frame_from_options(connection_type="s3",
                                                        connection_options={
                                                             "paths": ["s3://my-bucket"],
                                                             "recurse": True
                                                           },
                                                        format="json")
df = inputDF.toDF()
# replace empty arrays here
dynamic = DynamicFrame.fromDF(df, glueContext, "dynamic")

output = glueContext.write_dynamic_frame.from_options(frame=dynamic,
                                                      connection_type="s3",
                                                      connection_options={
                                                           "path": "s3://my-bucket"
                                                           },
                                                      format="parquet")

1 Answer

I'm not sure which Spark version you are using; I checked with Spark 3.1.2 and 2.4.5, and empty array fields are not ignored in either.

You can use the line below to get the desired result:

from pyspark.sql.functions import array, col, lit, size, when

df.withColumn('a', when(size('a') == 0, array(lit('-'))).otherwise(col('a'))).show()

+---+------+--------+
|  a|     b|       c|
+---+------+--------+
|[-]|[1, 2]|a string|
+---+------+--------+  
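If you need this for every array-typed column rather than just a, here is a minimal sketch (not from the original answer) that walks the DataFrame schema and applies the same rewrite to each ArrayType column. It assumes the placeholder ["-"] is acceptable for each array column's element type.

from pyspark.sql.functions import array, col, lit, size, when
from pyspark.sql.types import ArrayType

# Apply the same replacement to every array-typed column.
# size() returns -1 for a null array and 0 for an empty one,
# so the <= 0 test covers both cases.
for field in df.schema.fields:
    if isinstance(field.dataType, ArrayType):
        df = df.withColumn(
            field.name,
            when(size(col(field.name)) <= 0, array(lit('-')))
            .otherwise(col(field.name))
        )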

4 Comments

That works perfectly, thanks! Is there a way of identifying all columns (struct, array, or string) that are NullFields?
Do you need the columns where every value is null, or those with one or more nulls?
Only columns that are all null.
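The thread ends there, but one common way to find columns that are entirely null is to count the non-null values per column in a single pass and keep the columns whose count is zero. A sketch, using the df from above; note that an empty array is not null, so this flags only columns where every value is literally null:

from pyspark.sql.functions import col, count, when

# Count non-null values in each column in one pass over the data.
non_null_counts = df.select(
    [count(when(col(c).isNotNull(), c)).alias(c) for c in df.columns]
).collect()[0].asDict()

# Columns with zero non-null values are entirely null.
all_null_columns = [c for c, n in non_null_counts.items() if n == 0]
print(all_null_columns)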
