
I have JSON files with null arrays (array (nullable = true) and element: (containsNull = true)) that I'd like to convert to Parquet files. These null fields are dropped automatically, while all other columns are converted as expected. Is there a way to replace the null arrays with something else (e.g. ["-"])? I'm running my code in AWS Glue, but the replacement would take place in pure PySpark on DataFrames.

The data type must still be an array in the Parquet file (just not with null values).

json:

{
 "a":[],
 "b":[1,2],
 "c": "a string"
}

Expected output:

   a  |   b   |     c
["-"] | [1,2] | "a string"

Current script:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

inputDF = glueContext.create_dynamic_frame_from_options(connection_type="s3",
                                                        connection_options={
                                                             "paths": ["s3://my-bucket"],
                                                             "recurse": True
                                                           },
                                                        format="json")
df = inputDF.toDF()
# replace empty arrays here
dynamic = DynamicFrame.fromDF(df, glueContext, "dynamic")

output = glueContext.write_dynamic_frame.from_options(frame=dynamic,
                                                      connection_type="s3",
                                                      connection_options={
                                                           "path": "s3://my-bucket"
                                                           },
                                                      format="parquet")

1 Answer

I'm not sure which Spark version you are using; I checked with Spark 3.1.2 and 2.4.5, and empty array fields are not ignored in either.

You can use the line below to get the desired result:

from pyspark.sql.functions import array, col, lit, size, when

df.withColumn('a', when(size('a') == 0, array(lit('-'))).otherwise(col('a'))).show()

+---+------+--------+
|  a|     b|       c|
+---+------+--------+
|[-]|[1, 2]|a string|
+---+------+--------+  
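If you need this for every array-typed column rather than just a, here is a minimal sketch (not from the original answer) that walks the DataFrame schema and applies the same rewrite to each ArrayType column. It assumes the placeholder ["-"] is acceptable for each array column's element type.

from pyspark.sql.functions import array, col, lit, size, when
from pyspark.sql.types import ArrayType

# Apply the same replacement to every array-typed column.
# size() returns -1 for a null array and 0 for an empty one,
# so the <= 0 test covers both cases.
for field in df.schema.fields:
    if isinstance(field.dataType, ArrayType):
        df = df.withColumn(
            field.name,
            when(size(col(field.name)) <= 0, array(lit('-')))
            .otherwise(col(field.name))
        )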

4 Comments

That works perfectly, thanks! Is there a way of identifying all columns (struct, array, or string) that are NullFields?
Do you need the columns where every value is null, or those with one or more nulls?
Only columns that are all null.
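The thread ends there, but one common way to find columns that are entirely null is to count the non-null values per column in a single pass and keep the columns whose count is zero. A sketch, using the df from above; note that an empty array is not null, so this flags only columns where every value is literally null:

from pyspark.sql.functions import col, count, when

# Count non-null values in each column in one pass over the data.
non_null_counts = df.select(
    [count(when(col(c).isNotNull(), c)).alias(c) for c in df.columns]
).collect()[0].asDict()

# Columns with zero non-null values are entirely null.
all_null_columns = [c for c, n in non_null_counts.items() if n == 0]
print(all_null_columns)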
