
I am trying to parse nested JSON from a sample file. Below is the printed schema:

 |-- batters: struct (nullable = true)
 |    |-- batter: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- ppu: double (nullable = true)
 |-- topping: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- type: string (nullable = true)

I am trying to explode batters and topping separately and then combine them:

from pyspark.sql.functions import explode

df_batter = df_json.select("batters.*")
df_explode1 = df_batter.withColumn("batter", explode("batter")).select("batter.*")

df_explode2 = df_json.withColumn("topping", explode("topping")) \
                     .select("id", "type", "name", "ppu", "topping.*")

I am unable to combine the two data frames.

I also tried a single query:

exploded1 = df_json.withColumn("batter", df_batter.withColumn("batter", explode("batter"))) \
                   .withColumn("topping", explode("topping")) \
                   .select("id", "type", "name", "ppu", "topping.*", "batter.*")

But I am getting an error. Kindly help me solve it. Thanks.

  • You can't explode two arrays like that; you need to zip them using arrays_zip and then explode them together. Commented Apr 1, 2020 at 15:25

1 Answer


You basically have to explode the arrays together using arrays_zip, which returns a merged array of structs. Try this; I haven't tested it, but it should work:

from pyspark.sql import functions as F

df_json.select("id", "type", "name", "ppu", "topping", "batters.*")\
       .withColumn("zipped", F.explode(F.arrays_zip("batter", "topping")))\
       .select("id", "type", "name", "ppu", "zipped.*").show()

You could also explode the arrays one at a time:

from pyspark.sql import functions as F

df1 = df_json.select("id", "type", "name", "ppu", "topping", "batters.*")\
             .withColumn("batter", F.explode("batter"))\
             .select("id", "type", "name", "ppu", "topping", "batter")
df1.withColumn("topping", F.explode("topping"))\
   .select("id", "type", "name", "ppu", "topping.*", "batter.*")