PySpark: JSON array of objects into columns

Question

I'm ingesting JSON files into Spark and have come across an object like the one below in the nested JSON from the file:

"data": {
  "key1": "v1",
  "key2": [
     {"nk1": "nv1"},
     {"nk2": "nv2"},
     {"nk3": "nv3"}
  ]
}

After Spark reads it, the object changes into the format below:

"data": {
  "key1": "v1",
  "key2": [
     {"nk1": "nv1", "nk2": null, "nk3": null},
     {"nk1": null, "nk2": "nv2", "nk3": null},
     {"nk1": null, "nk2": null, "nk3": "nv3"}
  ]
}

I need them as columns in the Spark DataFrame:

"key1"	"nk1"	"nk2"	"nk3"
"v1"	"nv1"	"nv2"	"nv3"

Please help me with any solution for this. I'm thinking of converting this to a string and using regex. Is there a better solution?

Where did "key2" go in your dataframe? Have you tried using the explode() function on the array?
– OneCricketeer, Commented Apr 5, 2021 at 14:53

mck · Accepted Answer · 2021-04-05 14:55:15Z


You can explode the array and pivot key2:

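import pyspark.sql.functions as F

# Note: map_keys/map_values require each key2 element to be a MapType column.
df2 = df.select(
    F.col('data.key1').alias('key1'),
    # One output row per element of the key2 array
    F.explode('data.key2').alias('key2')
).select(
    'key1',
    # Each element holds a single entry, so take its only key and value
    F.map_keys('key2')[0].alias('key'),
    F.map_values('key2')[0].alias('val')
# Turn each distinct key into its own column, one row per key1
).groupBy('key1').pivot('key').agg(F.first('val'))

df2.show()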
+----+---+---+---+
|key1|nk1|nk2|nk3|
+----+---+---+---+
|  v1|nv1|nv2|nv3|
+----+---+---+---+
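
To see what the explode / pivot / first chain is doing, here is a rough plain-Python sketch of the same reshaping, with no Spark involved. It assumes the null-padded shape shown in the question, where every array element carries all three keys and most are null. The function name `explode_and_pivot` and the `records` list are hypothetical and only there for illustration.

```python
from collections import defaultdict


def explode_and_pivot(records):
    """Flatten each record's data.key2 array into one row per key1 value."""
    rows = defaultdict(dict)
    for record in records:
        data = record["data"]
        cols = rows[data["key1"]]  # "groupBy": one output row per key1
        # "explode": visit each element of the key2 array on its own
        for element in data["key2"]:
            # "pivot": every non-null key in the element becomes a column
            for name, value in element.items():
                if value is not None:
                    # "first": keep the first non-null value seen per column
                    cols.setdefault(name, value)
    return [{"key1": key1, **cols} for key1, cols in rows.items()]


# Hypothetical input mirroring the shape from the question
records = [{"data": {"key1": "v1", "key2": [
    {"nk1": "nv1", "nk2": None, "nk3": None},
    {"nk1": None, "nk2": "nv2", "nk3": None},
    {"nk1": None, "nk2": None, "nk3": "nv3"},
]}}]

result = explode_and_pivot(records)
print(result)
# [{'key1': 'v1', 'nk1': 'nv1', 'nk2': 'nv2', 'nk3': 'nv3'}]
```

This only mirrors the logic on a small in-memory list. On real data the Spark version above is the one to use, since it runs distributed.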

