
I have a PySpark DataFrame in which one of the columns is in the format below:

[{key1: value1},{key2:value2}, {key3:value3}, {key4:value4}]

Let's call it ColumnY, as below:

ColumnY
[{key1: value1},{key2:value2}, {key3:value3}, {key4:value4}]

I would like to convert it into columns of the DataFrame, where the column name is keyX and its contents are valueX, for X in [1, 4], as below:

key1   key2   key3   key4
value1 value2 value3 value4

I have tried some solutions, but they didn't work. Please share any ideas or solutions you may have. Thank you in advance.
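For anyone trying to reproduce this, a minimal DataFrame with this layout can be built as below (this assumes ColumnY is stored as a plain string, since the bracketed text has no quotes and is not valid JSON):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Single-row sample matching the format described above;
# ColumnY is assumed to be a plain string column.
df = spark.createDataFrame(
    [("[{key1: value1},{key2:value2}, {key3:value3}, {key4:value4}]",)],
    ["ColumnY"],
)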

1 Answer

That is very badly formatted JSON without any quotes, but you can still parse it by brute force:

import pyspark.sql.functions as F

# Split on commas, strip the brackets/braces/spaces from each piece,
# and parse each remaining "keyN:valueN" fragment into a one-entry map.
df2 = df.selectExpr("""
    explode(
        transform(
            split(ColumnY, ','), 
            x -> str_to_map(regexp_replace(x, '[\\\\[\\\\{ \\\\]\\\\}]', ''), ' ', ':')
        )
    ) as col
""").select(F.explode('col')).groupBy().pivot('key').agg(F.first('value'))
# explode('col') turns each map into (key, value) rows; pivoting on 'key'
# then yields one column per keyN with valueN as its content.

df2.show()
+------+------+------+------+
|  key1|  key2|  key3|  key4|
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
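Note that groupBy().pivot('key') with no grouping columns collapses everything into a single output row, which is fine here because there is only one input row. For the multi-row case, one sketch (not part of the original answer) is to tag each row with a synthetic id via monotonically_increasing_id before exploding, so the pivot keeps one output row per input row:

import pyspark.sql.functions as F

# Tag each row with a synthetic id so the pivot keeps one output row
# per input row instead of collapsing the whole DataFrame.
df3 = (
    df.withColumn("id", F.monotonically_increasing_id())
      .selectExpr("id", """
          explode(
              transform(
                  split(ColumnY, ','),
                  x -> str_to_map(regexp_replace(x, '[\\\\[\\\\{ \\\\]\\\\}]', ''), ' ', ':')
              )
          ) as col
      """)
      .select("id", F.explode("col"))
      .groupBy("id").pivot("key").agg(F.first("value"))
      .drop("id")
)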