Is there a way to create JSON from a Spark DataFrame that omits fields whose value is null?

Suppose I have this DataFrame:

+-------+----------------+
|   name|       hit_songs|
+-------+----------------+
|beatles|[help, hey jude]|
|  romeo|      [eres mia]|
| juliet|            null|
+-------+----------------+

I want to convert it into JSON like this:

[{
  "name": "beatles",
  "hit_songs": ["help", "hey jude"]
},
{
  "name": "romeo",
  "hit_songs": ["eres mia"]
},
{
  "name": "juliet"
}
]

I don't want the hit_songs field in a JSON object when its value is null.

1 Answer

Use the to_json function for this case; it omits null fields from the JSON it generates.


from pyspark.sql.functions import collect_list, lit, struct, to_json

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
# Sample data: juliet has a real null (None) in hit_songs.
df = spark.createDataFrame(
    [("beatles", ["help", "hey jude"]), ("romeo", ["eres mia"]), ("juliet", None)],
    ["name", "hit_songs"],
)

# Option 1: serialize each row with to_json, then collect the strings into one array.
# groupBy(lit(1)) puts every row in a single group; drop("1") removes that helper key.
df.groupBy(lit(1)).\
    agg(collect_list(to_json(struct('name', 'hit_songs'))).alias("json")).\
    drop("1").\
    show(10, False)
#+------------------------------------------------------------------------------------------------------------------+
#|json                                                                                                              |
#+------------------------------------------------------------------------------------------------------------------+
#|[{"name":"beatles","hit_songs":["help","hey jude"]}, {"name":"romeo","hit_songs":["eres mia"]}, {"name":"juliet"}]|
#+------------------------------------------------------------------------------------------------------------------+

# Option 2: collect the structs first, then serialize with DataFrame.toJSON().
df.groupBy(lit(1)).\
    agg(collect_list(struct('name', 'hit_songs')).alias("json")).\
    drop("1").\
    toJSON().\
    collect()
#['{"json":[{"name":"beatles","hit_songs":["help","hey jude"]},{"name":"romeo","hit_songs":["eres mia"]},{"name":"juliet"}]}']

4 Comments

I am sorry, this did not work, as I am doing a repartition afterwards. My team and I have moved to a different approach. Can you help me with that other solution? Your previous solution worked, so I can rely on you for this too: stackoverflow.com/questions/61838445/…
@ShrutiGusain, repartition doesn't cause any issues, and I'm not sure what error you are getting with the to_json function.
When I convert my DataFrame with toJSON, it shows the value as { name: "juliet", hit_songs: null }, which causes a null exception in our application's API.
The to_json function doesn't preserve nulls. If your DataFrame holds the string "null", convert it to a real null first, then convert to a JSON object. In my example hit_songs is null for juliet, and hit_songs is not part of the juliet JSON object.
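
The distinction in the last comment (a real null versus the string "null") can be illustrated without a cluster. The following is a plain-Python sketch that only mimics the semantics: it first turns the string "null" into a real None and then omits None-valued keys when serializing. The helper names `normalize_null` and `to_json_without_nulls` are hypothetical, and this is not how Spark implements to_json.

```python
import json


def normalize_null(value):
    """Turn the literal string "null" into a real None (hypothetical helper)."""
    return None if value == "null" else value


def to_json_without_nulls(record):
    """Serialize a dict to compact JSON, omitting keys whose value is None."""
    cleaned = {key: normalize_null(value) for key, value in record.items()}
    kept = {key: value for key, value in cleaned.items() if value is not None}
    return json.dumps(kept, separators=(",", ":"))


rows = [
    {"name": "beatles", "hit_songs": ["help", "hey jude"]},
    {"name": "romeo", "hit_songs": ["eres mia"]},
    {"name": "juliet", "hit_songs": "null"},  # the string "null", not a real null
]

json_rows = [to_json_without_nulls(row) for row in rows]
print(json_rows[2])  # {"name":"juliet"}
```

Without the `normalize_null` step, the juliet record would keep a `hit_songs` key holding the string "null", which is probably the situation the comment is warning about.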
