Manipulating sample JSON lists using Python/Spark on Databricks

Question

I'm trying to parse the following sample JSON list:

FruitJson = [
 ('{"num":100, "fruit" : ["apple", "peach", "grape", "melon"]}',), 
 ('{"num":101, "fruit" : ["melon", "apple", "mango", "banana"]}',),  
]

Ideal Output:

fruit	count
apple	2
melon	2
peach	1
grape	1
mango	1
banana	1

I managed to get the first row of the list into a DataFrame, but I'm unable to progress further from here:

dbutils.fs.put("/temp/test.json",'{"num":100, "fruit" : ["apple", "peach", "grape", "melon"]}'\
'{"num":101, "fruit" : ["melon", "apple", "mango", "banana"]}',True)
df = spark.read.option("multiline","true").json('/temp/test.json')
display(df)

Your advice is much appreciated.

Updated thread with what I've tried. Basically, I managed to upload only the first row into a JSON file and then used spark.read.option("multiline","true").json('/temp/test.json') to store the data in a DataFrame. I've been stuck here for a while. Any help is appreciated.
– Ibra22, Commented Aug 3, 2021 at 1:55

pltc · Accepted Answer · 2021-08-03 04:10:56Z

1

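First, your multiline option should be False (the default), not True. With multiline=False, Spark expects line-delimited JSON: one complete JSON object per line, each becoming one row. With multiline=True, the whole file is treated as a single JSON document, which would explain why only the first record was loaded. Note also that the two string literals in your dbutils.fs.put call are concatenated without a newline between them, so the objects need to be separated by "\n" for line-by-line reading to work. See the Spark JSON data source documentation.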
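
To make the one-object-per-line layout concrete, here is a minimal plain-Python sketch (no Spark involved) that builds such a file body from the two sample records and parses it back line by line. The names `records`, `json_lines`, and `parsed` are illustrative and not from the question.

```python
import json

# The two sample records from the question, as Python dicts.
records = [
    {"num": 100, "fruit": ["apple", "peach", "grape", "melon"]},
    {"num": 101, "fruit": ["melon", "apple", "mango", "banana"]},
]

# Line-delimited JSON: one complete JSON object per line, joined by "\n".
json_lines = "\n".join(json.dumps(r) for r in records)
print(json_lines)

# Each line parses on its own, which is how a line-oriented reader sees the file.
parsed = [json.loads(line) for line in json_lines.splitlines()]
print(len(parsed))
```

Assuming the rest of the question's setup is unchanged, passing a string shaped like `json_lines` to dbutils.fs.put and reading the file with spark.read.json (multiline left at its default) should give one row per record.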

Second, what you're trying to achieve is a simple aggregation, but you will need to explode the list to multiple rows first.

from pyspark.sql import functions as F

(df
    .withColumn('fruit', F.explode('fruit'))
    .groupBy('fruit')
    .agg(
        F.count('*').alias('cnt')
    )
    .show()
)

# +------+---+
# | fruit|cnt|
# +------+---+
# | grape|  1|
# | apple|  2|
# | mango|  1|
# |banana|  1|
# | melon|  2|
# | peach|  1|
# +------+---+
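
As a cross-check on the counts above, the same explode-then-count logic can be sketched in plain Python without Spark. This is only an illustration of what the two steps do, not how Spark executes them; `exploded` and `counts` are hypothetical names.

```python
import json
from collections import Counter

# Sample data exactly as given in the question: one-element tuples of JSON text.
FruitJson = [
    ('{"num":100, "fruit" : ["apple", "peach", "grape", "melon"]}',),
    ('{"num":101, "fruit" : ["melon", "apple", "mango", "banana"]}',),
]

# "Explode": flatten every record's fruit list into one flat list of names.
exploded = [fruit for (raw,) in FruitJson for fruit in json.loads(raw)["fruit"]]

# "groupBy + count": tally how often each name occurs.
counts = Counter(exploded)

# Print the tallies, highest count first.
for fruit, cnt in counts.most_common():
    print(fruit, cnt)
```

The Spark result above is unordered; if the descending order from the question's ideal output matters, adding .orderBy(F.desc('cnt')) before .show() is one way to get it.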


1 Comment

Ibra22 Over a year ago

Thank you! Really appreciate all your help here. Explode was the missing link on my end.
