
I am trying to create a nested JSON from my Spark dataframe, which has data in the following structure. The code below creates a flat JSON with only top-level key/value pairs. Could you please help?

df.coalesce(1).write.mode('overwrite').format('json').save(data_output_file + "createjson.json")

Update 1: As per @MaxU's answer, I converted the Spark dataframe to pandas and used groupby. It puts the last two fields into a nested array. How could I first put the category and its count into a nested array, and then inside that array put the subcategory and its count?

Sample text data:

Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4

j = (data_pd.groupby(['Vendor_Name','count','Categories','Category_Count'], as_index=False)
             .apply(lambda x: x[['Subcategory','Subcategory_Count']].to_dict('records'))
             .reset_index()
             .rename(columns={0:'subcategories'})
             .to_json(orient='records'))

Expected output:

[{
        "vendor_name": "Vendor 1",
        "count": 10,
        "categories": [{
            "name": "Category 1",
            "count": 4,
            "subCategories": [{
                    "name": "Sub Category 1",
                    "count": 1
                },
                {
                    "name": "Sub Category 2",
                    "count": 2
                },
                {
                    "name": "Sub Category 3",
                    "count": 3
                },
                {
                    "name": "Sub Category 4",
                    "count": 4
                }
            ]
        }]
}]
1 Comment

@MaxU I have updated it. (Commented Nov 27, 2018 at 10:50)

2 Answers


You need to restructure the whole dataframe for that.

"subCategories" should be a struct type:

from pyspark.sql import functions as F

df = df.withColumn(
    "subCategories",
    F.struct(
        F.col("Subcategory").alias("name"),
        F.col("Subcategory_Count").alias("count")
    )
)

Then groupBy and use F.collect_list to create the array.

At the end, you need only one record per vendor in your dataframe to get the result you expect.
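
Putting the steps together, here is a minimal sketch, assuming the column names from the sample data above and reusing the question's data_output_file variable:

from pyspark.sql import functions as F

# Nest each subcategory row into a struct.
nested = df.withColumn(
    "subCategory",
    F.struct(
        F.col("Subcategory").alias("name"),
        F.col("Subcategory_Count").alias("count"),
    ),
)

# Collect subcategories per category, then wrap each category in a struct.
categories = (
    nested.groupBy("Vendor_Name", "count", "Categories", "Category_Count")
    .agg(F.collect_list("subCategory").alias("subCategories"))
    .withColumn(
        "category",
        F.struct(
            F.col("Categories").alias("name"),
            F.col("Category_Count").alias("count"),
            F.col("subCategories"),
        ),
    )
)

# Collect categories per vendor, leaving one record per vendor.
result = (
    categories.groupBy("Vendor_Name", "count")
    .agg(F.collect_list("category").alias("categories"))
    .withColumnRenamed("Vendor_Name", "vendor_name")
)

result.coalesce(1).write.mode("overwrite").json(data_output_file + "createjson.json")

Writing this result with the JSON writer then produces one object per vendor, with the nested categories and subCategories arrays as in the expected output.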


1 Comment

Hi @Steven, how can I extend this answer to a more nested JSON?

The easiest way to do this in Python/pandas would be with a series of nested generators using groupby, I think:

def split_df(df):
    for (vendor, count), df_vendor in df.groupby(["Vendor_Name", "count"]):
        yield {
            "vendor_name": vendor,
            "count": count,
            "categories": list(split_category(df_vendor))
        }

def split_category(df_vendor):
    for (category, count), df_category in df_vendor.groupby(
        ["Categories", "Category_Count"]
    ):
        yield {
            "name": category,
            "count": count,
            "subCategories": list(split_subcategory(df_category)),
        }

def split_subcategory(df_category):
    for row in df_category.itertuples():
        yield {"name": row.Subcategory, "count": row.Subcategory_Count}

list(split_df(df))
[
    {
        "vendor_name": "Vendor1",
        "count": 10,
        "categories": [
            {
                "name": "Category 1",
                "count": 4,
                "subCategories": [
                    {"name": "Sub Category 1", "count": 1},
                    {"name": "Sub Category 2", "count": 2},
                    {"name": "Sub Category 3", "count": 3},
                    {"name": "Sub Category 4", "count": 4},
                ],
            }
        ],
    }
]

To export this to JSON, you'll need a way to serialize the np.int64 values that pandas produces.
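
One way to handle that (a sketch, assuming the split_df generator above) is to pass a default converter to json.dumps:

import json
import numpy as np

# Convert NumPy integers to plain Python ints so json.dumps accepts them.
def np_encoder(obj):
    if isinstance(obj, np.integer):
        return int(obj)
    raise TypeError(f"{type(obj)} is not JSON serializable")

print(json.dumps(list(split_df(df)), indent=4, default=np_encoder))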

2 Comments

Wowwwwww. Thanks a lot
@ShankarPanda how can you accept this answer? It is not even in Spark...
