
How to transform nested JSON with dynamic keys and arrays to a pandas DataFrame?

  • Static keys: data, label, units, date, val, num (can be hardcoded)
  • Dynamic keys/arrays: data_1_a, data_1000_xyz , name_1a , name_1b, name_10000_xyz, A, B (cannot be hardcoded as they are up to 10000 names / data sub categories)
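
A quick illustration of why the dynamic keys are the hard part: `pd.json_normalize` flattens nested dicts, but the dynamic keys end up embedded in the dotted column names rather than as cell values (a minimal sketch with a shortened stand-in, not the full input):

```python
import pandas as pd

# Shortened stand-in for the input: one dynamic branch only.
sample = {
    "id": 1,
    "data": {"data_1_a": {"name_1a": {"units": {"A": [{"date": 2020, "val": 1}]}}}},
}

# json_normalize produces a single row; every nested dict path becomes
# one column whose name joins the keys with '.', so the dynamic keys
# ('data_1_a', 'name_1a', 'A') land inside the column name itself.
flat = pd.json_normalize(sample)
print(list(flat.columns))
# ['id', 'data.data_1_a.name_1a.units.A']
```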

For the solutions I already tried, please see the useful links below.

input json:

{
    "id": 1,
    "data": {
        "data_1_a": {
            "name_1a": {
                "label": "label_1",
                "units": {
                       "A": [{"date": 2020, "val": 1}]}}
        },
        "data_1000_xyz": {
            "name_1b": {
                "label": "null",
                "units": {
                        "B": [{"date": 2019, "val": 2},
                          {"date": 2020, "val": 3}]},
            },
            "name_10000_xyz": {
                "label": "null",
                "units": {
                       "A": [
                        {"date": 2018, "val": 4, "num": "str"},
                        {"date": 2019, "val": 5},
                        {"date": 2020, "val": 6, "num": "str"},
                    ]
                },
            },
        },
    },
}
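
Note that the snippet above contains trailing commas, so it is not strict JSON; it is, however, a valid Python literal. A minimal sketch of loading it as such (the `text` string here is a shortened stand-in for the full input; with real JSON you would use `json.loads` instead):

```python
import ast

# Trailing commas are rejected by json.loads but accepted in a Python
# literal, so ast.literal_eval parses this shortened stand-in fine.
text = '{"id": 1, "data": {"data_1_a": {"name_1a": {"label": "label_1", "units": {"A": [{"date": 2020, "val": 1},]}}},},}'
json_1 = ast.literal_eval(text)
print(json_1["id"])  # 1
```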

required output df:

+---+--------------+----------------+---------+-------+------+-----+------+
|id |level_1       |level_2         |level_3  |level_4| date | val | num  |
+---+--------------+----------------+---------+-------+------+-----+------+
|1  |data_1_a      | name_1a        | units   | A     | 2020 |  1  | null |
|1  |data_1000_xyz | name_1b        | units   | B     | 2019 |  2  | null |
|1  |data_1000_xyz | name_1b        | units   | B     | 2020 |  3  | null |
|1  |data_1000_xyz | name_10000_xyz | units   | A     | 2018 |  4  | str  |
|1  |data_1000_xyz | name_10000_xyz | units   | A     | 2019 |  5  | null |
|1  |data_1000_xyz | name_10000_xyz | units   | A     | 2020 |  6  | str  |
+---+--------------+----------------+---------+-------+------+-----+------+

Useful links:


1 Answer


Python Pandas solution:

    import pandas as pd

    # json_1 is the input dict shown in the question

    # 1) flatten the nested json into a single row whose column names
    #    are the dotted key paths, e.g. 'data.data_1_a.name_1a.units.A'
    df = pd.json_normalize(json_1)
    df_dic = df.to_dict('records')

    # 2) split each dotted column name into its path levels; every
    #    non-id column becomes one row {0: 'data', 1: ..., 4: ..., 'value': ...}
    data = []
    for row in df_dic:
        for key, value in row.items():
            if key == 'id':
                row_id = value
            else:
                parts = key.split('.')
                k = {i: s for i, s in enumerate(parts)}
                k.update({'value': value, 'id': row_id})
                data.append(k)

    df = (pd.DataFrame(data)[['id', 1, 2, 3, 4, 'value']]
          .rename(columns={1: 'level_1', 2: 'level_2', 3: 'level_3', 4: 'level_4'}))
    # 'label' paths have only three levels, so their level_4 is NaN; drop them
    df = df.loc[~df['level_4'].isnull()]

    # 3) explode the lists of records so each {'date': ..., 'val': ...}
    #    dict gets its own row
    dfe = df.explode('value', ignore_index=True)

    # 4) pop the value column, build a new dataframe from its dicts, then
    #    join that frame back onto the exploded frame
    output_df = dfe.join(pd.DataFrame([*dfe.pop('value')], index=dfe.index))
    
    
        id  level_1         level_2         level_3 level_4 date    val num
    0   1   data_1_a        name_1a         units   A      2020     1   NaN
    1   1   data_1000_xyz   name_1b         units   B      2019     2   NaN
    2   1   data_1000_xyz   name_1b         units   B      2020     3   NaN
    3   1   data_1000_xyz   name_10000_xyz  units   A      2018     4   str
    4   1   data_1000_xyz   name_10000_xyz  units   A      2019     5   NaN
    5   1   data_1000_xyz   name_10000_xyz  units   A      2020     6   str
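
An alternative sketch that skips the flatten-then-split round trip: walk the nested dict directly and emit one row per leaf record. This assumes the fixed depth shown in the question (`data` → dynamic → dynamic → `units` → dynamic → list of records); `iter_rows` and the stand-in `doc` are illustrative names, not part of the answer above.

```python
import pandas as pd

def iter_rows(doc):
    """Yield one flat row per leaf record, assuming the fixed nesting
    data -> <dynamic> -> <dynamic> -> 'units' -> <dynamic> -> [records]."""
    for level_1, names in doc["data"].items():
        for level_2, body in names.items():
            for level_4, records in body["units"].items():
                for rec in records:
                    yield {"id": doc["id"], "level_1": level_1,
                           "level_2": level_2, "level_3": "units",
                           "level_4": level_4, **rec}

# Small stand-in for the question's json_1:
doc = {"id": 1, "data": {"data_1_a": {"name_1a": {
    "label": "label_1",
    "units": {"A": [{"date": 2020, "val": 1}]}}}}}

df = pd.DataFrame(iter_rows(doc))
# columns: id, level_1, level_2, level_3, level_4, date, val
```

This trades the generic depth handling of `json_normalize` for a single pass that never materializes the intermediate dotted-column frame, which may matter with ~10000 dynamic keys.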