Convert json dictionary to dataframe in Python

Question

My API gives me a json file as output with the following structure:

{
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "PCJeremy",
                    "tags": {
                        "host": "001"
                    },
                    "columns": [
                        "time",
                        "memory"
                    ],
                    "values": [
                        [
                            "2021-03-20T23:00:00Z",
                            1049911288
                        ],
                        [
                            "2021-03-21T00:00:00Z",
                            1057692712
                        ]
                    ]
                },
                {
                    "name": "PCJohnny",
                    "tags": {
                        "host": "002"
                    },
                    "columns": [
                        "time",
                        "memory"
                    ],
                    "values": [
                        [
                            "2021-03-20T23:00:00Z",
                            407896064
                        ],
                        [
                            "2021-03-21T00:00:00Z",
                            406847488
                        ]
                    ]
                }
            ]
        }
    ]
}

I want to transform this output into a pandas DataFrame so I can create some reports from it. I tried using the pd.DataFrame.from_dict method:

import json
import pandas as pd

with open(fn) as f:  # fn is the path to the JSON file
    data = json.load(f)
print(pd.DataFrame.from_dict(data))

But as a result, I just get one column and one row containing all the data:

results 0 {'statement_id': 0, 'series': [{'name': 'Jerem...

The structure is quite hard for me to understand, as I am not a professional. I would like to get a DataFrame with four columns (name, host, time and memory) and one row for every entry in the values lists of the JSON file. Example:

name      host  time                    memory
PCJeremy  001   "2021-03-20T23:00:00Z"  1049911288
PCJeremy  001   "2021-03-21T00:00:00Z"  1057692712

Is this in any way possible? Thanks a lot in advance!

Ynjxsjmh · Accepted Answer · 2021-04-12 09:31:36Z

First, extract the data you are interested in from the JSON:

extracted_data = []

for series in data['results'][0]['series']:
    d = {}
    d['name'] = series['name']
    d['host'] = series['tags']['host']
    d['time'] = [value[0] for value in series['values']]
    d['memory'] = [value[1] for value in series['values']]

    extracted_data.append(d)

df = pd.DataFrame(extracted_data)

# print(df)

       name host                                          time                    memory
0  PCJeremy  001  [2021-03-20T23:00:00Z, 2021-03-21T00:00:00Z]  [1049911288, 1057692712]
1  PCJohnny  002  [2021-03-20T23:00:00Z, 2021-03-21T00:00:00Z]    [407896064, 406847488]

Second, explode the list columns into rows:

df1 = pd.concat([df.explode('time')['time'], df.explode('memory')['memory']], axis=1)

df_ = df.drop(['time','memory'], axis=1).join(df1).reset_index(drop=True)

# print(df_)

       name host                  time      memory
0  PCJeremy  001  2021-03-20T23:00:00Z  1049911288
1  PCJeremy  001  2021-03-21T00:00:00Z  1057692712
2  PCJohnny  002  2021-03-20T23:00:00Z   407896064
3  PCJohnny  002  2021-03-21T00:00:00Z   406847488
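On pandas 1.3 or newer, DataFrame.explode also accepts a list of columns, so the concat/join step above can probably be collapsed into a single call. A minimal sketch, assuming the time and memory lists have the same length in every row (true for the sample data); the data literal simply reproduces the sample from the question:

```python
import pandas as pd

# Sample data copied from the question.
data = {"results": [{"statement_id": 0, "series": [
    {"name": "PCJeremy", "tags": {"host": "001"},
     "columns": ["time", "memory"],
     "values": [["2021-03-20T23:00:00Z", 1049911288],
                ["2021-03-21T00:00:00Z", 1057692712]]},
    {"name": "PCJohnny", "tags": {"host": "002"},
     "columns": ["time", "memory"],
     "values": [["2021-03-20T23:00:00Z", 407896064],
                ["2021-03-21T00:00:00Z", 406847488]]},
]}]}

# One row per series, with list-valued time and memory columns.
rows = [
    {
        "name": s["name"],
        "host": s["tags"]["host"],
        "time": [v[0] for v in s["values"]],
        "memory": [v[1] for v in s["values"]],
    }
    for s in data["results"][0]["series"]
]

# Requires pandas >= 1.3: explode both list columns in lockstep.
# ignore_index=True renumbers the rows 0..n-1.
exploded = pd.DataFrame(rows).explode(["time", "memory"], ignore_index=True)
print(exploded)
```

On older pandas versions the list argument is not supported, so the two-step approach above remains the fallback.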

By constructing the dicts carefully, it can be done without exploding:

extracted_data = []

for series in data['results'][0]['series']:
    d = {}
    d['name'] = series['name']
    d['host'] = series['tags']['host']

    for values in series['values']:
        d_ = d.copy()
        for column, value in zip(series['columns'], values):
            d_[column] = value

        extracted_data.append(d_)

df = pd.DataFrame(extracted_data)
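If the API can return more than one entry under results, or more tags than host, the same loop can be wrapped in a small helper. This is only a sketch: the function name series_to_frame is hypothetical, and it assumes every series has name, columns and values keys as in the sample.

```python
import pandas as pd


def series_to_frame(data):
    """Flatten every series of every result into one row per values entry."""
    rows = []
    for result in data.get("results", []):
        for series in result.get("series", []):
            # Fields shared by all rows of this series: the name plus all tags.
            base = {"name": series["name"], **series.get("tags", {})}
            for values in series["values"]:
                # Pair each column name with the value at the same position.
                rows.append({**base, **dict(zip(series["columns"], values))})
    return pd.DataFrame(rows)


# Sample data copied from the question.
sample = {"results": [{"statement_id": 0, "series": [
    {"name": "PCJeremy", "tags": {"host": "001"},
     "columns": ["time", "memory"],
     "values": [["2021-03-20T23:00:00Z", 1049911288],
                ["2021-03-21T00:00:00Z", 1057692712]]},
    {"name": "PCJohnny", "tags": {"host": "002"},
     "columns": ["time", "memory"],
     "values": [["2021-03-20T23:00:00Z", 407896064],
                ["2021-03-21T00:00:00Z", 406847488]]},
]}]}

df = series_to_frame(sample)
print(df)
```

Because the column names come from series['columns'], the helper should keep working if the query later returns additional fields besides memory.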

sammywemmy · Accepted Answer · 2021-04-12 08:44:54Z

You could use jmespath to extract the data; it is quite a handy tool for nested JSON like this. You can read the docs for more details; I will summarize the basics: to access a key, use a dot; to access values in a list, use []. Combining the two lets you traverse the JSON paths. There are more tools; these basics should get you started.

Assume your JSON is loaded into a data variable:

data
 
{'results': [{'statement_id': 0,
   'series': [{'name': 'PCJeremy',
     'tags': {'host': '001'},
     'columns': ['time', 'memory'],
     'values': [['2021-03-20T23:00:00Z', 1049911288],
      ['2021-03-21T00:00:00Z', 1057692712]]},
    {'name': 'PCJohnny',
     'tags': {'host': '002'},
     'columns': ['time', 'memory'],
     'values': [['2021-03-20T23:00:00Z', 407896064],
      ['2021-03-21T00:00:00Z', 406847488]]}]}]}

Let's create an expression to parse the json, and get the specific values:

expression = """{name: results[].series[].name, 
                 host: results[].series[].tags.host, 
                 time: results[].series[].values[*][0], 
                 memory: results[].series[].values[*][-1]}
             """

Apply the expression to the JSON data:

expression = jmespath.compile(expression).search(data)

expression
{'name': ['PCJeremy', 'PCJohnny'],
 'host': ['001', '002'],
 'time': [['2021-03-20T23:00:00Z', '2021-03-21T00:00:00Z'],
  ['2021-03-20T23:00:00Z', '2021-03-21T00:00:00Z']],
 'memory': [[1049911288, 1057692712], [407896064, 406847488]]}

Note that time and memory are nested lists (one inner list per series) and match the values in data.
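To make clearer what the expression selects, here is a rough plain-Python equivalent using list comprehensions. It is only a sketch for this sample: jmespath treats missing keys more leniently than direct indexing does, and the variable names series and extracted are illustrative.

```python
# Sample data copied from the question.
data = {"results": [{"statement_id": 0, "series": [
    {"name": "PCJeremy", "tags": {"host": "001"},
     "columns": ["time", "memory"],
     "values": [["2021-03-20T23:00:00Z", 1049911288],
                ["2021-03-21T00:00:00Z", 1057692712]]},
    {"name": "PCJohnny", "tags": {"host": "002"},
     "columns": ["time", "memory"],
     "values": [["2021-03-20T23:00:00Z", 407896064],
                ["2021-03-21T00:00:00Z", 406847488]]},
]}]}

# results[].series[] : flatten all series of all results into one list.
series = [s for r in data["results"] for s in r["series"]]

extracted = {
    "name": [s["name"] for s in series],                        # .name
    "host": [s["tags"]["host"] for s in series],                # .tags.host
    "time": [[v[0] for v in s["values"]] for s in series],      # .values[*][0]
    "memory": [[v[-1] for v in s["values"]] for s in series],   # .values[*][-1]
}
print(extracted)
```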

Create dataframe and explode relevant columns:

pd.DataFrame(expression).apply(pd.Series.explode)

       name host                  time      memory
0  PCJeremy  001  2021-03-20T23:00:00Z  1049911288
0  PCJeremy  001  2021-03-21T00:00:00Z  1057692712
1  PCJohnny  002  2021-03-20T23:00:00Z   407896064
1  PCJohnny  002  2021-03-21T00:00:00Z   406847488
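The exploded frame keeps the original row labels (0, 0, 1, 1), and exploded columns usually end up with object dtype. For reporting it may help to renumber the rows and convert the types. A small sketch, starting from a frame built by hand to look like the output above (the name tidy is illustrative):

```python
import pandas as pd

# Frame shaped like the exploded output above, including the repeated
# index labels and object dtype that explode typically leaves behind.
df = pd.DataFrame(
    {
        "name": ["PCJeremy", "PCJeremy", "PCJohnny", "PCJohnny"],
        "host": ["001", "001", "002", "002"],
        "time": ["2021-03-20T23:00:00Z", "2021-03-21T00:00:00Z",
                 "2021-03-20T23:00:00Z", "2021-03-21T00:00:00Z"],
        "memory": [1049911288, 1057692712, 407896064, 406847488],
    },
    index=[0, 0, 1, 1],
    dtype=object,
)

tidy = df.reset_index(drop=True)                  # rows renumbered 0..3
tidy["time"] = pd.to_datetime(tidy["time"])       # ISO 8601 strings -> UTC timestamps
tidy["memory"] = tidy["memory"].astype("int64")   # object -> integer
print(tidy.dtypes)
```

With time as real timestamps, grouping or resampling by hour or day for the reports becomes straightforward.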
