How can I explode a nested JSON structure in Pandas?

Question

I'm working with a file containing one JSON object per line. Each line looks something like this:

{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}

The producer of the JSON chose an unnecessarily nested structure where a flat one would have been perfectly sufficient. I'd like to read the data into a Pandas DataFrame with a flat structure and the columns "a", "b", "c", "d", "e", and "index". The best I've come up with so far is to process the file twice in different ways:

import json

import pandas as pd

# First pass: let pandas parse the whole file
raw_json = pd.read_json('sample.json', lines=True)
raw_json.set_index('index', inplace=True)

# Second pass: normalize and pivot the nested list of each line
with open('sample.json') as f:
    lines = f.readlines()
    exploded_columns = pd.concat([
        pd.json_normalize(json.loads(l), 'unnecessaryList', 'index')
          .pivot(index='index', columns='colName', values='value')
        for l in lines
    ])

data = pd.merge(raw_json[['a', 'b']], exploded_columns, left_index=True, right_index=True)

Is there a way to avoid reading the data twice like this? Does Pandas offer some functionality that could avoid the concat/normalize/pivot/merge junk I came up with?

jezrael · Accepted Answer · 2017-12-23 10:17:17Z

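You can read the file once, move the flat columns into a MultiIndex, and build one Series per row from the nested list: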

df = pd.read_json('sample.json', lines=True)
# create a MultiIndex from the 3 flat columns and select unnecessaryList as a Series
s = df.set_index(['a', 'b', 'index'])['unnecessaryList']
print(s)
a  b   index      
3  10  -1417561653    [{'value': 12, 'colName': 'c'}, {'value': 792,...
       -1417561655    [{'value': 13, 'colName': 'c'}, {'value': 794,...
       -1417561658    [{'value': 14, 'colName': 'c'}, {'value': 795,...
Name: unnecessaryList, dtype: object

# create a Series (colName -> value) for each list of dicts, then concat and transpose
L = [pd.DataFrame(x).set_index('colName')['value'] for x in s]
df = (pd.concat(L, axis=1, keys=s.index)
       .T
       .reset_index(level=[0, 1])
       .rename(columns={'level_0': 'a', 'level_1': 'b'})
       .rename_axis(None, axis=1))
print(df)
             a   b   c    d    e
-1417561653  3  10  12  792  645
-1417561655  3  10  13  794  645
-1417561658  3  10  14  795  645
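
If you would rather keep json_normalize from the question, a possible variant is to call it once on all records instead of once per line, and then reshape with unstack. This is only a sketch and not part of the steps above: it assumes a pandas version that exposes pd.json_normalize at the top level (1.0 or later), and the inline lines list is a stand-in for the contents of sample.json.

```python
import json

import pandas as pd

# Stand-in for the lines of sample.json (assumption: same shape as the input data).
lines = [
    '{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}',
    '{"a":3,"b":10,"unnecessaryList":[{"value":13,"colName":"c"},{"value":794,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561655"}',
    '{"a":3,"b":10,"unnecessaryList":[{"value":14,"colName":"c"},{"value":795,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561658"}',
]
records = [json.loads(line) for line in lines]

# One row per nested dict; a, b and index are repeated on every row as meta columns.
long_df = pd.json_normalize(
    records, record_path="unnecessaryList", meta=["a", "b", "index"]
)

# Long to wide: each colName becomes its own column.
wide = (
    long_df.set_index(["index", "a", "b", "colName"])["value"]
    .unstack("colName")
    .reset_index(level=["a", "b"])
    .rename_axis(columns=None)
)
print(wide)
```

Here the "index" values stay strings, because json.loads does not convert them the way read_json may.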

Input data (JSON, one object per line):

{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}
{"a":3,"b":10,"unnecessaryList":[{"value":13,"colName":"c"},{"value":794,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561655"}
{"a":3,"b":10,"unnecessaryList":[{"value":14,"colName":"c"},{"value":795,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561658"}
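
For input shaped like this, another option is to flatten each record in plain Python before handing it to pandas, which also reads the data only once. The sketch below is an alternative I'm adding for illustration, not the method above: flatten_record is a hypothetical helper, and the io.StringIO object stands in for an open sample.json.

```python
import io
import json

import pandas as pd

# Stand-in for open('sample.json') so the sketch is self-contained.
sample_file = io.StringIO(
    '{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}\n'
    '{"a":3,"b":10,"unnecessaryList":[{"value":13,"colName":"c"},{"value":794,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561655"}\n'
    '{"a":3,"b":10,"unnecessaryList":[{"value":14,"colName":"c"},{"value":795,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561658"}\n'
)


def flatten_record(record, list_key="unnecessaryList"):
    """Hypothetical helper: copy the top-level scalars and add one key per colName."""
    flat = {key: value for key, value in record.items() if key != list_key}
    for item in record.get(list_key, []):
        flat[item["colName"]] = item["value"]
    return flat


# Parse and flatten line by line, skipping blank lines.
rows = [flatten_record(json.loads(line)) for line in sample_file if line.strip()]
flat_df = pd.DataFrame(rows).set_index("index")
print(flat_df)
```

Records that lack a given colName would simply get NaN in that column when the DataFrame is built.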
