Flatten JSON in Pandas DataFrame with Inconsistent Format

Question

I'm working with a column in a pandas dataframe that I would like to flatten into columns. The issue is that some cells hold a single dictionary and others hold a list of dictionaries.

So for instance the value in a cell in the column may look like this:

{'a': 'x', 'b': 'y', 'c': 'z'}

or like:

[{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]

I've been playing around with the standard json_normalize function, but it can't seem to handle the cells coming in two different shapes:

json_flattened = pd.io.json.json_normalize(df.json_cell.apply(json.loads))

This gives me AttributeError: 'list' object has no attribute 'items'

What I'm hoping to get out the other end is a dataframe that will end up with as many columns as there are repeated elements in the array.

.-------.---.---.---.-----.-----.------.
| index | a | b | c | a_1 | b_1 | c_1  |
:-------+---+---+---+-----+-----+------:
|     0 | x | y | z | NaN | NaN | NaN  |
:-------+---+---+---+-----+-----+------:
|     1 | x | y | z | d   | e   | f    |
'-------'---'---'---'-----'-----'------'

Ben Pap · Accepted Answer · 2019-06-15 00:44:10Z

Because of the inconsistent format in the column you have no choice but to loop through the column and inspect each cell.

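import pandas as pd
import numpy as np


data = pd.DataFrame({'json':[{'a': 'x', 'b': 'y', 'c': 'z'},[{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]]})

# Normalize each cell on its own; pd.json_normalize accepts a dict or a list of dicts.
dflist = []
for cell in data['json']:
    norm = pd.json_normalize(cell)
    dflist.append(norm)

# The keys record which source row each normalized frame came from.
grouped = pd.concat(dflist, keys = np.arange(len(dflist)))

# The unnamed index levels become the columns 'level_0' (source row) and 'level_1' (position in the list).
df2 = grouped.reset_index()

df2 = df2.pivot(index='level_0', columns='level_1')

# Flatten the (key, position) column pairs into names like 'a_0'.
df2.index.name = None
df2.columns.names = [None, None]
df2.columns = df2.columns.map('{0[0]}_{0[1]}'.format)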

#output

    a_0  a_1    b_0  b_1    c_0  c_1
0   x    NaN    y    NaN    z    NaN
1   x    d      y    e      z    f

We loop through the column, reading each row into its own df. Next we concat them together, assigning each one a key to keep them grouped. Then we pivot on the keys and the inner index. Finally, we flatten the multilevel columns generated by the pivot to get the column names we need.
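For comparison, the same wide layout can be sketched without the concat and pivot steps by building one flat dict per row. This is an alternative technique, not the method above, and it assumes each dict is flat (no nested values). The helper name `flatten_cell` is made up for illustration.

```python
import pandas as pd


def flatten_cell(cell):
    """Turn one cell (a dict or a list of dicts) into a single flat dict.

    Each key is suffixed with the position of the dict it came from,
    e.g. {'a_0': 'x', 'a_1': 'd'}. Hypothetical helper, not a pandas API.
    """
    records = [cell] if isinstance(cell, dict) else list(cell)
    return {
        f"{key}_{position}": value
        for position, record in enumerate(records)
        for key, value in record.items()
    }


data = pd.DataFrame({
    'json': [
        {'a': 'x', 'b': 'y', 'c': 'z'},
        [{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}],
    ]
})

# One flat dict per row; keys missing from a row become NaN in the frame.
wide = pd.DataFrame([flatten_cell(cell) for cell in data['json']], index=data.index)

# Plain string sort groups the columns by key: a_0, a_1, b_0, b_1, ...
wide = wide[sorted(wide.columns)]
```

Note that the string sort would place a_10 before a_2, so a list with ten or more dicts would need a custom sort key.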

edited Jun 15, 2019 at 0:44

answered Jun 15, 2019 at 0:30


1 Comment

ndevito1 Over a year ago

Hey thanks a lot. This worked with one small change. The array is a string so had to add json.loads in. So I did j = json.loads(data['json'][index]) at the start of the for loop then loaded j into json_normalize
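The change described in this comment might look roughly like the sketch below. It assumes every cell is a valid JSON string; the sample strings are invented for illustration and are not the commenter's data.

```python
import json

import pandas as pd

# Hypothetical sample: each cell is a JSON string rather than a parsed object.
data = pd.DataFrame({
    'json': [
        '{"a": "x", "b": "y", "c": "z"}',
        '[{"a": "x", "b": "y", "c": "z"}, {"a": "d", "b": "e", "c": "f"}]',
    ]
})

dflist = []
for cell in data['json']:
    j = json.loads(cell)  # str -> dict or list of dicts
    dflist.append(pd.json_normalize(j))

# Same grouping step as in the answer: one key per source row.
grouped = pd.concat(dflist, keys=list(range(len(dflist))))
```

From here, the reset_index, pivot and column-renaming steps from the answer apply unchanged.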
