2

I'm working with a column in a pandas dataframe that I would like to flatten into columns. The issue is that some columns are single dictionaries and others are lists of dictionaries.

So for instance the array in a cell in the column may look like this:

{'a': 'x', 'b': 'y', 'c': 'z'}

or like:

[{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]

I've been playing around with the standard json_normalize function but the output can't seem to handle there being two different types of array:

json_flattened = pd.io.json.json_normalize(df.json_cell.apply(json.loads))

This gives me AttributeError: 'list' object has no attribute 'items'

What I'm hoping to get out the other end is a dataframe that will end up with as many columns as there are repeated elements in the array.

.-------.---.---.---.-----.-----.------.
| index | a | b | c | a_1 | b_1 | c_1  |
:-------+---+---+---+-----+-----+------:
|     0 | x | y | z | NaN | NaN | NaN  |
:-------+---+---+---+-----+-----+------:
|     1 | x | y | z | d   | e   | f    |
'-------'---'---'---'-----'-----'------'

1 Answer 1

1

Because of the inconsistent format in the column you have no choice but to loop through the column and inspect each cell.

import pandas as pd
import numpy as np
from pandas.io.json import json_normalize


data = pd.DataFrame({'json':[{'a': 'x', 'b': 'y', 'c': 'z'},[{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]]})

dflist = []
for index, row in data.iterrows():
    norm = json_normalize(data['json'][index])
    dflist.append(norm)

grouped = pd.concat(dflist, keys = np.arange(len(dflist)))

df2 = grouped.reset_index()

df2 = df2.pivot('level_0','level_1')

df2.index.name = None
df2.columns.names = [None, None]
df2.columns = df2.columns.map('{0[0]}_{0[1]}'.format) 

#output

    a_0  a_1    b_0  b_1    c_0  c_1
0   x    NaN    y    NaN    z    NaN
1   x    d      y    e      z    f

We loop through the column, reading each into row into its own df. Next we concat them together but assign each keys to keep them grouped. Next we pivot on the keys and the index. Finally we manipulate the multilevel columns generated from the pivot to get what we need.

Sign up to request clarification or add additional context in comments.

1 Comment

Hey thanks a lot. This worked with one small change. The array is a string so had to add json.loads in. So I did j = json.loads(data['json'][index]) at the start of the for loop then loaded j into json_normalize

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.