
I have a pandas column containing nested JSON strings, and I'd like to flatten the data into multiple pandas columns. The data looks like this:

{
 'A': '123',
 'B': '2019-08-26',
 'C': [
       {
        'a': 'stop',
        'b': 'A+'
       },
       {
        'a': 'go',
        'b': 'C+'
       }
      ],
 'D': [],
 'E': [
       {
        'a': 'Don',
        'b': 1
       },
       {
        'b': 12
       }
      ],
}

For each cell in the pandas column, I'd like to parse this string and create multiple columns. The expected output looks something like this:

| A   | B          | C.a  | C.b | D.a | D.b | E.a | E.b |
|-----|------------|------|-----|-----|-----|-----|-----|
| 123 | 2019-08-26 | stop | A+  | NaN | NaN | Don | 1   |
| 123 | 2019-08-26 | go   | C+  | NaN | NaN | Don | 1   |
| 123 | 2019-08-26 | stop | A+  | NaN | NaN | NaN | 12  |
| 123 | 2019-08-26 | go   | C+  | NaN | NaN | NaN | 12  |

I tried using json_normalize, but it returns an error. Please help me :(

2 Answers


Use pd.json_normalize with df.explode and pd.concat:

In [308]: x = pd.json_normalize(j).explode('C').explode('E')
In [310]: r = pd.concat([x.drop(['C', 'E'], axis=1).reset_index(drop=True), pd.json_normalize(x.C.tolist()), pd.json_normalize(x.E.tolist())], axis=1)

In [316]: C_cols = [f'C.{i}' for i in pd.json_normalize(x.C.tolist()).columns]
In [317]: E_cols = [f'E.{i}' for i in pd.json_normalize(x.E.tolist()).columns]

In [323]: r.columns = [*x.drop(['C', 'E'], axis=1).columns, *C_cols, *E_cols]

In [324]: r
Out[324]: 
     A           B   D   C.a C.b  E.a  E.b
0  123  2019-08-26  []  stop  A+  Don    1
1  123  2019-08-26  []  stop  A+  NaN   12
2  123  2019-08-26  []    go  C+  Don    1
3  123  2019-08-26  []    go  C+  NaN   12
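For anyone who wants to paste and run this, here is a minimal self-contained sketch of the same explode + json_normalize idea. It assumes `j` is the question's sample record already parsed into a dict, and it swaps the manual column renaming for `add_prefix`, which is a substitution of mine rather than part of the session above.

```python
import pandas as pd

# Sample record from the question (assumed to be already parsed into a dict).
j = {
    "A": "123",
    "B": "2019-08-26",
    "C": [{"a": "stop", "b": "A+"}, {"a": "go", "b": "C+"}],
    "D": [],
    "E": [{"a": "Don", "b": 1}, {"b": 12}],
}

# One row per (C item, E item) pair; reset the index so the pieces line up in concat.
x = pd.json_normalize(j).explode("C").explode("E").reset_index(drop=True)

# Expand each column of dicts into its own frame and prefix the new column names.
c = pd.json_normalize(x["C"].tolist()).add_prefix("C.")
e = pd.json_normalize(x["E"].tolist()).add_prefix("E.")

# Put the scalar columns and the expanded columns side by side.
r = pd.concat([x.drop(columns=["C", "E"]), c, e], axis=1)
print(r)
```

Passing `axis=1` and `columns=` by keyword matters on recent pandas, where the positional form `drop([...], 1)` is no longer accepted.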

Similar to @Mayank Porwal's answer, first use pd.json_normalize + df.explode. Then use the str.get method to pull the values out of the dictionaries in columns ['C','D','E']:

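import numpy as np
import pandas as pd

df = pd.json_normalize(json_data).explode('C').explode('E')
for col in ['C','D','E']:
    for i in ['a','b']:
        # look up key i in each cell of the column
        df[col+'.'+i] = df[col].str.get(i)
# assign the result back instead of calling replace in place on a column selection
df['E.a'] = df['E.a'].replace({None: np.nan})
df = df.drop(['C','E','D'], axis=1).sort_values(by='E.b')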

Output:

     A           B   C.a C.b  D.a  D.b  E.a  E.b
0  123  2019-08-26  stop  A+  NaN  NaN  Don    1
0  123  2019-08-26    go  C+  NaN  NaN  Don    1
0  123  2019-08-26  stop  A+  NaN  NaN  NaN   12
0  123  2019-08-26    go  C+  NaN  NaN  NaN   12
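The question mentions that each cell holds a JSON string, while the snippets here start from an already-parsed object. Below is a rough sketch of the full path from strings to flat columns. The column name `raw`, the helper `pick`, and the sample frame are all hypothetical, and it assumes the strings are strict JSON with double quotes (for Python-style single-quoted text like the sample in the question, `ast.literal_eval` may be needed in place of `json.loads`).

```python
import json
import pandas as pd

# Hypothetical input: one JSON string per cell in a column named "raw".
df = pd.DataFrame({"raw": [
    '{"A": "123", "B": "2019-08-26", '
    '"C": [{"a": "stop", "b": "A+"}, {"a": "go", "b": "C+"}], '
    '"D": [], '
    '"E": [{"a": "Don", "b": 1}, {"b": 12}]}',
]})

# Parse every cell, then let json_normalize build one row per record.
records = [json.loads(s) for s in df["raw"]]
flat = pd.json_normalize(records)

# Explode each list column; an empty list becomes a single missing value.
for col in ["C", "D", "E"]:
    flat = flat.explode(col)
flat = flat.reset_index(drop=True)

def pick(value, key):
    """Return value[key] for dict cells, and None for anything else."""
    return value.get(key) if isinstance(value, dict) else None

# Pull the keys of interest out of each exploded column.
for col in ["C", "D", "E"]:
    for key in ["a", "b"]:
        flat[f"{col}.{key}"] = flat[col].map(lambda v, k=key: pick(v, k))

flat = flat.drop(columns=["C", "D", "E"])
print(flat)
```

Exploding `D` as well keeps the loop uniform, and the explicit `isinstance` check means the missing values in `D` simply come out as empty `D.a` and `D.b` columns.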
