Flatten nested JSON columns in Pandas

Question

I'm trying to find an easy way to flatten a nested JSON present in a dataframe column. The dataframe column looks as follows:

stock   Name            Annual
x       Tesla           {"0": {"date": "2020","dateFormatted": "2020-12-31","sharesMln": "3856.2405","shares": 3856240500},"1": {"date": "2019","dateFormatted": "2019-12-31","sharesMln": "3856.2405","shares": 3856240500}}
y       Google          {"0": {"date": "2020","dateFormatted": "2020-12-31","sharesMln": "2526.4506","shares": 2526450600},"1": {"date": "2019","dateFormatted": "2019-12-31","sharesMln": "2526.4506","shares": 2526450600},"2": {"date": "2018","dateFormatted": "2018-12-31","sharesMln": "2578.0992","shares": 2578099200}}
z       Big Apple       {}

How do I convert the above dataframe to:

stock   Name        date    dateFormatted   sharesMln   shares
x       Tesla       2020    2020-12-31      3856.2405   3856240500
x       Tesla       2019    2019-12-31      3856.2405   3856240500
y       Google      2020    2020-12-31      2526.4506   2526450600
y       Google      2019    2019-12-31      2526.4506   2526450600
y       Google      2018    2018-12-31      2578.0992   2578099200
z       Big Apple   None    None            None        None

I've tried using pd.json_normalize(dataframe['Annual'], max_level=1), but I'm struggling to get the desired result shown above.

Any pointers will be appreciated.

Your Annual column is not valid JSON (the outer { } are missing): "0": {"date": ...} should be {"0": {"date": ...}}. Am I right?
– Corralien, Commented Aug 4, 2021 at 8:41
Yes, you are correct. I've updated the question to make it a valid dict. Thanks for your response!
– roller, Commented Aug 4, 2021 at 10:42
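
If the Annual column arrives as JSON text rather than as Python dicts (the situation the comments above are discussing), each cell probably needs to be parsed before anything else can be done with it. Below is a minimal sketch of that pre-step; the frame `raw`, its two sample rows, and the helper `is_valid_json` are hypothetical and only illustrate the idea.

```python
import json

import pandas as pd

# Hypothetical frame whose Annual column holds JSON text, not dicts.
raw = pd.DataFrame({
    "stock": ["x", "z"],
    "Annual": [
        '{"0": {"date": "2020", "shares": 3856240500}}',
        "{}",
    ],
})


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# A value missing its outer braces is rejected by the parser.
missing_braces_ok = is_valid_json('"0": {"date": "2020"}')

# Parse every cell so the column holds real dicts afterwards.
raw["Annual"] = raw["Annual"].apply(json.loads)
```

After this step each cell is a dict, so dict methods such as .values() can be used on it.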

Corralien · Accepted Answer · 2021-08-04 12:29:25Z

Extract the values of each outer dict (the inner per-year dicts) and use explode to turn each one into its own row; explode repeats the original index label for every new row. Next, expand the inner dicts into columns. Finally, join the result back to the original dataframe on the index, which works because of those repeated index labels.

>>> df

  stock       Name                                             Annual
0     x      Tesla  {'0': {'date': '2020', 'dateFormatted': '2020-...
1     y     Google  {'0': {'date': '2020', 'dateFormatted': '2020-...
2     z  Big Apple                                                 {}

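import pandas as pd

# Take the inner dicts, give each one its own row, then expand their keys into columns
data = df['Annual'].apply(lambda x: x.values()) \
                   .explode() \
                   .apply(pd.Series)

# explode() repeats the original index, so join() lines the rows up with stock/Name
df = df.join(data).drop(columns='Annual')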

Output result:

>>> df

  stock       Name  date dateFormatted  sharesMln        shares
0     x      Tesla  2020    2020-12-31  3856.2405  3.856240e+09
0     x      Tesla  2019    2019-12-31  3856.2405  3.856240e+09
1     y     Google  2020    2020-12-31  2526.4506  2.526451e+09
1     y     Google  2019    2019-12-31  2526.4506  2.526451e+09
1     y     Google  2018    2018-12-31  2578.0992  2.578099e+09
2     z  Big Apple   NaN           NaN        NaN           NaN
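
For anyone who wants to step through the intermediate results, here is a self-contained sketch that rebuilds the sample frame from the question and keeps each stage in its own variable. It is a variation, not the exact code above: it swaps apply(pd.Series) for the plain DataFrame constructor on a list of records, on the assumption that this is easier to inspect. The names `inner`, `exploded`, `records`, and `flat` are only illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the question (cells are dicts, not JSON strings).
df = pd.DataFrame({
    "stock": ["x", "y", "z"],
    "Name": ["Tesla", "Google", "Big Apple"],
    "Annual": [
        {"0": {"date": "2020", "dateFormatted": "2020-12-31",
               "sharesMln": "3856.2405", "shares": 3856240500},
         "1": {"date": "2019", "dateFormatted": "2019-12-31",
               "sharesMln": "3856.2405", "shares": 3856240500}},
        {"0": {"date": "2020", "dateFormatted": "2020-12-31",
               "sharesMln": "2526.4506", "shares": 2526450600},
         "1": {"date": "2019", "dateFormatted": "2019-12-31",
               "sharesMln": "2526.4506", "shares": 2526450600},
         "2": {"date": "2018", "dateFormatted": "2018-12-31",
               "sharesMln": "2578.0992", "shares": 2578099200}},
        {},
    ],
})

# Step 1: keep only the inner dicts, as one list per row.
inner = df["Annual"].apply(lambda d: list(d.values()))

# Step 2: one row per list element; the index label is repeated,
# and an empty list becomes a single NaN row.
exploded = inner.explode()

# Step 3: expand the dicts into columns; the NaN placeholder becomes
# an empty record, i.e. a row of missing values.
records = [r if isinstance(r, dict) else {} for r in exploded]
data = pd.DataFrame(records, index=exploded.index)

# Step 4: join on the (repeated) index to bring stock/Name back.
flat = df.drop(columns="Annual").join(data)
```

Printing `exploded` before step 3 shows the repeated index labels (0, 0, 1, 1, 1, 2), which is what lets the final join match every per-year row to its stock.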

edited Aug 4, 2021 at 12:29

answered Aug 4, 2021 at 8:51

Corralien


2 Comments

roller Over a year ago

Thanks for your great response. I had a Jupyter notebook issue earlier, but after resetting that, it worked like a charm. However, it would be nice if you could elaborate on what explode() does. Also, does dropna() drop all rows and columns with any NaN values, and can we restrict it to drop only columns where every value is NaN?

Corralien Over a year ago

@roller I updated my answer to give some details. I removed dropna because it's no longer needed; I only used it before you fixed your column.
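
On the two questions raised in the comments above, a small sketch on made-up data may help. The Series `s` and the frame `frame` (including its `unused` column) are hypothetical. It shows explode() on a Series of lists, then the difference between dropna() with its defaults and dropna(axis=1, how="all").

```python
import numpy as np
import pandas as pd

# explode(): every list element gets its own row and the index label
# is repeated; an empty list turns into one NaN row.
s = pd.Series([[1, 2], [], [3]], index=["a", "b", "c"])
exploded = s.explode()

# Hypothetical frame with one column that is entirely NaN.
frame = pd.DataFrame({
    "date": ["2020", np.nan],
    "shares": [3856240500, np.nan],
    "unused": [np.nan, np.nan],
})

# Default dropna(): axis=0, how="any", so it drops ROWS that contain
# at least one NaN. Here every row has one, so nothing is left.
rows_with_any_nan_dropped = frame.dropna()

# axis=1 with how="all": drop only COLUMNS where every value is NaN.
all_nan_columns_dropped = frame.dropna(axis=1, how="all")
```

So dropna() on its own never removes columns; restricting it to all-NaN columns takes both axis=1 and how="all".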
