I'm trying to find an easy way to flatten nested JSON stored in a dataframe column. The dataframe looks as follows:

stock   Name            Annual
x       Tesla           {"0": {"date": "2020","dateFormatted": "2020-12-31","sharesMln": "3856.2405","shares": 3856240500},"1": {"date": "2019","dateFormatted": "2019-12-31","sharesMln": "3856.2405","shares": 3856240500}}
y       Google          {"0": {"date": "2020","dateFormatted": "2020-12-31","sharesMln": "2526.4506","shares": 2526450600},"1": {"date": "2019","dateFormatted": "2019-12-31","sharesMln": "2526.4506","shares": 2526450600},"2": {"date": "2018","dateFormatted": "2018-12-31","sharesMln": "2578.0992","shares": 2578099200}}
z       Big Apple       {}

How do I convert the above dataframe to:

Stock   Name        date    dateFormatted   sharesMln   shares
x       Tesla       2020    2020-12-31      3856.2405   3856240500
x       Tesla       2019    2019-12-31      3856.2405   3856240500
y       Google      2020    2020-12-31      2526.4506   2526450600
y       Google      2019    2019-12-31      2526.4506   2526450600
y       Google      2018    2018-12-31      2578.0992   2578099200
z       Big Apple   None    None            None        None

I've tried using pd.json_normalize(dataframe['Annual'], max_level=1), but I'm struggling to get the desired result shown above.

Any pointers will be appreciated.

  • Your Annual column is not valid JSON (the outer { } is missing): "0": {"date": ...} should be {"0": {"date": ...}}. Am I right? Commented Aug 4, 2021 at 8:41
  • Yes you are correct. I've updated the question to make it a valid dict. Thanks for your response though! Commented Aug 4, 2021 at 10:42

1 Answer

Take the values of each outer dict, which gives a collection of inner dicts, and turn each element into its own row with explode; the index label is repeated for rows that come from the same original row. Then expand each inner dict (the values of the outer dict) into columns. Finally, join the original dataframe with the new dataframe on the index.

>>> df

  stock       Name                                             Annual
0     x      Tesla  {'0': {'date': '2020', 'dateFormatted': '2020-...
1     y     Google  {'0': {'date': '2020', 'dateFormatted': '2020-...
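2     z  Big Apple                                                 {}

import pandas as pd

# One row per inner dict (index repeated), then one column per key.
data = df['Annual'].apply(lambda x: x.values()) \
                   .explode() \
                   .apply(pd.Series)

# Join on the repeated index and drop the original nested column.
df = df.join(data).drop(columns='Annual')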

Output result:

>>> df

  stock       Name  date dateFormatted  sharesMln        shares
0     x      Tesla  2020    2020-12-31  3856.2405  3.856240e+09
0     x      Tesla  2019    2019-12-31  3856.2405  3.856240e+09
1     y     Google  2020    2020-12-31  2526.4506  2.526451e+09
1     y     Google  2019    2019-12-31  2526.4506  2.526451e+09
1     y     Google  2018    2018-12-31  2578.0992  2.578099e+09
2     z  Big Apple   NaN           NaN        NaN           NaN
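
For reference, here is a minimal self-contained sketch of the same steps, with the sample data rebuilt from the question. The names `inner`, `exploded`, `data` and `result` are only illustrative. It also swaps `apply(pd.Series)` for building a DataFrame from a list of dicts, which is a choice made here (not part of the snippet above) so that the empty row is handled explicitly.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "stock": ["x", "y", "z"],
    "Name": ["Tesla", "Google", "Big Apple"],
    "Annual": [
        {"0": {"date": "2020", "dateFormatted": "2020-12-31",
               "sharesMln": "3856.2405", "shares": 3856240500},
         "1": {"date": "2019", "dateFormatted": "2019-12-31",
               "sharesMln": "3856.2405", "shares": 3856240500}},
        {"0": {"date": "2020", "dateFormatted": "2020-12-31",
               "sharesMln": "2526.4506", "shares": 2526450600},
         "1": {"date": "2019", "dateFormatted": "2019-12-31",
               "sharesMln": "2526.4506", "shares": 2526450600},
         "2": {"date": "2018", "dateFormatted": "2018-12-31",
               "sharesMln": "2578.0992", "shares": 2578099200}},
        {},
    ],
})

# Step 1: replace each outer dict with the list of its inner dicts.
inner = df["Annual"].apply(lambda d: list(d.values()))

# Step 2: one row per inner dict; the original index label is repeated.
# An empty list becomes a single NaN row.
exploded = inner.explode()

# Step 3: expand the inner dicts into columns. Non-dict entries (the NaN
# from the empty row) are replaced by {} so that row is filled with NaN.
records = [r if isinstance(r, dict) else {} for r in exploded]
data = pd.DataFrame(records, index=exploded.index)

# Step 4: join on the index and drop the nested column.
result = df.join(data).drop(columns="Annual")
print(result)
```

Because the `z` row has no annual data, the `shares` column contains a missing value and is therefore stored as float rather than int.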

2 Comments

Thanks for your great response. I had a Jupyter notebook issue earlier, but after resetting that, it worked like a charm. However, it would be nice if you could elaborate on what explode() does. Also, does dropna() drop all rows and columns with any NaN values, and can we restrict it to drop only columns where all values are NaN?
@roller. I updated my answer to give some details. I removed dropna because it's useless now. I used it before you fixed your column.
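
On the two points raised in the comments, a small sketch may help. It uses toy data that is not from the question, and the variable names are illustrative. It shows that explode() turns each element of a list-like into its own row while repeating the index label, and that dropna() by default drops rows containing any NaN, whereas axis=1 with how="all" drops only columns that are entirely NaN.

```python
import numpy as np
import pandas as pd

# explode(): each list element becomes its own row; the index label is
# repeated, and an empty list becomes a single NaN row.
s = pd.Series([[1, 2], [3], []], index=["a", "b", "c"])
exploded = s.explode()
print(exploded)

# dropna(): the default (axis=0, how="any") drops every row that
# contains at least one NaN.
frame = pd.DataFrame({"keep": [1.0, np.nan], "empty": [np.nan, np.nan]})
rows_dropped = frame.dropna()

# Restrict it to columns in which all values are NaN.
cols_dropped = frame.dropna(axis=1, how="all")
print(cols_dropped)
```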
