16

I have a json data (coming from mongodb) containing thousands of records (so an array/list of json object) with a structure like the below one for each object:

{
   "id":1,
   "first_name":"Mead",
   "last_name":"Lantaph",
   "email":"[email protected]",
   "gender":"Male",
   "ip_address":"231.126.209.31",
   "nested_array_to_expand":[
      {
         "property":"Quaxo",
         "json_obj":{
            "prop1":"Chevrolet",
            "prop2":"Mercy Streets"
         }
      },
      {
         "property":"Blogpad",
         "json_obj":{
            "prop1":"Hyundai",
            "prop2":"Flashback"
         }
      },
      {
         "property":"Yabox",
         "json_obj":{
            "prop1":"Nissan",
            "prop2":"Welcome Mr. Marshall (Bienvenido Mister Marshall)"
         }
      }
   ]
}

When loaded in a dataframe the "nested_array_to_expand" is a string containing the json (I do use "json_normalize" during loading). The expected result is to get a dataframe with 3 row (given the above example) and new columns for the nested objects such as below:

index   email first_name gender  id      ip_address last_name  \
0  [email protected]       Mead   Male   1  231.126.209.31   Lantaph   
1  [email protected]       Mead   Male   1  231.126.209.31   Lantaph   
2  [email protected]       Mead   Male   1  231.126.209.31   Lantaph   

  test.name                                      test.obj.ahah test.obj.buzz  
0     Quaxo                                      Mercy Streets     Chevrolet  
1   Blogpad                                          Flashback       Hyundai  
2     Yabox  Welcome Mr. Marshall (Bienvenido Mister Marshall)        Nissan  

I was able to get that result with the below function but it extremely slow (around 2s for 1k records) so I would like to either improve the existing code or find a completely different approach to get this result.

def expand_field(field, df, parent_id='id'):
    all_sub = pd.DataFrame()
    # we need an id per row to be able to merge back dataframes
    # if no id, then we will create one based on index of rows
    if parent_id not in df:
        df[parent_id] = df.index

    # go through all rows and create a new dataframe with values
    for i, row in df.iterrows():
        try:
            sub = json_normalize(df[field].values[i])
            sub = sub.add_prefix(field + '.')
            sub['parent_id'] = row[parent_id]
            all_sub = all_sub.append(sub)
        except:
            print('crash')
            pass
    df = pd.merge(df, all_sub, left_on=parent_id, right_on='parent_id', how='left')
    #remove old columns
    del df["parent_id"]
    del df[field]
    #return expanded dataframe
    return df

Many thanks for your help.

===== EDIT for answering comment ====

The data loaded from mongodb is an array of object. I load it with the following code:

data = json.loads(my_json_string)
df = json_normalize(data)

The output give me a dataframe with df["nested_array_to_expand"] as dtype object (string)

0    [{'property': 'Quaxo', 'json_obj': {'prop1': '...
Name: nested_array_to_expand, dtype: object
3
  • it is dict or json Commented Dec 12, 2017 at 4:19
  • Thank you for checking it out. I edited the question to answer you with example and code. Basically my json data is an array of object that i load then normalize with json_normalize. The column "nest_array_to_expand" is then a dtype object. Commented Dec 12, 2017 at 4:36
  • pd.json_normalize(data_list, "nested_array_to_expand", ['email', 'first_name', 'gender', 'id', 'ip_address', 'last_name']) where, data_list is a list of dictionary. Commented Dec 8, 2020 at 9:25

3 Answers 3

18

I propose an interesting answer I think using pandas.json_normalize.
I use it to expand the nested json -- maybe there is a better way, but you definitively should consider using this feature. Then you have just to rename the columns as you want.

import io
from pandas import json_normalize

# Loading the json string into a structure
json_dict = json.load(io.StringIO(json_str))

df = pd.concat([pd.DataFrame(json_dict), 
                json_normalize(json_dict['nested_array_to_expand'])], 
                axis=1).drop('nested_array_to_expand', 1)

enter image description here

Sign up to request clarification or add additional context in comments.

3 Comments

Thank you. I will check with my setup :)
Hi Romain, unfortunately that code would not work in my setup as I get a list of dict from my json and not directly a dict.
I figured it out and posted a solution below.
6

The following code is what you want. You can unroll the nested list using python's built in list function and passing that as a new dataframe. pd.DataFrame(list(json_dict['nested_col']))

You might have to do several iterations of this, depending on how nested your data is.

from pandas.io.json import json_normalize


df= pd.concat([pd.DataFrame(json_dict), pd.DataFrame(list(json_dict['nested_array_to_expand']))], axis=1).drop('nested_array_to_expand', 1)

Comments

2
import pandas as pd
import json

data = '''
[
  {
   "id":1,
   "first_name":"Mead",
   "last_name":"Lantaph",
   "email":"[email protected]",
   "gender":"Male",
   "ip_address":"231.126.209.31",
   "nested_array_to_expand":[
      {
         "property":"Quaxo",
         "json_obj":{
            "prop1":"Chevrolet",
            "prop2":"Mercy Streets"
         }
      },
      {
         "property":"Blogpad",
         "json_obj":{
            "prop1":"Hyundai",
            "prop2":"Flashback"
         }
      },
      {
         "property":"Yabox",
         "json_obj":{
            "prop1":"Nissan",
            "prop2":"Welcome Mr. Marshall (Bienvenido Mister Marshall)"
         }
      }
   ]
  }
]
'''
data = json.loads(data)
result = pd.json_normalize(data, "nested_array_to_expand", 
                           ['email', 'first_name', 'gender', 'id', 'ip_address', 'last_name'])

result


  property json_obj.prop1                                     json_obj.prop2  \
0    Quaxo      Chevrolet                                      Mercy Streets   
1  Blogpad        Hyundai                                          Flashback   
2    Yabox         Nissan  Welcome Mr. Marshall (Bienvenido Mister Marshall)   

                      email first_name gender id      ip_address last_name  
0  [email protected]       Mead   Male  1  231.126.209.31   Lantaph  
1  [email protected]       Mead   Male  1  231.126.209.31   Lantaph  
2  [email protected]       Mead   Male  1  231.126.209.31   Lantaph  

More information about json_normalize: https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.