0

I am dealing with a DataFrame (imported from .csv) that I want to convert into a nested JSON, but I am not able to created an additional nested level. I am trying to explain with an example. At the very end of the work the result is imported into MongoDB with pymongo.

----------------------------------------------------
worker_id | gender | employer_id | year | job_type |
----------------------------------------------------
WORK_1    | M      | EMPL_2      | 1990 | Att      |
----------------------------------------------------   
WORK_1    | M      | EMPL_1      | 1991 | Mis      |
---------------------------------------------------- 
WORK_1    | M      | EMPL_1      | 1993 | Att      |
----------------------------------------------------     
WORK_2    | F      | EMPL_3      | 1995 | Att      |
----------------------------------------------------  
WORK_2    | F      | EMPL_3      | 1992 | Mis      |
---------------------------------------------------- 
WORK_2    | F      | EMPL_3      | 1994 | Att      |
----------------------------------------------------  

df = pd.DataFrame({'worker_id':['WORK_1','WORK_1','WORK_1','WORK_2','WORK_2','WORK_2'],
                    'gender':['M','M','M','F','F','F'],
                    'employer_id':['EMPL_2','EMPL_1','EMPL_1','EMPL_3','EMPL_3','EMPL_3'],
                    'year':[1990,1991,1993,1995,1992,1994],
                    'job_type':['Att','Mis','Att','Att','Mis','Att']
                    })

The JSON that I want to obtain should follow a structure like the following one:

{ "worker_id": "WORK_1",
  "gender": "M",
  "job_type" : [
               { "Att": [
                        { 
                          "employer_id": "EMPL_1",
                          "year": 1990
                        },
                        { 
                          "employer_id": "EMPL_2",
                          "year": 1993
                        }
                        ]      
               },      
               { "Mis": [
                        { 
                          "employer_id": "EMPL_1",
                          "year": 1991
                        }
                        ]      
               } 
               ]
},
{ "worker_id": "WORK_2",
  "gender": "F",
  "job_type" : [
               { "Att": [
                        { 
                          "employer_id": "EMPL_3",
                          "year": 1994
                        },
                        { 
                          "employer_id": "EMPL_3",
                          "year": 1995
                        }
                        ]      
               },      
               { "Mis": [
                        { 
                          "employer_id": "EMPL_3",
                          "year": 1992
                        }
                        ]      
               } 
               ]
}

I was able to nest within the category (array) of 'job_type' an object for each specific work contract (each row of the sample should represent a specific work contract, followed by several other variables) by following several helpful discussion here on Stack Overflow. Nonetheless, I want to differentiate also between kind of works (in the example between 'Mis' and 'Att'), then create another nested level.

The code that I used to nest among job contract between workers is the following one.

finalList = []
finalDict = {}

grouped = df.groupby(['worker_id',
                      'gender'
                     ])
for key, value in grouped:

    dictionary = {}

    j = grouped.get_group(key).reset_index(drop = True)
    dictionary['worker_id'] = j.at[0, 'worker_id']
    dictionary['gender'] = j.at[0, 'gender']

    dictList = []
    anotherDict = {}

    for i in j.index:

        anotherDict['employer_id'] = j.at[i, 'employer_id']       
        anotherDict['year'] = j.at[i, 'year']
        anotherDict['job_type'] = j.at[i, 'job_type']

        dictList.append(anotherDict.copy())

    dictionary['job_type'] = dictList

    finalList.append(dictionary)

I hope someone could kindly help me. Thank you in advance!

Update

I tried to enhance the code with the script here below (I followed this thread). Unfortunately, I am still not getting what I want.

# Generates a column for each kind of 'job_type'
df['att'] = ['Att' if x == 'Att' else None for x in df['job_type']] 
df['mis'] = ['Mis' if x == 'Mis' else None for x in df['job_type']] 

# Aggregate for the 'job_type' = 'Mis'
df_att = df.dropna(subset = ['att'])
df_att.drop(columns=['mis'])            
att = (df_att.groupby(['worker_id','gender'], as_index = True)
             .apply(lambda x: x[['employer_id','year','job_type']].to_dict('r'))
             .reset_index()
             .rename(columns = {0:'Att'}))

# Aggregate for the 'job_type' = 'Som'
df_mis= df.dropna(subset = ['mis']) 
df_mis.drop(columns=['att'])
mis = (df_mis.groupby(['worker_id','gender'], as_index = False)
             .apply(lambda x: x[['employer_id','year','job_type']].to_dict('r'))
             .reset_index()
             .rename(columns = {0:'Mis'}))

# Append
df_all = att.append(mis)

# Aggregate for 'worker_id' and 'gender'
j = (df_all.groupby(['worker_id','gender'], as_index = False)
             .apply(lambda x: x[['Att','Mis']].to_dict('r'))
             .reset_index()
             .rename(columns = {0:'job_type'})
             .to_json(orient = 'records'))

print(json.dumps(json.loads(j), indent = 4, sort_keys = True))

What I am getting so far...

[
    {
        "gender": "M",
        "job_type": [
            {
                "Att": [
                    {
                        "employer_id": "EMPL_2",
                        "job_type": "Att",
                        "year": 1990
                    },
                    {
                        "employer_id": "EMPL_1",
                        "job_type": "Att",
                        "year": 1993
                    }
                ],
                "Mis": null
            },
            {
                "Att": null,
                "Mis": [
                    {
                        "employer_id": "EMPL_1",
                        "job_type": "Mis",
                        "year": 1991
                    }
                ]
            }
        ],
        "worker_id": "WORK_1"
    },
    {
        "gender": "F",
        "job_type": [
            {
                "Att": [
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Att",
                        "year": 1995
                    },
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Att",
                        "year": 1994
                    }
                ],
                "Mis": null
            },
            {
                "Att": null,
                "Mis": [
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Mis",
                        "year": 1992
                    }
                ]
            }
        ],
        "worker_id": "WORK_2"
    }
]

1 Answer 1

1

Here's a solution that loops over the unique worker_id values and builds up a list of dictionaries for each worker_id:

import pandas as pd
import json
df = pd.DataFrame({'worker_id':['WORK_1','WORK_1','WORK_1','WORK_2','WORK_2','WORK_2'],
                    'gender':['M','M','M','F','F','F'],
                    'employer_id':['EMPL_2','EMPL_1','EMPL_1','EMPL_3','EMPL_3','EMPL_3'],
                    'year':[1990,1991,1993,1995,1992,1994],
                    'job_type':['Att','Mis','Att','Att','Mis','Att']})

df_G=df[['worker_id','gender']].drop_duplicates()
all_dicts=[]
for indx,vals in df_G.iterrows():
    this_dict=vals.to_dict()
    job_dict=(df[df.worker_id==vals['worker_id']]
        .groupby(['job_type']).apply(lambda x: x[['employer_id','year']]
        .to_dict('r')).to_dict())
    this_dict['job_type']=[]
    for key,val in job_dict.items():         
        print({key:val})
        this_dict['job_type'].append({key:val})
    all_dicts.append(this_dict)

(df[df.worker_id==vals['worker_id']].groupby(['job_type']).apply(lambda x: x[['employer_id','year']].to_dict('r')))

print(json.dumps(all_dicts, indent = 4, sort_keys = True))

Prints out:

[
    {
        "gender": "M",
        "job_type": [
            {
                "Mis": [
                    {
                        "employer_id": "EMPL_1",
                        "year": 1991
                    }
                ]
            },
            {
                "Att": [
                    {
                        "employer_id": "EMPL_2",
                        "year": 1990
                    },
                    {
                        "employer_id": "EMPL_1",
                        "year": 1993
                    }
                ]
            }
        ],
        "worker_id": "WORK_1"
    },
    {
        "gender": "F",
        "job_type": [
            {
                "Mis": [
                    {
                        "employer_id": "EMPL_3",
                        "year": 1992
                    }
                ]
            },
            {
                "Att": [
                    {
                        "employer_id": "EMPL_3",
                        "year": 1995
                    },
                    {
                        "employer_id": "EMPL_3",
                        "year": 1994
                    }
                ]
            }
        ],
        "worker_id": "WORK_2"
    }
]

Perhaps not the most efficient or pythonic, but it works. And if I remember pymongo correctly you can pass it the list of dicts to insert.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.