4

I'm trying to create a pandas dataframe form json file. I've seen a multiple solutions to this problem which uses built in functions from_dict/json_normalize yet I'm unable to apply it to my code. Here's how my data is structured in json file:

     "data": [
   {
      "groups": {
         "data": [
               {
               "group": "Math",
               "year_joined": "2009"
               },
               {
               "group_name": "History",
               "year_joined": "2011"
               },
               {
               "group_name": "Biology",
               "year_joined": "2010"
               }
         ]
      },
      "id": "12512"
   },

When I'm trying to normalize this data with pandas function like this:

path = 'mypath'
f = open(path)
data = json.load(f)

test = pd.json_normalize(
            data['data'], 
            errors='ignore') 

I just receive something like this:

    id      groups.data
0   12512   [{'group_name': 'Math', 'year_joined': '2009', 'gr...
1   23172   [{'group_name': 'Chemistry', 'year_joined': '2005'...

I want this data to look like this (solution 1):

    id      group     year_joined
0   12512   group1    year1
1   12512   group2    year2
2   12512   group3    year3

Or like this (solution 2):

    id      group                   year_joined
0   12512   group1,group2,group3    year1,year2,year3
1   23172   group4,group5           year4,year5

How can i achieve it? I tried passing 'record_path' parameter to 'json_normalize' function but it doesn't change anything. I tried to use 'DataFrame.from_dict' function to work around this but I failed. The only way I was able to get to solution 1 was to just create multiple loops that iterated through everything in json file and add it to separate list. It kinda works but takes a lot of time on bigger datasets.

How could i use built-in pandas tools to process files which are nested as dictionaries in 3rd layer of the file as presented above?

3 Answers 3

4
  • given you have dict with nested list
    1. create dataframe from overall structure
    2. explode() embedded list
    3. expand nested dict with apply(pd.Series)
d = {'groups': {'data': [{'group': 'Math', 'year_joined': '2009'},
   {'group_name': 'History', 'year_joined': '2011'},
   {'group_name': 'Biology', 'year_joined': '2010'}]},
 'id': '12512'}

pd.json_normalize(d).explode("groups.data").reset_index(drop=True).pipe(
    lambda d: d["id"].to_frame().join(d["groups.data"].apply(pd.Series))
)

id group year_joined group_name
0 12512 Math 2009 nan
1 12512 nan 2011 History
2 12512 nan 2010 Biology
Sign up to request clarification or add additional context in comments.

Comments

1

You need to collect the information from the data dictionary

solution 1

d = {}
for group in data["data"]:
    groups = [x["group_name"] for x in group['groups']["data"]]
    d['id'] = d.get('id', []) + [group['id']] * len(groups)
    d['group'] = d.get('group', []) + groups
    d['year_joined'] = d.get('year_joined', []) + [x["year_joined"] for x in group['groups']["data"]]

df = pd.DataFrame(d)

Output

      id      group year_joined
0  12512       Math        2009
1  12512    History        2011
2  12512    Biology        2010
3  23172  Chemistry        2007
4  23172  Economics        2008

solution 2

d = {}
for group in data["data"]:
    d['id'] = d.get('id', []) + [group['id']]
    d['group'] = d.get('group', []) + [','.join(x["group_name"] for x in group['groups']["data"])]
    d['year_joined'] = d.get('year_joined', []) + [','.join(x["year_joined"] for x in group['groups']["data"])]

df = pd.DataFrame(d)

Output

      id                 group     year_joined
0  12512  Math,History,Biology  2009,2011,2010
1  23172   Chemistry,Economics       2007,2008

Comments

0

This seems to work for your example:

data = [ # Original data from question
   {
      "groups": {
         "data": [
               {
               "group": "Math",
               "year_joined": "2009"
               },
               {
               "group_name": "History",
               "year_joined": "2011"
               },
               {
               "group_name": "Biology",
               "year_joined": "2010"
               }
         ]
      },
      "id": "12512"
   },
]
# Use the record_path to extract the list we are interested in, and make sure we retain ID
df = pandas.json_normalize(data, record_path=['groups','data'], meta=['id'])
# Combine the group and group_name columns into a single column as they appear mutually exclusive
df["group"] = df["group_name"].fillna(df["group"])
# Discard the now unnecessary column
df.drop(columns='group_name', inplace=True)

It gives:

year_joined group id
0 2009 Math 12512
1 2011 History 12512
2 2010 Biology 12512

To create the second dataframe:

df.groupby(['id']).agg({'year_joined':list,'group':list})
id year_joined group
12512 ['2009', '2011', '2010'] ['Math', 'History', 'Biology']

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.