I am trying to pull out an element from this JSON data and format it into another column in my pandas DataFrame.

Here is the code I have so far:

#Import libraries
import json
import requests
from IPython.display import JSON
import pandas as pd 

#Load data
astronaut_db_url = 'https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb.json'
astronauts_db = requests.get(astronaut_db_url).json()

#Format data
df = pd.json_normalize(astronauts_db['astronauts'])

df_astro = df[['_id','astroNumber','awards','name','gender','inSpace','overallNumber','spacewalkCount','species','speciesGroup',
                'totalMinutesInSpace','totalSecondsSpacewalking','lastLaunchDate.utc']]


#Get row per award
df_awards = df_astro.explode(['awards']).reset_index(drop=True)
df_awards.head()

df_awards['awards'][0]['title']

I want to grab the title of the award for each astronaut in my DataFrame and create a new column with the list of awards in one cell that looks like the following:

Astronaut_ID    Awards
dh3405kdmnd     [First Person In Space, First Person to Cross Karman Line]
ert549fkfl3     [Crossed Karman Line, First Person on Moon]

My idea for tackling this problem was to:

  1. Get a row for each award for every astronaut
  2. Strip the JSON cells down to just the title
  3. Recombine in one cell per astronaut

I am not sure how to complete step 2 of this process. Can someone help point me in the right direction?

2 Answers

I'd skip the explode step, keep awards as a list of dictionaries, and apply a list comprehension to each row that pulls out every award's title.

import json
import requests
from IPython.display import JSON
import pandas as pd

#Load data
astronaut_db_url = 'https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb.json'
astronauts_db = requests.get(astronaut_db_url).json()

#Format data
df = pd.json_normalize(astronauts_db['astronauts'])

df_astro = df[['_id','astroNumber','awards','name','gender','inSpace','overallNumber','spacewalkCount','species','speciesGroup',
                'totalMinutesInSpace','totalSecondsSpacewalking','lastLaunchDate.utc']]

#Keep one row per astronaut and replace each award dict with its title
df_awards = df_astro[['_id', 'awards']].copy()
df_awards['awards'] = df_awards['awards'].apply(lambda awards: [award['title'] for award in awards])
df_awards.columns = ['Astronaut_ID', 'Awards']

print(df_awards.head())
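
For comparison, the three-step plan from the question (explode, strip to the title, recombine) can also be carried through as written. The following is only a sketch on made-up toy records shaped like the entries in astronauts_db['astronauts']; the names records, exploded, and result are illustrative, and the isinstance check assumes that some astronauts may have an empty awards list, which explode turns into a missing value.

```python
import pandas as pd

# Toy records (made-up values) mimicking the shape of astronauts_db['astronauts']
records = [
    {"_id": "a1", "awards": [{"title": "Crossed Kármán Line"}, {"title": "ISS Visitor"}]},
    {"_id": "b2", "awards": [{"title": "Crossed Kármán Line"}]},
    {"_id": "c3", "awards": []},
]
df = pd.DataFrame(records)

# Step 1: one row per award; an empty list becomes a single row holding NaN
exploded = df.explode("awards").reset_index(drop=True)

# Step 2: reduce each award dict to its title, leaving missing values as None
exploded["title"] = exploded["awards"].map(
    lambda award: award["title"] if isinstance(award, dict) else None
)

# Step 3: drop the rows without a title and collect titles per astronaut
result = (
    exploded.dropna(subset=["title"])
    .groupby("_id", as_index=False)["title"]
    .agg(list)
    .rename(columns={"_id": "Astronaut_ID", "title": "Awards"})
)
print(result)
```

In this sketch, astronauts without any award drop out of the result. If they should stay, the per-row list comprehension shown earlier in this answer is probably the simpler route, since it never removes rows.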

Instead of doing steps 1-2 from the question by hand, you can pass record_path and meta directly to json_normalize, which returns one row per award with the astronaut's _id attached. Step 3 can then be done with groupby + agg(list):

df_awards = (
    pd.json_normalize(astronauts_db['astronauts'], record_path='awards', meta='_id')
    .groupby('_id', as_index=False)['title']
    .agg(list)
)
print(df_awards.head(5))

Output:

                                    _id                                                 title  
0  0554c903-e8a6-43c5-8da8-76fb3495e93f     [First Steppe Tortoise  (Agrionemys horsfieldii)]  
1  0729eec8-ae2f-44a5-900f-08b2f491c8fe                    [Crossed Kármán Line, ISS Visitor]  
2  0ff02f81-a865-465d-97b8-cd6be84c56aa     [Crossed Kármán Line, ISS Visitor, Space Resid...  
3  157edd2d-58a0-4f47-b85d-4c6ade14a973                                 [Crossed Kármán Line]  
4  15c82ce2-10d5-45e7-848e-6df388307e1f                                 [Crossed Kármán Line]  
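
To see what record_path and meta do before the groupby, here is a small sketch on made-up toy records (not the real dataset). The names records, flat, awards, and with_missing are illustrative. The left merge at the end is an optional extra, not part of the one-liner above: it assumes you want to keep astronauts whose awards list is empty, since json_normalize produces no rows for them.

```python
import pandas as pd

# Toy records (made-up values) mimicking the shape of astronauts_db['astronauts']
records = [
    {"_id": "a1", "awards": [{"title": "Crossed Kármán Line"}, {"title": "ISS Visitor"}]},
    {"_id": "b2", "awards": [{"title": "Crossed Kármán Line"}]},
    {"_id": "c3", "awards": []},
]

# record_path flattens each award dict into its own row;
# meta copies the parent record's _id onto every such row
flat = pd.json_normalize(records, record_path="awards", meta="_id")
print(flat)

# Collect the titles back into one list per astronaut
awards = flat.groupby("_id", as_index=False)["title"].agg(list)

# Optional: restore astronauts with no awards (their title cell ends up missing)
all_ids = pd.DataFrame({"_id": [record["_id"] for record in records]})
with_missing = all_ids.merge(awards, on="_id", how="left")
print(with_missing)
```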
