
I am trying to extract JSON keys and associated data nested in an API response into a Pandas DataFrame with each key and associated data element as a separate column.

I have tried the solution HERE: Parsing a JSON string which was loaded from a CSV using Pandas

But there are 2 problems. First, I have to convert the API request response into a CSV, and then back from a CSV to a DataFrame, which seems like a wasted step, but I was willing to do it if this would otherwise work.

Second, even when I do that, I get "JSONDecodeError: Expecting property name enclosed in double quotes."

I have also tried, and failed with, the solution described in this tutorial: https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas

import requests
from pandas import DataFrame
import pandas as pd
import json

teamgamedata_url = 'https://api.collegefootballdata.com/games/teams?year=2019&week=1&seasonType=regular'

teamgamedataresp = requests.get(teamgamedata_url)

dftg = DataFrame(teamgamedataresp.json())

This works, but produces a 'teams' column with lots of nested data, so to follow the CSV advice I convert to CSV:

dftg.to_csv(r'/path/teamgameinfoapi.csv')

def CustomParser(data):
    j1 = json.loads(data)
    return j1

csvtodf = pd.read_csv('/path/teamgameinfoapi.csv', 
                      converters={'teams':CustomParser}, header=0)

csvtodf[sorted(csvtodf['teams'][0].keys())] = csvtodf['teams'].apply(pd.Series)

I expected CustomParser to extract the JSON data into separate columns, but instead I get:

JSONDecodeError: Expecting property name enclosed in double quotes

I expected the last line of code to append the columns to the dataframe, but instead got:

KeyError: 'teams'
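
As far as I can tell, the decode error happens because to_csv writes the teams dicts as Python reprs with single quotes, which json.loads rejects. If I keep the CSV round trip, a minimal sketch of a workaround (same placeholder path as above) would be to parse with ast.literal_eval, which accepts Python literals, instead of strict JSON:

import ast
import pandas as pd

# Parse the single-quoted dict reprs written by to_csv with ast.literal_eval
# instead of json.loads.
csvtodf = pd.read_csv('/path/teamgameinfoapi.csv',
                      converters={'teams': ast.literal_eval}, header=0)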

3 Answers


An efficient and more idiomatic pandas way of doing this:

On pandas >= 0.25.1

teamgamedataresp = requests.get(teamgamedata_url)

d = teamgamedataresp.json()

# errors='ignore': if a record is missing the 'id' meta key, fill NaN instead of raising a KeyError
teams_df = pd.io.json.json_normalize(d, ['teams'], ['id'], errors='ignore')
print(teams_df)

teams_df = teams_df.explode('stats')
print(teams_df)

stats_df = pd.io.json.json_normalize(teams_df['stats'])
print(stats_df)

teams_df.drop(columns=['stats'], inplace=True)
data = pd.concat([teams_df.reset_index(drop=True), stats_df.reset_index(drop=True)], axis=1)
print(data)
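
On newer pandas (1.0 and later), pd.io.json.json_normalize is deprecated in favor of the top-level pd.json_normalize; a sketch of the same steps under that API (.tolist() hands json_normalize a plain list of dicts rather than a Series):

d = teamgamedataresp.json()

teams_df = pd.json_normalize(d, record_path=['teams'], meta=['id'], errors='ignore')
teams_df = teams_df.explode('stats')

# Flatten the per-row stat dicts into their own two-column frame
stats_df = pd.json_normalize(teams_df['stats'].tolist())

data = pd.concat([teams_df.drop(columns=['stats']).reset_index(drop=True),
                  stats_df.reset_index(drop=True)], axis=1)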

Step by step (data displayed for understanding purposes)

Normalizing the first level of records, i.e. teams:

        school         conference homeAway  points                                              stats         id
0       Vanderbilt                SEC     home       6  [{'category': 'rushingTDs', 'stat': '0'}, {'ca...  401110732
1          Georgia                SEC     away      30  [{'category': 'rushingTDs', 'stat': '2'}, {'ca...  401110732
2            Miami                ACC     home      20  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401110723
3          Florida                SEC     away      24  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401110723
4    Georgia State           Sun Belt     away      38  [{'category': 'rushingTDs', 'stat': '3'}, {'ca...  401110730
..             ...                ...      ...     ...                                                ...        ...
163           Navy  American Athletic     home      45  [{'category': 'rushingTDs', 'stat': '6'}, {'ca...  401117857
164   Gardner-Webb               None     away      28  [{'category': 'rushingTDs', 'stat': '3'}, {'ca...  401135910
165      Charlotte     Conference USA     home      49  [{'category': 'rushingTDs', 'stat': '4'}, {'ca...  401135910
166  Alabama State               None     away      19  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401114237
167            UAB     Conference USA     home      24  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401114237

[168 rows x 6 columns]

Exploding the list in the stats column into rows:

         school      conference homeAway  points                                           stats         id
0    Vanderbilt             SEC     home       6         {'category': 'rushingTDs', 'stat': '0'}  401110732
0    Vanderbilt             SEC     home       6         {'category': 'passingTDs', 'stat': '0'}  401110732
0    Vanderbilt             SEC     home       6   {'category': 'kickReturnYards', 'stat': '35'}  401110732
0    Vanderbilt             SEC     home       6      {'category': 'kickReturnTDs', 'stat': '0'}  401110732
0    Vanderbilt             SEC     home       6        {'category': 'kickReturns', 'stat': '2'}  401110732
..          ...             ...      ...     ...                                             ...        ...
167         UAB  Conference USA     home      24  {'category': 'netPassingYards', 'stat': '114'}  401114237
167         UAB  Conference USA     home      24       {'category': 'totalYards', 'stat': '290'}  401114237
167         UAB  Conference USA     home      24    {'category': 'fourthDownEff', 'stat': '0-1'}  401114237
167         UAB  Conference USA     home      24    {'category': 'thirdDownEff', 'stat': '1-13'}  401114237
167         UAB  Conference USA     home      24        {'category': 'firstDowns', 'stat': '16'}  401114237

[3927 rows x 6 columns]

Normalizing the stats column to get a DataFrame:

             category  stat
0          rushingTDs     0
1          passingTDs     0
2     kickReturnYards    35
3       kickReturnTDs     0
4         kickReturns     2
...               ...   ...
3922  netPassingYards   114
3923       totalYards   290
3924    fourthDownEff   0-1
3925     thirdDownEff  1-13
3926       firstDowns    16

[3927 rows x 2 columns]

Finally, concatenate the two DataFrames column-wise:

          school      conference homeAway  points         id         category  stat
0     Vanderbilt             SEC     home       6  401110732       rushingTDs     0
1     Vanderbilt             SEC     home       6  401110732       passingTDs     0
2     Vanderbilt             SEC     home       6  401110732  kickReturnYards    35
3     Vanderbilt             SEC     home       6  401110732    kickReturnTDs     0
4     Vanderbilt             SEC     home       6  401110732      kickReturns     2
...          ...             ...      ...     ...        ...              ...   ...
3922         UAB  Conference USA     home      24  401114237  netPassingYards   114
3923         UAB  Conference USA     home      24  401114237       totalYards   290
3924         UAB  Conference USA     home      24  401114237    fourthDownEff   0-1
3925         UAB  Conference USA     home      24  401114237     thirdDownEff  1-13
3926         UAB  Conference USA     home      24  401114237       firstDowns    16

[3927 rows x 7 columns]
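
Since the original goal was one column per stat, the long result above can also be pivoted back to wide form; a rough sketch (aggfunc='first' because the stat values are strings, and id plus school uniquely identify a team's game):

# One row per team-game, one column per stat category
wide = data.pivot_table(index=['id', 'school'], columns='category',
                        values='stat', aggfunc='first').reset_index()
print(wide)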

9 Comments

Thank you. I am going to work through this after I work through the original response because I want to clearly understand what's going on in each so I develop my own skillset.
I'll help with that. I'll update my answer.
Extraordinarily helpful, @Vishnudev.
If you find it helpful, encourage by upvoting/accepting as answer. @janalytics
In this particular case, the true first level of record is probably the game, defined by 'id'. But I see the value in being able to define other record levels as well; depending upon the use case, this can be far easier to read visually. I guess that's where learning pandas well is important: developing the ability to structure DataFrames differently depending upon the goal.

I would rather drill down manually, as the structure is complex:

a = teamgamedataresp.json()
buf = []
for game in a:
    for team in game['teams']:
        game_dict = dict(id=game['id'])
        for cat in ('school', 'conference', 'homeAway', 'points'):
            game_dict[cat] = team[cat]
        for stat in team['stats']:
            game_dict[stat['category']] = stat['stat']
        buf.append(game_dict)

>>> df = pd.DataFrame(buf)
>>> df
            id         school         conference homeAway  points  ... puntReturnTDs puntReturns interceptionYards interceptionTDs passesIntercepted
0    401110732     Vanderbilt                SEC     home       6  ...           NaN         NaN               NaN             NaN               NaN
1    401110732        Georgia                SEC     away      30  ...             0           4               NaN             NaN               NaN
2    401110723          Miami                ACC     home      20  ...             0           1                41               0                 2
3    401110723        Florida                SEC     away      24  ...             0           3               NaN             NaN               NaN
4    401110730  Georgia State           Sun Belt     away      38  ...           NaN         NaN                 0               0                 1
..         ...            ...                ...      ...     ...  ...           ...         ...               ...             ...               ...
163  401117857           Navy  American Athletic     home      45  ...             0           1               NaN             NaN               NaN
164  401135910   Gardner-Webb               None     away      28  ...           NaN         NaN                45               0                 3
165  401135910      Charlotte     Conference USA     home      49  ...             1           3               NaN             NaN               NaN
166  401114237  Alabama State               None     away      19  ...             0           2               NaN             NaN               NaN
167  401114237            UAB     Conference USA     home      24  ...             0           2                 0               0                 1

[168 rows x 31 columns]
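
If the columns come out in a different order on your setup (see the comments below), you can impose an explicit order afterwards; a minimal sketch:

# Identifying columns first, then whatever stat columns the data produced
fixed = ['id', 'school', 'conference', 'homeAway', 'points']
df = df[fixed + [c for c in df.columns if c not in fixed]]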

4 Comments

Hmm. Did you take a step to reorder the columns? Mine come out in a different order. Also, if you'll forgive a newbie question, how do you get your code to display that way (as opposed to the raw form in which mine appeared)?
Nope, I did not. Could be the pandas version? Mine is 0.25.1 in a Linux Anaconda Python 3.7 environment. For code formatting, you have to select the code and mark it as code in the editing box. By pandas version I actually mean Python version: since Python 3.6, all dictionaries are ordered, which means they preserve key-insertion order.
Mine is 0.24.2 in Mac Anaconda Python 7 environment. It seems to want to put the columns in alphabetical order, which is a bit annoying. Your solution worked great, but since I'm learning, I want to confirm my understanding of how you attacked this. Will try to post commented code back with the select-and-mark method. Thanks for your patience.
Having to do this by answering my own question because of space limitations.

I was unable to fit this into a comment; I apologize for breaking protocol. Hope I have followed you, @crayxt. Please let me know if I've misunderstood.

a = teamgamedataresp.json()
#defining a as the response to our get request for this data, in JSON format 
buf = []
#Not sure what this means, but I think we're creating a bucket where a new dict can go
for game in a:
    #defining game as the key unit for evaluation in our desired dataset
    for team in game['teams']:
        #telling python that for each unique game we want to identify each
        #team identified by the 'teams' key, and  . . . 
        game_dict = dict(id=game['id'])
        #tells python that our key unit, game, is tied to the game ID#
        for cat in ('school', 'conference', 'homeAway', 'points'):
            #defining the categories where there is no further nesting to explore
            game_dict[cat] = team[cat]
            #we're making sure that python knows that the categories used for the game
            #should be the same one used for the team.
        for stat in team['stats']:
            game_dict[stat['category']] = stat['stat']
            #now we need to dive into the further nesting in the stats category.
            #each team's statistics should be labeled with the JSON 'category" key
            #and given a value equal to the "stat" value key given in JSON
        buf.append(game_dict)
        #adds the newly created stat columns to the dataset
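
If I've followed correctly, each pass through the inner loop leaves one flat dict per team in buf, and pd.DataFrame(buf) turns that list of dicts into one row per team, with NaN wherever a team lacks a category. Reconstructed from the first Vanderbilt row in the outputs above, one entry looks roughly like:

# First entry of buf; the remaining stat categories continue in the same pattern
example_row = {
    'id': 401110732, 'school': 'Vanderbilt', 'conference': 'SEC',
    'homeAway': 'home', 'points': 6,
    'rushingTDs': '0', 'passingTDs': '0', 'kickReturnYards': '35',
}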

