Read nested JSON into Pandas DataFrame

Question

Background Info -
I have a JSON response from an API call, which I am trying to save in a pandas DataFrame, whilst maintaining the same structure, as when I view in a system I have called the data from.

Functions that calls JSON Response -
def api_call(): calls the API (Note: url_list only contains 1x url at present) and saves the response in the api_response variable, using json.loads(response.text)

def api_call():
    url_list = url_constructor()
    for url in url_list:
        response = requests.get(url_list[0], auth = HTTPBasicAuth(key, secret), headers={"Firm":"583"})
    api_response = json.loads(response.text)
    return api_response

Function that saves response to file and also returns it:
def response_writer(): saves api_response as a JSON file. It also returns api_response.

def response_writer():
    api_response = api_call()
    timestr = datetime.datetime.now().strftime("%Y-%m-%d-%H:%M")
    filename = 'api_response_'+timestr+'.json'
    with open(filename, 'w') as output_data:
        json.dump(api_response, output_data)
        print("-------------------------------------------------------\n", 
              "API RESPONSE SAVED:", filename, "\n-------------------------------------------------------")
    return api_response

JSON Response -

{
  "meta": {
    "columns": [
      {
        "key": "node_id",
        "display_name": "Entity ID",
        "output_type": "Word"
      },
      {
        "key": "bottom_level_holding_account_number",
        "display_name": "Holding Account Number",
        "output_type": "Word"
      },
      {
        "key": "value",
        "display_name": "Adjusted Value (USD)",
        "output_type": "Number",
        "currency": "USD"
      },
      {
        "key": "node_ownership",
        "display_name": "% Ownership",
        "output_type": "Percent"
      },
      {
        "key": "model_type",
        "display_name": "Model Type",
        "output_type": "Word"
      },
      {
        "key": "valuation",
        "display_name": "Valuation (USD)",
        "output_type": "Number",
        "currency": "USD"
      },
      {
        "key": "_custom_jb_custodian_305769",
        "display_name": "JB Custodian",
        "output_type": "Word"
      },
      {
        "key": "top_level_owner",
        "display_name": "Top Level Owner",
        "output_type": "Word"
      },
      {
        "key": "top_level_legal_entity",
        "display_name": "Top Level Legal Entity",
        "output_type": "Word"
      },
      {
        "key": "direct_owner",
        "display_name": "Direct Owner",
        "output_type": "Word"
      },
      {
        "key": "online_status",
        "display_name": "Online Status",
        "output_type": "Word"
      },
      {
        "key": "financial_service",
        "display_name": "Financial Service",
        "output_type": "Word"
      },
      {
        "key": "_custom_placeholder_461415",
        "display_name": "Placeholder or Fee Basis",
        "output_type": "Boolean"
      },
      {
        "key": "_custom_close_date_411160",
        "display_name": "Account Close Date",
        "output_type": "Date"
      },
      {
        "key": "_custom_ownership_audit_note_425843",
        "display_name": "Ownership Audit Note",
        "output_type": "Word"
      }
    ],
    "groupings": [
      {
        "key": "holding_account",
        "display_name": "Holding Account"
      }
    ]
  },
  "data": {
    "type": "portfolio_views",
    "attributes": {
      "total": {
        "name": "Total",
        "columns": {
          "direct_owner": null,
          "node_ownership": null,
          "online_status": null,
          "_custom_ownership_audit_note_425843": null,
          "model_type": null,
          "_custom_placeholder_461415": null,
          "top_level_owner": null,
          "_custom_close_date_411160": null,
          "valuation": null,
          "bottom_level_holding_account_number": null,
          "_custom_jb_custodian_305769": null,
          "financial_service": null,
          "top_level_legal_entity": null,
          "value": null,
          "node_id": null
        },
        "children": [
          {
            "entity_id": 4754837,
            "name": "Apple Holdings Adv (748374923)",
            "grouping": "holding_account",
            "columns": {
              "direct_owner": "Apple Holdings LLC",
              "node_ownership": 1,
              "online_status": "Online",
              "_custom_ownership_audit_note_425843": null,
              "model_type": "Holding Account",
              "_custom_placeholder_461415": false,
              "top_level_owner": "Forsyth Family",
              "_custom_close_date_411160": null,
              "valuation": 10423695.609450001,
              "bottom_level_holding_account_number": "748374923",
              "_custom_jb_custodian_305769": "Laverockbank",
              "financial_service": "laverockbankcustodianservice",
              "top_level_legal_entity": "Apple Holdings LLC",
              "value": 10423695.609450001,
              "node_id": "4754837"
            },
          }
        ]
      }
    }
  },
  "included": []
}

Expected structure of JSON in Pandas DataFrame -
This is the structure I am trying to convey in my pandas DataFrame -

| Holding Account                 | Entity ID | Holding Account Number | Adjusted Value (USD) | % Ownership | Model Type      | Valuation (USD) | JB Custodian | Top Level Owner | Top Level Legal Entity          | Direct Owner                    | Online Status | Financial Service   | Placeholder or Fee Basis | Account Close Date | Ownership Audit Note |
|---------------------------------|-----------|------------------------|----------------------|-------------|-----------------|-----------------|--------------|-----------------|---------------------------------|---------------------------------|---------------|---------------------|--------------------------|--------------------|----------------------|
| Apple Holdings Adv (748374923)  | 4754837   | 748374923              | $10,423,695.06       | 100.00%     | Holding Account | $10,423,695.06  | BRF          | Forsyth Family  | Apple Holdings Partners LLC     | Apple Holdings Partners LLC     | Online        | custodianservice    | No                       | -                  | -                    |

My interpretation of the JSON Structure -
It looks like I need to concentrate on {'columns: (which has the column headers), and the 'children' (which represent rows of data, in my case, just 1x row) of 'data':. I can ignore 'groupings': [{'key': 'holding_account', 'display_name': 'Holding Account'}]},, as this ultimately is how the data sorted in the system.

Does anyone have advice on how I might take the JSON and load into a DataFrame with the demonstrated structure?

My interpretation is that I need to set the display_names [columns] as headers and then map the respective children values under each respective display_names / headers. Note: Ordinarily, there would be more children (representing each line of data for my DataFrame), however I have stripped all but 1x out, to make it easier to interpret.

Check out Glom (pypi.org/project/glom) which can help you format the data into the format needed and then use Pandas to create a dataframe from that. — 576i
– 576i, Commented Dec 5, 2021 at 8:31

Mahrkeenerh · Accepted Answer · 2021-12-05 11:11:33Z

1

I'm not sure this is the best way to unpack a dictionary, but it works:
(It's used to keep the child "metadata" like the id (duplicate), and holding account full name)

def unpack_dict(item, out):
    for k, v in item.items():
        if type(v) == dict:
            unpack_dict(v, out)
        else:
            out[k] = v
    return out

Now we need to use this on every single child to get the data

From your example, it seems like you want to keep the Holding account (from child), but you don't want the entity_id, as it is duplicated in node_id?

Not sure, so I'll just include all columns with their "raw" names

columns = unpack_dict(res["data"]["attributes"]["total"]["children"][0]
children = res["data"]["attributes"]["total"]["children"]
data = []

for i in children:
    data.append(list(unpack_dict(i, {}).values()))

And creating a dataframe from that:

>>> pd.DataFrame(data=data, columns = columns)
   entity_id                            name  ...         value  node_id
0    4754837  Apple Holdings Adv (748374923)  ...  1.042370e+07  4754837

[1 rows x 18 columns]

This can be now altered to have the display names instead of these raw ones. You might need to remove some columns though, as I mentioned above, the id is duplicated, you mentioned grouping etc.

If you are dealing with large amounts of data (thousands of entries) and it takes long time to parse it, the excessive columns can be removed before inserting into data to save some time later on.

To rename a columns using a dict:

df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})

edited Dec 5, 2021 at 11:11

answered Dec 5, 2021 at 11:04

Mahrkeenerh

1,1201 gold badge11 silver badges27 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

William Over a year ago

Many thanks for your contribution. Your explanation makes sense, however can you give context on how to integrate this function into my code? The API response is stored in a variable called 'api_response' Note: I have inc. the functions that call and store the API response in my question.

Mahrkeenerh Over a year ago

@William right, instead of using res use your api_response and it should work, since it's a dict

William Over a year ago

Thanks for confirming. Also, how should this be structured i.e., the placement of the 'def unpack_dict(item, out):' function and proceeding code you mentioned? I'm just trying to understand how to piece together, so I can learn and understand how it works and of course test!

Mahrkeenerh Over a year ago

@William yes, you can just place it one after another, and then with the final pd.DataFrame(data=data, columns = columns) you will get the dataframe table

rosa b. · Accepted Answer · 2021-12-07 11:02:08Z

0

+100

I suggest using pd.json_normalize() ( https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html ) which helps transform JSON data into a pandas DataFrame.

Note 1: Following I assume the data is available in a python dictionary called data. For testing purpose I used

import json
json_data = '''
{
  "meta": {
      # ....
  },
  #...
  "included": []
}
'''
data = json.loads(json_data)

where json_data is your JSON response. As json.loads() doesn't accept a final comma, I omitted the comma after the children object.

pd.json_normalize() offers different options. One possibility is to simply read all "children" data and then drop the columns that are not required. Also, after normalizing some columns have a prefix "columns." which needs to be removed.

import pandas as pd
df = pd.json_normalize(data['data']['attributes']['total']['children'])
df.drop(columns=['grouping', 'entity_id'], inplace=True)
df.columns = df.columns.str.replace(r'columns.', '')

Finally, the columns names need to be replaced with those in the "columns" data:

column_name_mapper = {column['key']: column['display_name'] for column in data['meta']['columns']}
df.rename(columns=column_name_mapper, inplace=True)

Note 2: There are some slight deviations from the expected structure you described. Most notably the word 'name' (with the row value "Apple Holdings Adv (748374923)") in the data frame header is not changed to 'Holding Account' as both terms are not found in the columns list. Some other values simply differ between the described JSON response and the expected structure.

edited Dec 7, 2021 at 11:02

answered Dec 5, 2021 at 20:40

rosa b.

2,1613 gold badges19 silver badges23 bronze badges

3 Comments

William Over a year ago

Thank you for your contribution. I created a function which returns the api response and stores in 'api_response' and then tried 'data = json.loads(api_response)'. However, I get a 'TypeError: the JSON object must be str, bytes or bytearray, not dict' message. Any ideas why?

rosa b. Over a year ago

json.loads() expects a str, bytes or bytearray... So I actually passed the JSON response as a string to the function (using triple quotes as it spans multiple lines). I clarified that part in the answer. This worked for testing. Still, in your specific context some other solution is probably more appropriate, depending on how you receive the json from the API call.

William Over a year ago

Thanks rosa - I misread I'm afraid. Thanks for the great explanation; I have learnt a lot!

Collectives™ on Stack Overflow

Read nested JSON into Pandas DataFrame

2 Answers 2

4 Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related