2

I am trying to parse a nested json.

I've got the dataset stored here so that you can see what I'm seeing specifically if you want: https://mega.nz/file/YWNSRBjK#V9DpoY5LSp-VL8Mnu7NEfNf3FhDOCj9FHBiTQ4KHEa8

I am attempting to parse this using pandas json_normalize function. Below is what my code looks like in it's entirety.

import gzip   
import shutil
import json
import pandas as pd

with gzip.open('testjson.json.gz', 'rb') as f_in:
    with open('unzipped_json.json', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

f = open('unzipped_json.json')
data = json.load(f)
keys = data.keys()
keys_string = list(keys)
 
### In Network
in_network_df = pd.json_normalize(data['in_network'])

### Negotiated Rates
negotiated_rates_df = pd.json_normalize(data=data['in_network'],
                                        record_path=("negotiated_rates"))
negotiated_rates_df = negotiated_rates_df.explode('provider_references')
negotiated_rates_df = negotiated_rates_df.explode('negotiated_prices')

### Negotiated Prices
negotiated_prices_df = pd.json_normalize(data=data['in_network'],
                                         meta=[
                                             #['negotiated_rates','provider_references'],
                                            # ['negotiation_arrangement', 'name','billing_code_type','billing_code','description']
                                             ],
                                        record_path=['negotiated_rates','negotiated_prices'],
                                        errors='ignore')
negotiated_prices_df = negotiated_prices_df.explode('service_code')

### Provider References
provider_references_df = pd.json_normalize(data['provider_references'])
provider_references_test = provider_references_df.explode('provider_groups')

### Provider Groups
provider_groups = pd.json_normalize(data=data['provider_references'],
                                    meta=['provider_group_id'],
                                        record_path=("provider_groups"))
provider_groups = provider_groups.explode('npi')

I am specifically having trouble with the negotiated prices part of this json object. I am trying to add in some data from parent objects, but it is giving me an error. To point out specifically what I would like to do here it is below.

negotiated_prices_df = pd.json_normalize(data=data['in_network'],
                                         meta=['provider_references'],
                                        record_path=['negotiated_rates','negotiated_prices'],
                                        errors='ignore')

When I try to do this I get ValueError: operands could not be broadcast together with shape (74607,) (24869,)

Can anyone help me understand what is going on here?

Edit: Trying to provide some more context in case someone is not wanting to open my file... Here is one spot showing the problematic portion I'm dealing with in the JSON. I can't seem to get the provider_references to attach to any of the child objects.

"provider_references":[261, 398, 799],"negotiated_prices":[{"negotiated_type": "fee schedule","negotiated_rate": 296.00,"expiration_date": "2023-06-30","service_code": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13",

1 Answer 1

2
+50

I think the code that you want looks like this:

with open('unzipped_json.json') as f:
    data = json.load(f)

negotiated_rates_and_prices_df = pd.json_normalize(
    data["in_network"],
    record_path=["negotiated_rates", ["negotiated_prices"]],
    meta=[
        "negotiation_arrangement",
        "name",
        "billing_code_type",
        "billing_code_type_version",
        "billing_code",
        "description",
        ["negotiated_rates", "provider_references"],
    ],
)

That takes care of the in_network part of the JSON. The trick is that within the metadata path you want to put the columns which are not nested in a regular list, and the nested ones in the order of nesting (ie ["negotiated_rates", "provider_references"]). There's a similar example in the docs here.

Then for the other nested part of the JSON you can do this:

provider_references_df = pd.json_normalize(
    data["provider_references"], "provider_groups", "provider_group_id"
)

And that takes care of the whole thing.

Sign up to request clarification or add additional context in comments.

6 Comments

I was actually struggling to understand why simply referring provider_references in meta doesn't work - you're supposed to specify the full nesting path. Your answer clarified it. Thank you and +1.
I am getting the same error I was before "operands could not be broadcast together with shape..." But this ran and produced the expected output on your end?
Yes. This is on pandas version 1.4.3, which version are you using? Are you sure you haven't overridden a variable like data with something else? eg maybe prefix with open('unzipped_json.json') as f: data = json.load(f)
Not overriding anything as far as I can tell. I am on 1.1.3 so I wonder if that's the issue, seems kind of strange to me but I'll have to do some testing.
Yeah that's on me, sorry about the confusion. Just finishing up my testing to make sure this is working as expected now and will award bounty. Thank you for the help!
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.