0

I have some json with multiple blocks with this format (using only one here to make it simple, so in this example the dataframe would have only one line):

{
      "A": 1,
      "B": {
        "C": [
          {
            "D": 2,
            "E": 3
          }
        ],
        "F": {
          "G": 4,
          "H": 5
        }
      }
}

And I want to create a DataFrame like this:

  A B.C.D B.C.E B.F.G B.F.H
1 1   2     3     4     5

When I try to do

with open('some.json') as file:
    data = json.load(file)

df = pd.json_normalize(data)

I get something like this:

  A        B.C          B.F.G  B.F.H
1 1  [{"D":2,"E":3}]      4      5

So... I can get the column B.C, break it into the B.C.D and B.C.E

df2 = pd.DataFrame(df['B.C'].tolist())
df3 = df2[0].apply(pd.Series) #The [0] here is the only way to work when I have more than one block in the json

Them later concatenate with the previous dataframe (and removing the B.C column) but it feels ugly and since I'm doing this a LOT I was thinking if there's a cleaner/faster way.

Well, thanks in advance!

1

3 Answers 3

2

I guess you could write a recursive solution to preprocess the data. There might be some existing inbuilt solution but I am not aware of it. You can test the performance of the following:

def clean(data):
    clean_data = {}
    def parse(data, key=''):
        if isinstance(data, list):
            for elem in data:
                parse(elem,key=key)
        else:
            for k, v in data.items():
                if isinstance(v, (dict, list)):
                    parse(v, key=key+'.'+k)
                else:
                    clean_data[(key+'.'+k).lstrip('.')] = v
    parse(data)
    return [clean_data]

data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}}
print(pd.DataFrame(clean(data)))

Output:

   A  B.C.D  B.C.E  B.F.G  B.F.H
0  1      2      3      4      5
Sign up to request clarification or add additional context in comments.

Comments

1

Write yourself a recursive function:

def get_values(iterable, root=""):
    if isinstance(iterable, dict):
        for key, values in iterable.items():
            if isinstance(values, (dict, list)):
                new_root = "{}.{}".format(root, key) if root else key
                yield from get_values(values, new_root)
            else:
                absolute_key = "{}.{}".format(root, key) if root else key
                yield absolute_key, values
    elif isinstance(iterable, list):
        for dct in iterable:
            yield from get_values(dct, root)

result = [item for item in get_values(data)]
print(result)

Which yields

[('A', 1), ('B.C.D', 2), ('B.C.E', 3), ('B.F.G', 4), ('B.F.H', 5)]

To transform it into a DataFrame, use:

result = dict([item for item in get_values(data)])

import pandas as pd
df = pd.DataFrame(result, index=[0])
print(df)

Which then yields

   A  B.C.D  B.C.E  B.F.G  B.F.H
0  1      2      3      4      5

Comments

1

You should checkout flatten JSON. It's the best way to flatten JSON's with multiple record paths

import flatten_json

with open('1.json', 'r+') as f:
    data = json.load(f)
dic_flattened = flatten_json.flatten(data)
df = pd.json_normalize(dic_flattened)
print(df)

   A  B_C_0_D  B_C_0_E  B_F_G  B_F_H
0  1        2        3      4      5

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.