
I have a requirement to flatten JSON into keys and values using PySpark/Python, so that all the nested keys go into one column and the corresponding values go into another column.

Note also that the input JSON is dynamic, so in the sample below there could be multiple subkeys and child keys. I would appreciate any help with this.

Sample JSON input:

{
    "key1": {
        "subkey1": "1.1",
        "subkey2": "1.2"
    },
    "key2": {
        "subkey1": "2.1",
        "subkey2": "2.2",
        "subkey3": {"child3": {"subchild3": "2.3.3.3"}}
    },
    "key3": {
        "subkey1": "3.1",
        "subkey2": "3.2"
    }
}

Expected output (flattening only key2 from the nested keys):

ID  key                            value
1   key2.subkey1                   2.1
2   key2.subkey2                   2.2
3   key2.subkey3.child3.subchild3  2.3.3.3
  • Try this Commented Feb 2, 2023 at 13:28
  • your json is malformed, it should be corrected Commented Feb 2, 2023 at 13:30
  • check out pandas.json_normalize(data).melt() Commented Feb 2, 2023 at 16:11
  • Is the result you seek a pandas.DataFrame() or a python list of lists or something else? Commented Feb 2, 2023 at 16:17
  • @JonSG I did try with normalize but got an error: 'Column' object is not callable. input_df = spark.read.option("multiline","true").json("filename"); df = pd.json_normalize(input_df) Commented Feb 3, 2023 at 8:27
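Following up on the json_normalize suggestion above: pandas.json_normalize expects a plain Python dict (or a list of records), not a Spark DataFrame, which is why passing the result of spark.read.option("multiline","true").json("filename") fails. A minimal sketch using the sample input (json.load(open(filename)) works the same way for a file):

```python
import json

import pandas as pd

str_data = """
{
    "key1": {"subkey1": "1.1", "subkey2": "1.2"},
    "key2": {
        "subkey1": "2.1",
        "subkey2": "2.2",
        "subkey3": {"child3": {"subchild3": "2.3.3.3"}}
    },
    "key3": {"subkey1": "3.1", "subkey2": "3.2"}
}
"""
data = json.loads(str_data)  # a plain Python dict, not a Spark DataFrame

# json_normalize flattens nested keys into dotted column names;
# melt then turns those columns into (key, value) rows.
df = pd.json_normalize(data, sep='.').melt(var_name='key', value_name='value')
df.insert(0, 'ID', range(1, len(df) + 1))
print(df)
```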

1 Answer


The following code does what you need:

data = {
    "key1": {
        "subkey1": "1.1",
        "subkey2": "1.2"
    },
    "key2": {
        "subkey1": "2.1",
        "subkey2": "2.2",
        "subkey3": {
            "child3": {
                "subchild3": "2.3.3.3"
            }
        }
    }
}
print(data)

ID      = 0
lstRows = []

def getTableRow(data, key):
    """Recursively walk the nested dict, appending one row per leaf value."""
    global lstRows, ID
    for k, v in data.items():
        if isinstance(v, dict):
            # Descend into the nested dict, extending the dotted key path:
            if key == '':
                getTableRow(v, k)
            else:
                getTableRow(v, key + '.' + k)
        else:
            ID += 1
            lstRows.append({"ID": ID, "key": key + '.' + k, "value": v})

getTableRow(data, '')
print(lstRows)

# Re-shape the list of row dicts into a dict of columns:
dctTable = {"ID": [], "key": [], "value": []}
for dct in lstRows:
    dctTable["ID"].append(dct["ID"])
    dctTable["key"].append(dct["key"])
    dctTable["value"].append(dct["value"])
print(dctTable)

import pandas as pd
df = pd.DataFrame.from_dict(dctTable)
# df = pd.DataFrame(lstRows)  # equivalent to .from_dict() above
# df = pd.DataFrame(dctTable) # equivalent to .from_dict() above
print(df)

prints

{'key1': {'subkey1': '1.1', 'subkey2': '1.2'}, 'key2': {'subkey1': '2.1', 'subkey2': '2.2', 'subkey3': {'child3': {'subchild3': '2.3.3.3'}}}}
[{'ID': 1, 'key': 'key1.subkey1', 'value': '1.1'}, {'ID': 2, 'key': 'key1.subkey2', 'value': '1.2'}, {'ID': 3, 'key': 'key2.subkey1', 'value': '2.1'}, {'ID': 4, 'key': 'key2.subkey2', 'value': '2.2'}, {'ID': 5, 'key': 'key2.subkey3.child3.subchild3', 'value': '2.3.3.3'}]
{'ID': [1, 2, 3, 4, 5], 'key': ['key1.subkey1', 'key1.subkey2', 'key2.subkey1', 'key2.subkey2', 'key2.subkey3.child3.subchild3'], 'value': ['1.1', '1.2', '2.1', '2.2', '2.3.3.3']}
   ID                            key    value
0   1                   key1.subkey1      1.1
1   2                   key1.subkey2      1.2
2   3                   key2.subkey1      2.1
3   4                   key2.subkey2      2.2
4   5  key2.subkey3.child3.subchild3  2.3.3.3

It uses a recursive function call to create the rows of the resulting table.

As I don't use PySpark, the table shown was created with Pandas.
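Since the expected output in the question covers only key2, one option is to start the recursion directly on that subtree and pass 'key2' as the key prefix. A self-contained sketch using a local variant of the function above:

```python
data = {
    "key1": {"subkey1": "1.1", "subkey2": "1.2"},
    "key2": {"subkey1": "2.1", "subkey2": "2.2",
             "subkey3": {"child3": {"subchild3": "2.3.3.3"}}},
}

lstRows = []

def getTableRow(d, key):
    # Same shape as the function above, but the ID is derived from the
    # list length instead of a global counter.
    for k, v in d.items():
        if isinstance(v, dict):
            getTableRow(v, key + '.' + k if key else k)
        else:
            lstRows.append({"ID": len(lstRows) + 1,
                            "key": key + '.' + k, "value": v})

# Start the recursion directly on the key2 subtree with 'key2' as prefix:
getTableRow(data["key2"], "key2")
print(lstRows)
```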

See also "Flatten nested dictionaries, compressing keys" for a general and flexible way of flattening a nested dictionary that also handles list values.
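For reference, a compact sketch along the lines of that linked answer (dictionaries only; extending it to list values is discussed there):

```python
def flatten_dict(d, parent_key='', sep='.'):
    """Recursively flatten a nested dict into a flat dict with dotted keys."""
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

data = {"key2": {"subkey1": "2.1",
                 "subkey3": {"child3": {"subchild3": "2.3.3.3"}}}}
print(flatten_dict(data))
# {'key2.subkey1': '2.1', 'key2.subkey3.child3.subchild3': '2.3.3.3'}
```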


See the code below for further instructions and the explanations requested in the comments:

# ======================================================================
# You can read the JSON file content directly into dct_data using
# Python's json.load(fp) function (fp = open(filename)).

# This code starts with the JSON file content stored in str_data:
str_data = """
  { "key1": {
         "subkey1":"1.1",
         "subkey2":"1.2"
            },

    "key2": {
        "subkey1":"2.1",
        "subkey2":"2.2",
        "subkey3": {
                    "child3": { 
                               "subchild3":"2.3.3.3" 
                              } 
                   }
          }
  }"""
# print(str_data)

# Let's create a Python dictionary from the json data string:  
import json
dct_data = json.loads(str_data) # or = json.load(open(filename))
print(dct_data)

# Here is the function for flattening the dictionary dct_data, returning
# a dictionary with the flattened dct_data content:
def flattenNestedDictionary(dct_data, key='', lstRows=None):
    if lstRows is None:  # avoid the mutable-default-argument pitfall
        lstRows = []
    for k, v in dct_data.items():
        full_key = k if key == '' else key + '.' + k
        if isinstance(v, dict):
            flattenNestedDictionary(v, full_key, lstRows)
        else:
            # Deriving the ID from the current list length keeps the
            # numbering consistent across the recursive calls:
            lstRows.append({"ID": len(lstRows) + 1, "key": full_key, "value": v})

    # key=='' only in the outermost call: lstRows now has all the required
    # content, so let's create the flattened dictionary:
    if key == '':
        print('lstRows:', lstRows)
        dct_flattened_json = {"ID": [], "key": [], "value": []}
        for dct in lstRows:
            dct_flattened_json["ID"].append(dct["ID"])
            dct_flattened_json["key"].append(dct["key"])
            dct_flattened_json["value"].append(dct["value"])
        print('#', dct_flattened_json)
        return dct_flattened_json

dct_flattened_json = flattenNestedDictionary(dct_data)

# Let's create a valid json data string out of the dictionary: 
str_flattened_json = json.dumps(dct_flattened_json)
print('>', str_flattened_json)

# You can now write the str_flattened_json string to a file and load the
# new JSON file with flattened data into a Spark DataFrame, or load the
# str_flattened_json string into a Spark DataFrame directly.

7 Comments

The above code works if the input is provided within the code, but not when I read it from a file. I am using spark.read.option("multiline","true").json("filename"). Error: 'DataFrame' object has no attribute 'item'. I also tried 'iterrows' and it errored again
If you don't provide the JSON file to start with and try to use a Spark DataFrame in code expecting a Python dictionary, no wonder you get errors. Get a dictionary from the JSON file to obtain what you posted in your question as sample JSON input. Or post an actual example of a JSON file if you don't know how to get a JSON file's content into a Python dictionary.
Thanks for your assistance, but one more question about the example above: how do I get data related to key2 only instead of key1? If I try for k, v in data['key1'].items(): it works, but it does not work for key2
What about deleting data['key1'] before calling the getTableRow function? The key2 values need the recursive call within the function to handle the nested data, whereas the key1 values are not further nested. (P.S. To be honest, I don't understand what you are asking for; what does not work for key2?)
See my updated answer for additional explanation and generally the same code as above, rewritten with the purpose of making its usage easier to understand.
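Regarding the key2-only question in this thread: once the flattened table exists, the rows can also simply be filtered on the key prefix. A sketch in Pandas, assuming a df shaped like the one printed in the answer:

```python
import pandas as pd

# Assuming df is the flattened table produced by the answer above:
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "key": ["key1.subkey1", "key1.subkey2", "key2.subkey1",
            "key2.subkey2", "key2.subkey3.child3.subchild3"],
    "value": ["1.1", "1.2", "2.1", "2.2", "2.3.3.3"],
})

# Keep only the rows whose key path starts with 'key2.' and renumber the IDs:
df_key2 = df[df["key"].str.startswith("key2.")].reset_index(drop=True)
df_key2["ID"] = range(1, len(df_key2) + 1)
print(df_key2)
```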
