Flattening nested JSON including an embedded array in Python using Pandas

Question

I have a JSON array from a mongoexport containing data from the Beddit sleep tracker. Below is one of the documents, truncated (some unneeded detail removed).

    {
        "user" : "xxx",
        "provider" : "beddit",
        "date" : ISODate("2016-11-30T23:00:00.000Z"),
        "data" : [ 
            {
                "end_timestamp" : 1480570804.26226,
                "properties" : {
                    "sleep_efficiency" : 0.8772404,
                    "resting_heart_rate" : 67.67578,
                    "short_term_resting_heart_rate" : 61.36963,
                    "activity_index" : 50.51958,
                    "average_respiration_rate" : 16.25667,
                    "total_sleep_score" : 64
                },
                "date" : "2016-12-01",
                "session_range_start" : 1480545636.55059,
                "start_timestamp" : 1480545636.55059,
                "session_range_end" : 1480570804.26226,
                "tags" : [ 
                    "not_enough_sleep", 
                    "long_sleep_latency"
                ],
                "updated" : 1480570805.25201
            }
        ],
        "__v" : 0
    }

Several related questions like this and this do not seem to work for the data structure above. As recommended in other related questions I am trying to stay away from looping over each row for performance reasons (the full dataset is ~150MB). How would I flatten out the "data"-key with json_normalize so that each key is at the top-level? I would prefer one DataFrame where e.g. total_sleep_score is a column.

Any help is much appreciated! Even though I know how to 'prepare' the data using JavaScript, I would like to be able to understand and do it using Python.

edit (request from comment to show preferred structure):

    {
        "user" : "xxx",
        "provider" : "beddit",
        "date" : ISODate("2016-11-30T23:00:00.000Z"),
        "end_timestamp" : 1480570804.26226,
        "properties.sleep_efficiency" : 0.8772404,
        "properties.resting_heart_rate" : 67.67578,
        "properties.short_term_resting_heart_rate" : 61.36963,
        "properties.activity_index" : 50.51958,
        "properties.average_respiration_rate" : 16.25667,
        "properties.total_sleep_score" : 64,
        "date" : "2016-12-01",
        "session_range_start" : 1480545636.55059,
        "start_timestamp" : 1480545636.55059,
        "session_range_end" : 1480570804.26226,
        "updated" : 1480570805.25201,
        "__v" : 0
    }

The 'properties.' prefix is not necessary but would be nice.

Can you provide an example of what the data should look like?
– kiecodes, Commented Feb 2, 2017 at 13:40
If the format of this JSON object is always the same, you could simply convert the JSON to a string and manipulate the string, if you don't want to loop through the object. But I am not sure whether that performs better.
– kiecodes, Commented Feb 2, 2017 at 13:51
The parameters within 'properties' are not always the same. For some nights certain parameters cannot be computed, so those are omitted from the data received through their API.
– martwetzels, Commented Feb 2, 2017 at 13:53
Then I don't see a way to transform the JSON without iterating through it in some way.
– kiecodes, Commented Feb 2, 2017 at 13:55

Rakesh Kumar · Accepted Answer · 2017-02-02 15:13:42Z

Try this algorithm for flattening:

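def flattenPattern(pattern):
    """Flatten nested dicts into one dict with dot-separated keys."""
    newPattern = {}
    # For a list, only its first element is flattened.
    if isinstance(pattern, list):
        pattern = pattern[0] if pattern else {}

    # Only dicts have .items(); strings (including Python 2 unicode) are skipped.
    if isinstance(pattern, dict):
        for key, value in pattern.items():
            if isinstance(value, (list, dict)):
                returnedData = flattenPattern(value)
                for i, j in returnedData.items():
                    if key == "data":
                        # Lift the fields under "data" to the top level.
                        newPattern[i] = j
                    else:
                        newPattern[key + "." + i] = j
            else:
                newPattern[key] = value

    return newPattern


# dictFromJson is the dict obtained by parsing one document, e.g. with json.loads.
print(flattenPattern(dictFromJson))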


Output:
{  
  'session_range_start':1480545636.55059,
  'start_timestamp':1480545636.55059,
  'properties.average_respiration_rate':16.25667,
  'session_range_end':1480570804.26226,
  'properties.resting_heart_rate':67.67578,
  'properties.short_term_resting_heart_rate':61.36963,
  'updated':1480570805.25201,
  'properties.total_sleep_score':64,
  'properties.activity_index':50.51958,
  '__v':0,
  'user':'xxx',
  'provider':'beddit',
  'date':'2016-12-01',
  'properties.sleep_efficiency':0.8772404,
  'end_timestamp':1480570804.26226
}
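The function above handles one document and keeps only the first element of each list. As a rough sketch of how the same idea could be applied to a whole export, the code below emits one flat row per entry of "data". The helper names flatten_dict and rows_from_documents and the sample documents are made up for illustration; they are not part of the answer.

```python
def flatten_dict(d, parent_key="", sep="."):
    """Flatten nested dicts; non-dict values (including lists) are kept as-is."""
    flat = {}
    for key, value in d.items():
        name = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_dict(value, name, sep))
        else:
            flat[name] = value
    return flat


def rows_from_documents(documents):
    """Yield one flat row per entry of each document's "data" list."""
    for doc in documents:
        # Top-level fields shared by every entry of this document.
        meta = {k: v for k, v in doc.items() if k != "data"}
        for entry in doc.get("data", []):
            row = dict(meta)
            # Entry keys win on clashes such as "date".
            row.update(flatten_dict(entry))
            yield row


# Hypothetical sample: one document holding two nights with different properties.
sample_docs = [
    {"user": "xxx", "provider": "beddit", "__v": 0,
     "data": [
         {"date": "2016-12-01", "properties": {"total_sleep_score": 64}},
         {"date": "2016-12-02", "properties": {"resting_heart_rate": 67.7}},
     ]},
]
rows = list(rows_from_documents(sample_docs))
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame, which fills keys missing from a row with NaN. This still iterates over the documents in Python, so it may be slow on a large export.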


2 Comments

martwetzels Over a year ago

This throws an error for me: AttributeError: 'unicode' object has no attribute 'items'.

Rakesh Kumar Over a year ago

@martwetzels, I think you are passing your JSON string directly to this function. First load the JSON into a variable, e.g. x = json.loads(your_json_string), and then it will work. Note my variable name dictFromJson: it indicates a dict object created from JSON.
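The difference between the raw JSON text and the parsed object can be sketched as follows. The string raw is a made-up, shortened document; note that the ISODate(...) notation shown in the question is not valid JSON, so it is left out here.

```python
import json

# Hypothetical, shortened document as raw JSON text.
raw = '{"user": "xxx", "data": [{"properties": {"total_sleep_score": 64}}]}'

# json.loads turns the text into nested dicts and lists.
doc = json.loads(raw)

print(type(raw).__name__)  # str
print(type(doc).__name__)  # dict
print(doc["data"][0]["properties"]["total_sleep_score"])  # 64
```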

martwetzels · Accepted Answer · 2017-02-04 16:39:22Z

Although not explicitly what I asked for, the following worked for me so far:

Step 1

Normalize the data records with json_normalize on the original dataset (the raw list of documents, not a Pandas DataFrame) and prefix the resulting columns.

beddit_data = pd.io.json.json_normalize(beddit, record_path='data', record_prefix='data.', meta='_id')
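A minimal sketch of this step on made-up sample data. It assumes pandas 1.0 or later, where the function is available under the top-level name pd.json_normalize; the "_id" values and the two documents are invented for illustration.

```python
import pandas as pd

# Hypothetical sample: two documents, each with one entry in "data".
beddit = [
    {"_id": "a1", "user": "xxx",
     "data": [{"date": "2016-12-01",
               "properties": {"total_sleep_score": 64, "sleep_efficiency": 0.88}}]},
    {"_id": "b2", "user": "xxx",
     "data": [{"date": "2016-12-02",
               "properties": {"total_sleep_score": 71}}]},
]

# One row per entry of "data"; "_id" is carried along from the parent document.
beddit_data = pd.json_normalize(
    beddit, record_path="data", record_prefix="data.", meta="_id"
)
print(beddit_data.columns.tolist())
```

Depending on the pandas version, the nested properties dicts may already come out as separate dotted columns here, in which case step 2 would not be needed.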

Step 2

The data.properties column is a Series of dicts, so it can be expanded into separate columns with .apply(pd.Series).

beddit_data_properties = beddit_data['data.properties'].apply(pd.Series)
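A small sketch of what this expansion does, using a made-up Series of dicts in which the second night lacks one property. Building the DataFrame directly from the list of dicts is shown as an alternative that is commonly faster on large data; that variant is a suggestion, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for beddit_data['data.properties'].
properties = pd.Series([
    {"total_sleep_score": 64, "sleep_efficiency": 0.88},
    {"total_sleep_score": 71},
])

# Each dict becomes a row; keys become columns, missing keys become NaN.
expanded = properties.apply(pd.Series)

# Alternative: build the DataFrame from the list of dicts in one go.
from_list = pd.DataFrame(properties.tolist(), index=properties.index)

print(expanded)
```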

Step 3

The final step is to combine both DataFrames. In step 1, I kept meta='_id' so that the result can later be merged with the original DataFrame from Beddit. I have not included that merge here yet, because I can already work with the results so far.

beddit_final = pd.concat([beddit_data_properties[:], beddit_data[:]], axis=1)
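The combination, plus the later merge on _id mentioned above, might look like the sketch below. All frames are small made-up stand-ins, and beddit_meta is a hypothetical name for a DataFrame holding the top-level fields of each document.

```python
import pandas as pd

# Hypothetical stand-ins for the results of steps 1 and 2.
beddit_data = pd.DataFrame(
    {"data.date": ["2016-12-01", "2016-12-02"], "_id": ["a1", "b2"]}
)
beddit_data_properties = pd.DataFrame({"total_sleep_score": [64, 71]})

# axis=1 places the frames side by side, aligned on their shared row index.
beddit_final = pd.concat([beddit_data_properties, beddit_data], axis=1)

# Hypothetical frame with the top-level fields, keyed by the same _id.
beddit_meta = pd.DataFrame({"_id": ["a1", "b2"], "user": ["xxx", "xxx"]})
merged = beddit_final.merge(beddit_meta, on="_id", how="left")
print(merged)
```

Because pd.concat with axis=1 aligns rows by index, both frames need to share the same index, which holds when the second frame is derived from a column of the first.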

If anyone is interested, I can share the final Jupyter Notebook when it is ready :)
