
Assume the following snippet of a JSON file, to be flattened in Python.

{
  "locations" : [ {
    "timestampMs" : "1549913792265",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549913286057",
      "activity" : [ {
        "type" : "STILL",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549913730454",
      "activity" : [ {
        "type" : "DRIVING",
        "confidence" : 100
      } ]
    } ]
  }, {
    "timestampMs" : "1549912693813",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549911547308",
      "activity" : [ {
        "type" : "ACTIVE",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549912330473",
      "activity" : [ {
        "type" : "BIKING",
        "confidence" : 100
      } ]
    } ]
  } ]
}

The goal is to turn it into a flattened dataframe like this:

location_id timestampMs ... verticalAccuracy activity_timestampMs activity_activity_type ...
1           1549913792265   13               1549913286057        "STILL"
1           1549913792265   13               1549913730454        "DRIVING"
etc.

How would one do so, given that the key 'activity' is repeated at different nesting levels?

1 Answer

Here is a solution using json_normalize, assuming the JSON snippet you posted is in a Python dictionary named d.
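As a preliminary step (not part of the original answer): if the snippet still lives in a file, the standard json module turns it into such a dictionary. The structure below is a minimal stand-in for the posted data:

```python
import json

# A minimal stand-in for the posted Location History structure
raw = '''
{"locations": [{"timestampMs": "1549913792265",
                "activity": [{"timestampMs": "1549913286057",
                              "activity": [{"type": "STILL", "confidence": 100}]}]}]}
'''
d = json.loads(raw)

# For a real file you would use json.load instead
# (the filename here is hypothetical):
# with open("Location History.json") as f:
#     d = json.load(f)
```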

import pandas as pd
# Note: in pandas >= 1.0, json_normalize is also available
# directly as pd.json_normalize
from pandas.io.json import json_normalize

# Build a list of paths to JSON fields that will end up as metadata
# in the final DataFrame
meta = list(d['locations'][0].keys())

# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# 'activity']

# Almost correct. We need to remove 'activity' and append
# the list ['activity', 'timestampMs'] to meta.
meta.remove('activity')
meta.append(['activity', 'timestampMs'])

# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# ['activity', 'timestampMs']]

# Use json_normalize on the list of dicts
# that lives at d['locations'], passing in
# the appropriate record path and metadata
# paths, and specifying the
# 'activity_activity_' record prefix.
json_normalize(d['locations'], 
               record_path=['activity', 'activity'], 
               meta=meta,
               record_prefix='activity_activity_')

   activity_activity_confidence activity_activity_type    timestampMs  latitudeE7  longitudeE7  accuracy  altitude  verticalAccuracy activity.timestampMs
0                           100                  STILL  1549913792265   323518421   -546166813        13         1                 2        1549913286057
1                           100                DRIVING  1549913792265   323518421   -546166813        13         1                 2        1549913730454
2                           100                 ACTIVE  1549912693813   323518421   -546166813        13         1                 2        1549911547308
3                           100                 BIKING  1549912693813   323518421   -546166813        13         1                 2        1549912330473
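A possible follow-up (my addition, not part of the original answer): the timestampMs fields are epoch milliseconds stored as strings, so they need an integer cast before pd.to_datetime can convert them. A sketch, assuming the normalized result is stored in a DataFrame named df:

```python
import pandas as pd

# A two-row stand-in for the normalized result above
df = pd.DataFrame({"timestampMs": ["1549913792265", "1549913730454"]})

# The timestamps are epoch milliseconds stored as strings,
# so cast to int64 before converting to datetimes
df["timestamp"] = pd.to_datetime(df["timestampMs"].astype("int64"), unit="ms")
```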

Edit

If the ['activity', 'activity'] record path is sometimes missing, the above code will throw an error. The following workaround should work for this specific case, but is brittle and might be unacceptably slow depending on the size of your input data:

# Create an example by deleting one of the 'activity' paths 
# from the original dict
del d['locations'][0]['activity']

pd.concat([json_normalize(x, 
                          record_path=['activity', 'activity'] 
                                      if 'activity' in x else None, 
                          meta=meta, 
                          record_prefix='activity_activity_') 
           for x in d['locations']], 
          axis=0, 
          ignore_index=True,
          sort=False)

   accuracy  altitude  latitudeE7  longitudeE7    timestampMs  verticalAccuracy  activity_activity_confidence activity_activity_type activity.timestampMs
0        13         1   323518421   -546166813  1549913792265                 2                           NaN                    NaN                  NaN
1        13         1   323518421   -546166813  1549912693813                 2                         100.0                 ACTIVE        1549911547308
2        13         1   323518421   -546166813  1549912693813                 2                         100.0                 BIKING        1549912330473
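An alternative sketch (my own workaround, not from the answer above, and only lightly hedged against other export shapes): pre-fill a placeholder record wherever the 'activity' key is missing, so a single json_normalize call suffices; passing errors='ignore' turns the missing meta fields into NaN instead of raising a KeyError.

```python
import pandas as pd

# Minimal stand-in: the first location has no 'activity' key at all
d = {"locations": [
    {"timestampMs": "1549913792265", "accuracy": 13},
    {"timestampMs": "1549912693813", "accuracy": 13,
     "activity": [{"timestampMs": "1549911547308",
                   "activity": [{"type": "ACTIVE", "confidence": 100}]}]},
]}

# Give locations without an 'activity' key one empty placeholder
# record, so the ['activity', 'activity'] record path exists everywhere
for loc in d["locations"]:
    loc.setdefault("activity", [{"activity": [{}]}])

df = pd.json_normalize(d["locations"],
                       record_path=["activity", "activity"],
                       meta=["timestampMs", "accuracy",
                             ["activity", "timestampMs"]],
                       errors="ignore",
                       record_prefix="activity_activity_")
```

The placeholder row comes through with NaN in every activity_activity_* column, matching the concat-based output above.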

3 Comments

Thank you @Peter! Marked this as right as it answers my question and works! However, when working with the full dataset I get KeyError: 'activity'. Currently investigating what the problem could be.
The problem was easy to spot when I imported the JSON file on R. Apparently, not all 'locations' have 'activity' within them (that's the higher level 'activity' - which automatically includes the lower level 'activity'). Any ideas?
@Deuterium, thanks for accepting - interesting follow-up problem. I have edited a possible solution into the bottom of my answer, but am hitting the limits of my experience with JSON.
