
Assume the following snippet of a JSON file, to be flattened in Python.

{
  "locations" : [ {
    "timestampMs" : "1549913792265",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549913286057",
      "activity" : [ {
        "type" : "STILL",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549913730454",
      "activity" : [ {
        "type" : "DRIVING",
        "confidence" : 100
      } ]
    } ]
  }, {
    "timestampMs" : "1549912693813",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549911547308",
      "activity" : [ {
        "type" : "ACTIVE",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549912330473",
      "activity" : [ {
        "type" : "BIKING",
        "confidence" : 100
      } ]
    } ]
  } ]
}

The goal is to turn it into a flattened dataframe like this:

location_id timestampMs ... verticalAccuracy activity_timestampMs activity_activity_type ...
1           1549913792265   13               1549913286057        "STILL"
1           1549913792265   13               1549913730454        "DRIVING"
etc.

How would one do so, given that the key 'activity' is repeated at different nesting levels?

1 Answer

Here is a solution using json_normalize, assuming the JSON snippet you posted is in a Python dictionary named d.
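As a preliminary step (not part of the original answer): if the snippet still lives in a file, the standard json module turns it into such a dictionary. The structure below is a minimal stand-in for the posted data:

```python
import json

# A minimal stand-in for the posted Location History structure
raw = '''
{"locations": [{"timestampMs": "1549913792265",
                "activity": [{"timestampMs": "1549913286057",
                              "activity": [{"type": "STILL", "confidence": 100}]}]}]}
'''
d = json.loads(raw)

# For a real file you would use json.load instead
# (the filename here is hypothetical):
# with open("Location History.json") as f:
#     d = json.load(f)
```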

import pandas as pd
# Note: in pandas >= 1.0, json_normalize is also available
# directly as pd.json_normalize
from pandas.io.json import json_normalize

# Build a list of paths to JSON fields that will end up as metadata
# in the final DataFrame
meta = list(d['locations'][0].keys())

# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# 'activity']

# Almost correct. We need to remove 'activity' and append
# the list ['activity', 'timestampMs'] to meta.
meta.remove('activity')
meta.append(['activity', 'timestampMs'])

# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# ['activity', 'timestampMs']]

# Use json_normalize on the list of dicts
# that lives at d['locations'], passing in
# the appropriate record path and metadata
# paths, and specifying the
# 'activity_activity_' record prefix.
json_normalize(d['locations'], 
               record_path=['activity', 'activity'], 
               meta=meta,
               record_prefix='activity_activity_')

   activity_activity_confidence activity_activity_type    timestampMs  latitudeE7  longitudeE7  accuracy  altitude  verticalAccuracy activity.timestampMs
0                           100                  STILL  1549913792265   323518421   -546166813        13         1                 2        1549913286057
1                           100                DRIVING  1549913792265   323518421   -546166813        13         1                 2        1549913730454
2                           100                 ACTIVE  1549912693813   323518421   -546166813        13         1                 2        1549911547308
3                           100                 BIKING  1549912693813   323518421   -546166813        13         1                 2        1549912330473
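A possible follow-up (my addition, not part of the original answer): the timestampMs fields are epoch milliseconds stored as strings, so they need an integer cast before pd.to_datetime can convert them. A sketch, assuming the normalized result is stored in a DataFrame named df:

```python
import pandas as pd

# A two-row stand-in for the normalized result above
df = pd.DataFrame({"timestampMs": ["1549913792265", "1549913730454"]})

# The timestamps are epoch milliseconds stored as strings,
# so cast to int64 before converting to datetimes
df["timestamp"] = pd.to_datetime(df["timestampMs"].astype("int64"), unit="ms")
```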

Edit

If the ['activity', 'activity'] record path is sometimes missing, the above code will throw an error. The following workaround should work for this specific case, but is brittle and might be unacceptably slow depending on the size of your input data:

# Create an example by deleting one of the 'activity' paths 
# from the original dict
del d['locations'][0]['activity']

pd.concat([json_normalize(x, 
                          record_path=['activity', 'activity'] 
                                      if 'activity' in x else None, 
                          meta=meta, 
                          record_prefix='activity_activity_') 
           for x in d['locations']], 
          axis=0, 
          ignore_index=True,
          sort=False)

   accuracy  altitude  latitudeE7  longitudeE7    timestampMs  verticalAccuracy  activity_activity_confidence activity_activity_type activity.timestampMs
0        13         1   323518421   -546166813  1549913792265                 2                           NaN                    NaN                  NaN
1        13         1   323518421   -546166813  1549912693813                 2                         100.0                 ACTIVE        1549911547308
2        13         1   323518421   -546166813  1549912693813                 2                         100.0                 BIKING        1549912330473
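An alternative sketch (my own workaround, not from the answer above, and only lightly hedged against other export shapes): pre-fill a placeholder record wherever the 'activity' key is missing, so a single json_normalize call suffices; passing errors='ignore' turns the missing meta fields into NaN instead of raising a KeyError.

```python
import pandas as pd

# Minimal stand-in: the first location has no 'activity' key at all
d = {"locations": [
    {"timestampMs": "1549913792265", "accuracy": 13},
    {"timestampMs": "1549912693813", "accuracy": 13,
     "activity": [{"timestampMs": "1549911547308",
                   "activity": [{"type": "ACTIVE", "confidence": 100}]}]},
]}

# Give locations without an 'activity' key one empty placeholder
# record, so the ['activity', 'activity'] record path exists everywhere
for loc in d["locations"]:
    loc.setdefault("activity", [{"activity": [{}]}])

df = pd.json_normalize(d["locations"],
                       record_path=["activity", "activity"],
                       meta=["timestampMs", "accuracy",
                             ["activity", "timestampMs"]],
                       errors="ignore",
                       record_prefix="activity_activity_")
```

The placeholder row comes through with NaN in every activity_activity_* column, matching the concat-based output above.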

3 Comments

Thank you @Peter! Marked this as right as it answers my question and works! However, when working with the full dataset I get KeyError: 'activity'. Currently investigating what the problem could be.
The problem was easy to spot when I imported the JSON file on R. Apparently, not all 'locations' have 'activity' within them (that's the higher level 'activity' - which automatically includes the lower level 'activity'). Any ideas?
@Deuterium, thanks for accepting - interesting follow-up problem. I have edited a possible solution into the bottom of my answer, but am hitting the limits of my experience with JSON.
