Pandas Dataframe to JSON Hierarchy

Question

I have reviewed all the other questions on SO related to this challenge, and attempted the implementations they suggest, but have yet to reach a solution.

Question: how do I convert employee and supervisor pairs into a hierarchical JSON structure to be used for a D3 visualization? There are an unknown number of levels, so it has to be dynamic.

I have a dataframe with five columns (yes, I realize this isn't the actual hierarchy of The Office):

  Employee_FN Employee_LN Supervisor_FN Supervisor_LN  Level
0     Michael       Scott          None          None      0
1         Jim     Halpert       Michael         Scott      1
2      Dwight     Schrute       Michael         Scott      1
3     Stanley      Hudson           Jim       Halpert      2
4         Pam     Beasley           Jim       Halpert      2
5        Ryan      Howard           Pam       Beasley      3
6       Kelly      Kapoor          Ryan        Howard      4
7    Meredith      Palmer          Ryan        Howard      4

Desired Output Snapshot:

{
  "Employee_FN": "Michael",
  "Employee_LN": "Scott",
  "Level": "0",
  "Reports": [{
        "Employee_FN": "Jim",
        "Employee_LN": "Halpert",
        "Level": "1",
        "Reports": [{
              "Employee_FN": "Stanley",
              "Employee_LN": "Hudson",
              "Level": "2"
            }, {
              "Employee_FN": "Pam",
              "Employee_LN": "Beasley",
              "Level": "2"
            }]
        }]
}

Current State:

import json

# 'records' is the full name of the orient that older pandas also accepted as 'r'
j = (df.groupby(['Level','Employee_FN','Employee_LN'], as_index=False)
             .apply(lambda x: x[['Level','Employee_FN','Employee_LN']].to_dict('records'))
             .reset_index()
             .rename(columns={0:'Reports'})
             .to_json(orient='records'))

print(json.dumps(json.loads(j), indent=2, sort_keys=True))

Current Output:

[
  {
    "Employee_FN": "Michael",
    "Employee_LN": "Scott",
    "Level": 0,
    "Reports": [
      {
        "Employee_FN": "Michael",
        "Employee_LN": "Scott",
        "Level": 0
      }
    ]
  },
  {
    "Employee_FN": "Dwight",
    "Employee_LN": "Schrute",
    "Level": 1,
    "Reports": [
      {
        "Employee_FN": "Dwight",
        "Employee_LN": "Schrute",
        "Level": 1
      }
    ]
  },
  {
    "Employee_FN": "Jim",
    "Employee_LN": "Halpert",
    "Level": 1,
    "Reports": [
      {
        "Employee_FN": "Jim",
        "Employee_LN": "Halpert",
        "Level": 1
      }
    ]
  },
  {
    "Employee_FN": "Pam",
    "Employee_LN": "Beasley",
    "Level": 2,
    "Reports": [
      {
        "Employee_FN": "Pam",
        "Employee_LN": "Beasley",
        "Level": 2
      }
    ]
  },
  {
    "Employee_FN": "Stanley",
    "Employee_LN": "Hudson",
    "Level": 2,
    "Reports": [
      {
        "Employee_FN": "Stanley",
        "Employee_LN": "Hudson",
        "Level": 2
      }
    ]
  },
  {
    "Employee_FN": "Ryan",
    "Employee_LN": "Howard",
    "Level": 3,
    "Reports": [
      {
        "Employee_FN": "Ryan",
        "Employee_LN": "Howard",
        "Level": 3
      }
    ]
  },
  {
    "Employee_FN": "Kelly",
    "Employee_LN": "Kapoor",
    "Level": 4,
    "Reports": [
      {
        "Employee_FN": "Kelly",
        "Employee_LN": "Kapoor",
        "Level": 4
      }
    ]
  },
  {
    "Employee_FN": "Meredith",
    "Employee_LN": "Palmer",
    "Level": 4,
    "Reports": [
      {
        "Employee_FN": "Meredith",
        "Employee_LN": "Palmer",
        "Level": 4
      }
    ]
  }
]

Problems:

1. Each person only has themselves as children
2. The whole JSON structure is wrapped in a list ([]) - I believe it has to be enclosed by {} to be readable
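
A note on the second point, as a minimal sketch: the outer [] makes the top-level value a JSON array (a Python list once parsed), which is valid JSON in its own right. A single {} root is only needed if the consumer expects one root object, which is assumed here for the D3 tree. The names flat_output, parsed, and single_root are illustrative, and the sample string is a shortened stand-in for the output above.

```python
import json

# Shortened stand-in for the flat output shown above (illustrative only)
flat_output = '[{"Employee_FN": "Michael", "Reports": []}, {"Employee_FN": "Dwight", "Reports": []}]'

# A top-level JSON array parses without error and becomes a Python list
parsed = json.loads(flat_output)
print(type(parsed).__name__)

# Once the tree has a single root, serialize that one element to get an
# object enclosed by {} instead of a list enclosed by []
single_root = json.dumps(parsed[0])
print(single_root)
```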

I have also tried switching around the groupby and lambda elements in various configurations to reach the desired output. Any and all insight would be greatly appreciated! Thank you!

Update:

I changed my code block to this:

# 'records' is the full name of the orient that older pandas also accepted as 'r'
j = (df.groupby(['Level','Supervisor_FN','Supervisor_LN'], as_index=False)
             .apply(lambda x: x[['Level','Employee_FN','Employee_LN']].to_dict('records'))
             .reset_index()
             .rename(columns={0:'Reports'})
             .rename(columns={'Supervisor_FN':'Employee_FN'})
             .rename(columns={'Supervisor_LN':'Employee_LN'})
             .to_json(orient='records'))

print(json.dumps(json.loads(j), indent=2, sort_keys=True))

The new output is this:

[
  {
    "Employee_FN": "Michael",
    "Employee_LN": "Scott",
    "Level": 1,
    "Reports": [
      {
        "Employee_FN": "Jim",
        "Employee_LN": "Halpert",
        "Level": 1
      },
      {
        "Employee_FN": "Dwight",
        "Employee_LN": "Schrute",
        "Level": 1
      }
    ]
  },
  {
    "Employee_FN": "Jim",
    "Employee_LN": "Halpert",
    "Level": 2,
    "Reports": [
      {
        "Employee_FN": "Stanley",
        "Employee_LN": "Hudson",
        "Level": 2
      },
      {
        "Employee_FN": "Pam",
        "Employee_LN": "Beasley",
        "Level": 2
      }
    ]
  },
  {
    "Employee_FN": "Pam",
    "Employee_LN": "Beasley",
    "Level": 3,
    "Reports": [
      {
        "Employee_FN": "Ryan",
        "Employee_LN": "Howard",
        "Level": 3
      }
    ]
  },
  {
    "Employee_FN": "Ryan",
    "Employee_LN": "Howard",
    "Level": 4,
    "Reports": [
      {
        "Employee_FN": "Kelly",
        "Employee_LN": "Kapoor",
        "Level": 4
      },
      {
        "Employee_FN": "Meredith",
        "Employee_LN": "Palmer",
        "Level": 4
      }
    ]
  }
]

Problems:

1. The supervisor's Level is taken from the report's row, so the supervisor and the report show the same Level
2. The nesting only goes one level deep

For problem 1, adding a 'Sup_level' column with df['Sup_level'] = df['Level'] - 1 and extending the 'rename' bit to .rename(columns={0:'Reports', 'Sup_level':'Level', 'Supervisor_FN':'Employee_FN','Supervisor_LN':'Employee_LN'}) should work.
– EFT, Commented May 18, 2017 at 17:20
Thank you very much - that allows the levels to match appropriately. The issue of the resulting JSON only going one level deep still remains.
– OverflowingTheGlass, Commented May 18, 2017 at 17:24
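
The effect described in these two comments can be sketched without pandas. This is a plain-Python stand-in for the groupby, not the code from the comment; rows, groups, and flat are illustrative names, and the tuples mirror the question's frame. Subtracting 1 from each report's level gives the supervisor the right Level, but every group still holds only direct reports, so the result stays one level deep.

```python
from collections import defaultdict

# (Employee_FN, Employee_LN, Supervisor_FN, Supervisor_LN, Level), as in the question
rows = [
    ("Michael", "Scott", None, None, 0),
    ("Jim", "Halpert", "Michael", "Scott", 1),
    ("Dwight", "Schrute", "Michael", "Scott", 1),
    ("Stanley", "Hudson", "Jim", "Halpert", 2),
    ("Pam", "Beasley", "Jim", "Halpert", 2),
    ("Ryan", "Howard", "Pam", "Beasley", 3),
    ("Kelly", "Kapoor", "Ryan", "Howard", 4),
    ("Meredith", "Palmer", "Ryan", "Howard", 4),
]

# Group direct reports under (supervisor name, supervisor level = report level - 1)
groups = defaultdict(list)
for fn, ln, sup_fn, sup_ln, level in rows:
    if sup_fn is None:
        continue  # the top-level employee has no supervisor to be grouped under
    groups[(sup_fn, sup_ln, level - 1)].append(
        {"Employee_FN": fn, "Employee_LN": ln, "Level": level}
    )

# One record per supervisor; the reports themselves carry no nested 'Reports'
flat = [
    {"Employee_FN": fn, "Employee_LN": ln, "Level": level, "Reports": reports}
    for (fn, ln, level), reports in groups.items()
]
```

Each supervisor appears as a separate top-level record here, which is why linking records to one another, rather than grouping rows, is needed for deeper nesting.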

Igor Raush · Accepted Answer · 2017-05-18 19:23:27Z

4

This type of problem isn't particularly well-suited for Pandas; the data structure you're going after is recursive, not tabular.

Here is one possible solution.

from operator import itemgetter

employee_key = itemgetter('Employee_FN', 'Employee_LN')
supervisor_key = itemgetter('Supervisor_FN', 'Supervisor_LN')

def subset(dict_, keys):
    return {k: dict_[k] for k in keys}

# store employee references
cache = {}

# iterate over employees sorted by level, so supervisors are cached before reports
for row in df.sort_values('Level').to_dict('records'):

    # look up employee/supervisor references
    employee = cache.setdefault(employee_key(row), subset(row, keys=('Employee_FN', 'Employee_LN', 'Level')))
    supervisor = cache.get(supervisor_key(row), {})

    # link reports to employee
    supervisor.setdefault('Reports', []).append(employee)

# grab only top-level employees
# (.items() for Python 3; the answer as first posted used Python 2's .iteritems())
[rec for key, rec in cache.items() if rec['Level'] == 0]

[{'Employee_FN': 'Michael',
  'Employee_LN': 'Scott',
  'Level': 0,
  'Reports': [{'Employee_FN': 'Jim',
    'Employee_LN': 'Halpert',
    'Level': 1,
    'Reports': [{'Employee_FN': 'Stanley',
      'Employee_LN': 'Hudson',
      'Level': 2},
     {'Employee_FN': 'Pam',
      'Employee_LN': 'Beasley',
      'Level': 2,
      'Reports': [{'Employee_FN': 'Ryan',
        'Employee_LN': 'Howard',
        'Level': 3,
        'Reports': [{'Employee_FN': 'Kelly',
          'Employee_LN': 'Kapoor',
          'Level': 4},
         {'Employee_FN': 'Meredith',
          'Employee_LN': 'Palmer',
          'Level': 4}]}]}]},
   {'Employee_FN': 'Dwight', 'Employee_LN': 'Schrute', 'Level': 1}]}]
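
For readers without the frame at hand, here is a self-contained sketch of the same caching approach, ending with the JSON serialization step. It assumes the rows are available as a list of plain dicts, which is the shape df.to_dict('records') produces. The names records, build_hierarchy, and roots are illustrative and not part of the answer above. Unlike the loop above, it skips the append when no supervisor is cached instead of appending to a throwaway dict.

```python
import json
from operator import itemgetter

COLUMNS = ("Employee_FN", "Employee_LN", "Supervisor_FN", "Supervisor_LN", "Level")

# Plain-dict stand-in for df.to_dict('records'), with the question's rows
records = [dict(zip(COLUMNS, values)) for values in [
    ("Michael", "Scott", None, None, 0),
    ("Jim", "Halpert", "Michael", "Scott", 1),
    ("Dwight", "Schrute", "Michael", "Scott", 1),
    ("Stanley", "Hudson", "Jim", "Halpert", 2),
    ("Pam", "Beasley", "Jim", "Halpert", 2),
    ("Ryan", "Howard", "Pam", "Beasley", 3),
    ("Kelly", "Kapoor", "Ryan", "Howard", 4),
    ("Meredith", "Palmer", "Ryan", "Howard", 4),
]]

employee_key = itemgetter("Employee_FN", "Employee_LN")
supervisor_key = itemgetter("Supervisor_FN", "Supervisor_LN")


def build_hierarchy(rows):
    """Return the top-level employees, each with nested 'Reports' lists."""
    cache = {}
    # Sorting by Level ensures a supervisor is cached before any of their reports
    for row in sorted(rows, key=itemgetter("Level")):
        employee = cache.setdefault(
            employee_key(row),
            {k: row[k] for k in ("Employee_FN", "Employee_LN", "Level")},
        )
        supervisor = cache.get(supervisor_key(row))
        if supervisor is not None:
            # Append a reference, so reports added later show up under every ancestor
            supervisor.setdefault("Reports", []).append(employee)
    return [rec for rec in cache.values() if rec["Level"] == 0]


roots = build_hierarchy(records)

# A single root serializes to one {...} object, the shape the question asks for
print(json.dumps(roots[0], indent=2))
```

Because the cache stores references rather than copies, no explicit recursion is needed: each employee dict is shared between the cache and its supervisor's 'Reports' list.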

edited May 18, 2017 at 19:23

answered May 18, 2017 at 19:17

3 Comments

OverflowingTheGlass Over a year ago

Thank you. When I run your edited code, it returns AttributeError: 'dict' object has no attribute 'iteritems' for the last line. Also, how do I incorporate that into actually writing to the JSON format? Forgive me - I am quite new to all of this.

Igor Raush Over a year ago

@Cameron you must be using Python 3.x? Change iteritems() to items(), and it should work for you. The last line produces a serializable list. You can pass it directly to json.dumps(...) to produce a JSON string.

OverflowingTheGlass Over a year ago

It works perfectly, of course - thank you very much! If you happen to know of any good Python + D3 resources, I would be all ears! Thanks, again.
