Construct pandas DataFrame from items in nested dictionary

Question

Suppose I have a nested dictionary 'user_dict' with structure:

Level 1: UserId (Long Integer)
Level 2: Category (String)
Level 3: Assorted Attributes (floats, ints, etc..)

For example, an entry of this dictionary would be:

user_dict[12] = {
    "Category 1": {"att_1": 1, 
                   "att_2": "whatever"},
    "Category 2": {"att_1": 23, 
                   "att_2": "another"}}

each item in user_dict has the same structure and user_dict contains a large number of items which I want to feed to a pandas DataFrame, constructing the series from the attributes. In this case a hierarchical index would be useful for the purpose.

Specifically, my question is whether there exists a way to to help the DataFrame constructor understand that the series should be built from the values of the "level 3" in the dictionary?

If I try something like:

df = pandas.DataFrame(users_summary)

The items in "level 1" (the UserId's) are taken as columns, which is the opposite of what I want to achieve (have UserId's as index).

I know I could construct the series after iterating over the dictionary entries, but if there is a more direct way this would be very useful. A similar question would be asking whether it is possible to construct a pandas DataFrame from json objects listed in a file.

See this answer for simpler alternatives.

cs95
– cs95

2019-01-22 04:34:10 +00:00
Commented Jan 22, 2019 at 4:34 — cs95
– cs95, Commented Jan 22, 2019 at 4:34

maxymoo · Accepted Answer · 2017-01-09 23:36:38Z

219

A pandas MultiIndex consists of a list of tuples. So the most natural approach would be to reshape your input dict so that its keys are tuples corresponding to the multi-index values you require. Then you can just construct your dataframe using pd.DataFrame.from_dict, using the option orient='index':

user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

pd.DataFrame.from_dict({(i,j): user_dict[i][j] 
                           for i in user_dict.keys() 
                           for j in user_dict[i].keys()},
                       orient='index')


               att_1     att_2
12 Category 1      1  whatever
   Category 2     23   another
15 Category 1     10       foo
   Category 2     30       bar

An alternative approach would be to build your dataframe up by concatenating the component dataframes:

user_ids = []
frames = []

for user_id, d in user_dict.iteritems():
    user_ids.append(user_id)
    frames.append(pd.DataFrame.from_dict(d, orient='index'))

pd.concat(frames, keys=user_ids)

               att_1     att_2
12 Category 1      1  whatever
   Category 2     23   another
15 Category 1     10       foo
   Category 2     30       bar

edited Jan 9, 2017 at 23:36

maxymoo

36.7k12 gold badges97 silver badges121 bronze badges

answered Nov 27, 2012 at 10:05

Wouter Overmeire

69.7k10 gold badges67 silver badges44 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

naught101 Over a year ago

Is there a reasonable way to generalise this to work with arbitrary depth ragged lists? e.g. lists to an arbitrary depth, where some branches may be shorter than others, and a None or nan is used when shorter branches don't reach the end?

Wouter Overmeire Over a year ago

Have you looked at pandas json support (io tools) and normalization? pandas.pydata.org/pandas-docs/dev/io.html#normalization

arturomp Over a year ago

for me, the first method created a dataframe with a single index with tuples. the second method worked as desired/expected!

Wouter Overmeire Over a year ago

@cheremushkin 12 and 15 are now in the row 'id', if you tranpose (pandas.pydata.org/pandas-docs/stable/reference/api/…) they are in the column 'id'. You can also unstack (pandas.pydata.org/pandas-docs/stable/reference/api/…) It all depends on what you really need.

Madcat Over a year ago

in python 3 a dict no longer has iteritems method, in the 2nd approach this line for user_id, d in user_dict.iteritems(): should be changed for user_id, d in user_dict.items():

|

cs95 · Accepted Answer · 2019-01-22 03:41:56Z

74

pd.concat accepts a dictionary. With this in mind, it is possible to improve upon the currently accepted answer in terms of simplicity and performance by use a dictionary comprehension to build a dictionary mapping keys to sub-frames.

pd.concat({k: pd.DataFrame(v).T for k, v in user_dict.items()}, axis=0)

Or,

pd.concat({
        k: pd.DataFrame.from_dict(v, 'index') for k, v in user_dict.items()
    }, 
    axis=0)

              att_1     att_2
12 Category 1     1  whatever
   Category 2    23   another
15 Category 1    10       foo
   Category 2    30       bar

answered Jan 22, 2019 at 3:41

cs95

406k106 gold badges744 silver badges797 bronze badges

8 Comments

pg2455 Over a year ago

Brilliant! Much better :)

Lucas Aimaretto Over a year ago

How would you do it if you still had an further inner category? Such as 12:{cat1:{cat11:{att1:val1,att2:val2}}}. In other words: how would some one generalize the solution to an irrelevant number of categories ?

cs95 Over a year ago

@LucasAimaretto Usually arbitrarily nested structures can be flattened with json_normalize. I have another answer which shows how it works.

swiss_knight Over a year ago

Doesn't work if v is an single integer for example. Do you know an alternative in such case?

Tito Sanz Over a year ago

Why not to treat as a nested json? pd.json_normalize should do the work for you

|

tRosenflanz · Accepted Answer · 2021-03-24 21:24:20Z

37

This solution should work for arbitrary depth by flattening dictionary keys to a tuple chain

def flatten_dict(nested_dict):
    res = {}
    if isinstance(nested_dict, dict):
        for k in nested_dict:
            flattened_dict = flatten_dict(nested_dict[k])
            for key, val in flattened_dict.items():
                key = list(key)
                key.insert(0, k)
                res[tuple(key)] = val
    else:
        res[()] = nested_dict
    return res


def nested_dict_to_df(values_dict):
    flat_dict = flatten_dict(values_dict)
    df = pd.DataFrame.from_dict(flat_dict, orient="index")
    df.index = pd.MultiIndex.from_tuples(df.index)
    df = df.unstack(level=-1)
    df.columns = df.columns.map("{0[1]}".format)
    return df

answered Mar 24, 2021 at 21:24

tRosenflanz

8246 silver badges8 bronze badges

3 Comments

Wacao Over a year ago

This one is most generic solution, for n level depth and also handles short branches. +1

Ernesto Over a year ago

This solution does wonders! Thanks for sharing @Wacao

kpierce8 Over a year ago

flatten_dict worked great for my weird use case. The second part didn't work but I have a weirdly formatted data structure. Helpful though.

Melkor.cz · Accepted Answer · 2020-03-30 14:19:04Z

23

In case someone wants to get the data frame in a "long format" (leaf values have the same type) without multiindex, you can do this:

pd.DataFrame.from_records(
    [
        (level1, level2, level3, leaf)
        for level1, level2_dict in user_dict.items()
        for level2, level3_dict in level2_dict.items()
        for level3, leaf in level3_dict.items()
    ],
    columns=['UserId', 'Category', 'Attribute', 'value']
)

    UserId    Category Attribute     value
0       12  Category 1     att_1         1
1       12  Category 1     att_2  whatever
2       12  Category 2     att_1        23
3       12  Category 2     att_2   another
4       15  Category 1     att_1        10
5       15  Category 1     att_2       foo
6       15  Category 2     att_1        30
7       15  Category 2     att_2       bar

(I know the original question probably wants (I.) to have Levels 1 and 2 as multiindex and Level 3 as columns and (II.) asks about other ways than iteration over values in the dict. But I hope this answer is still relevant and useful (I.): to people like me who have tried to find a way to get the nested dict into this shape and google only returns this question and (II.): because other answers involve some iteration as well and I find this approach flexible and easy to read; not sure about performance, though.)

answered Mar 30, 2020 at 14:19

Melkor.cz

2,3171 gold badge19 silver badges15 bronze badges

1 Comment

grofte Over a year ago

This is great for visualizations and I'm sure if someone had the time they could make it into a function that just takes the dictionary and column list as arguments and infers the nesting depth from the columns.

Mishiko · Accepted Answer · 2014-11-07 14:44:33Z

So I used to use a for loop for iterating through the dictionary as well, but one thing I've found that works much faster is to convert to a panel and then to a dataframe. Say you have a dictionary d

import pandas as pd
d
{'RAY Index': {datetime.date(2014, 11, 3): {'PX_LAST': 1199.46,
'PX_OPEN': 1200.14},
datetime.date(2014, 11, 4): {'PX_LAST': 1195.323, 'PX_OPEN': 1197.69},
datetime.date(2014, 11, 5): {'PX_LAST': 1200.936, 'PX_OPEN': 1195.32},
datetime.date(2014, 11, 6): {'PX_LAST': 1206.061, 'PX_OPEN': 1200.62}},
'SPX Index': {datetime.date(2014, 11, 3): {'PX_LAST': 2017.81,
'PX_OPEN': 2018.21},
datetime.date(2014, 11, 4): {'PX_LAST': 2012.1, 'PX_OPEN': 2015.81},
datetime.date(2014, 11, 5): {'PX_LAST': 2023.57, 'PX_OPEN': 2015.29},
datetime.date(2014, 11, 6): {'PX_LAST': 2031.21, 'PX_OPEN': 2023.33}}}

The command

pd.Panel(d)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 4 (minor_axis)
Items axis: RAY Index to SPX Index
Major_axis axis: PX_LAST to PX_OPEN
Minor_axis axis: 2014-11-03 to 2014-11-06

where pd.Panel(d)[item] yields a dataframe

pd.Panel(d)['SPX Index']
2014-11-03  2014-11-04  2014-11-05 2014-11-06
PX_LAST 2017.81 2012.10 2023.57 2031.21
PX_OPEN 2018.21 2015.81 2015.29 2023.33

You can then hit the command to_frame() to turn it into a dataframe. I use reset_index as well to turn the major and minor axis into columns rather than have them as indices.

pd.Panel(d).to_frame().reset_index()
major   minor      RAY Index    SPX Index
PX_LAST 2014-11-03  1199.460    2017.81
PX_LAST 2014-11-04  1195.323    2012.10
PX_LAST 2014-11-05  1200.936    2023.57
PX_LAST 2014-11-06  1206.061    2031.21
PX_OPEN 2014-11-03  1200.140    2018.21
PX_OPEN 2014-11-04  1197.690    2015.81
PX_OPEN 2014-11-05  1195.320    2015.29
PX_OPEN 2014-11-06  1200.620    2023.33

Finally, if you don't like the way the frame looks you can use the transpose function of panel to change the appearance before calling to_frame() see documentation here http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Panel.transpose.html

Just as an example

pd.Panel(d).transpose(2,0,1).to_frame().reset_index()
major        minor  2014-11-03  2014-11-04  2014-11-05  2014-11-06
RAY Index   PX_LAST 1199.46    1195.323     1200.936    1206.061
RAY Index   PX_OPEN 1200.14    1197.690     1195.320    1200.620
SPX Index   PX_LAST 2017.81    2012.100     2023.570    2031.210
SPX Index   PX_OPEN 2018.21    2015.810     2015.290    2023.330

Hope this helps.

Panel is deprecated in more recent versions of pandas (v0.23 at the time of writing).

Pss · Accepted Answer · 2021-03-09 01:06:48Z

6

For other ways to represent the data, you don't need to do much. For example, if you just want the "outer" key to be an index, the "inner" key to be columns and the values to be cell values, this would do the trick:

df = pd.DataFrame.from_dict(user_dict, orient='index')

answered Mar 9, 2021 at 1:06

Pss

6631 gold badge6 silver badges16 bronze badges

3 Comments

PM0087 Over a year ago

I get error:

Anaconda3\lib\site-packages\pandas\core\internals\construction.py:309: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray   values = np.array([convert(v) for v in values])

questionto42 Over a year ago

Does not work for nested dict where the values stay in dictionary form. Still good if you want the right output shape (transposed) to start with.

odunayo12 Over a year ago

It works for dict() nested up to two levels. I have not tested any more levels.

El_1988 · Accepted Answer · 2020-10-05 04:41:09Z

5

Building on verified answer, for me this worked best:

ab = pd.concat({k: pd.DataFrame(v).T for k, v in data.items()}, axis=0)
ab.T

answered Oct 5, 2020 at 4:41

El_1988

3393 silver badges13 bronze badges

2 Comments

Saurabh Bade Over a year ago

It would be better if you could explain in details.

Robbie Capps Over a year ago

^^This is the correct answer... would edit but apparently "the suggested edit queue is full"

Collectives™ on Stack Overflow

Construct pandas DataFrame from items in nested dictionary

7 Answers 7

9 Comments

8 Comments

3 Comments

1 Comment

1 Comment

3 Comments

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

9 Comments

8 Comments

3 Comments

1 Comment

1 Comment

3 Comments

2 Comments

Linked

Related