Inserting "missing" multiindex rows into a Pandas Dataframe

Question

I have a pandas DataFrame with a two-level multiindex. The second level is numeric and supposed to be sorted and sequential for each unique value of the first-level index, but has gaps. How do I insert the "missing" rows? Sample input:

import pandas as pd
df = pd.DataFrame(list(range(5)),
                  index=pd.MultiIndex.from_tuples([('A',1), ('A',3),
                                                   ('B',2), ('B',3), ('B',6)]),
                  columns='value')
#     value
#A 1      0
#  3      1
#B 2      2
#  3      3
#  6      4

Expected output:

#     value
#A 1      0
#  2    NaN
#  3      1
#B 2      2
#  3      3
#  4    NaN
#  5    NaN
#  6      4

I suspect I could have used resample, but I am having trouble converting the numbers to anything date-like.

@Wen-Ben Thanks! That would be my last-resort option. I hate using loops in pandas. — DYZ
– DYZ, Commented Jan 30, 2019 at 22:00

Scott Boston · Accepted Answer · 2019-01-30 22:10:17Z

2

If there is a will, there is a way. I am not proud of this but, I think it works.

Try:

def f(x):
    levels = x.index.remove_unused_levels().levels
    x = x.reindex(pd.MultiIndex.from_product([levels[0], np.arange(levels[1][0], levels[1][-1]+1)]))
    return x

df.groupby(level=0, as_index=False, group_keys=False).apply(f)

Output:

     value
A 1    0.0
  2    NaN
  3    1.0
B 2    2.0
  3    3.0
  4    NaN
  5    NaN
  6    4.0

answered Jan 30, 2019 at 22:10

Scott Boston

154k15 gold badges160 silver badges207 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

DYZ · Accepted Answer · 2019-01-30 22:55:21Z

2

After much deliberations, I was able to come up with a solution myself. Judging by the fact of how lousy it is, the problem I am facing is not a very typical one.

new_index = d.index.to_frame()\
                .groupby(0)[1]\
                .apply(lambda x:
                         pd.Series(1, index=range(x.min(), x.max() + 1))).index
d.reindex(new_index)

answered Jan 30, 2019 at 22:55

DYZ

57.3k10 gold badges73 silver badges101 bronze badges

Comments

amirsina torfi · Accepted Answer · 2021-05-07 00:11:56Z

2

You can simply use the following depends on the missing index:

result.unstack(1).stack(0, dropna=False).fillna(0)

When you unstack, the pandas expand the df to have rows and columns and in the above example, level 1 index is going to be the column names. Then, again by stacking, you return the df to its original form, BUT, this time you need to make sure you use dropna=False so the NaN values are going to be there for missing indexes. In the end, using .fillna(0) is optional depends on what you want to do with the NaN values.

answered May 7, 2021 at 0:11

amirsina torfi

1812 silver badges5 bronze badges

1 Comment

FirefoxMetzger Over a year ago

This will also create (B, 1) which is not in the desired dataframe. It also doesn't insert the new values (B, 4) and (B, 5).

Jesper Mølgaard · Accepted Answer · 2024-01-31 13:59:07Z

1

Little late for the party i see, but for future travelers, I think I found a solution:

Use pandas multiindex from product function to generate all combinations of levels of indices:

df_new_index = pd.MultiIndex.from_product([
  df.index.get_level_values(0).unique(),
  df.index.get_level_values(1).unique()])

df_reindexed = df.reindex(df_new_index)

answered Jan 31, 2024 at 13:59

Jesper Mølgaard

16711 bronze badges

Comments

FirefoxMetzger · Accepted Answer · 2021-12-06 09:55:52Z

There's no accounting for taste, but I think falling back to list comprehension leads to slightly more readable code:

df.reindex(
    pd.MultiIndex.from_tuples([
        (level_0, level_1)
        for level_0 in df.reset_index(0).level_0.unique()
        for level_1 in range(
            df.reset_index(1).loc[level_0, "level_1"].min(),
            df.reset_index(1).loc[level_0, "level_1"].max()+1
        )
]))

# Output:
#value
#A  1   0.0
#   2   NaN
#   3   1.0
#B  2   2.0
#   3   3.0
#   4   NaN
#   5   NaN
#   6   4.0

Although this is of course slower than going down the apply route:

list-comprehension: 2.57 ms ± 19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
DYZ apply: 1.25 ms ± 8.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Scott's apply: 2.19 ms ± 9.84 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Collectives™ on Stack Overflow

Inserting "missing" multiindex rows into a Pandas Dataframe

5 Answers 5

Comments

Comments

1 Comment

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

Comments

Comments

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related