Efficient way for additional indexing of pandas dataframe

Question

I am working with a rather large binary (n-hot encoded) dataframe whose structure is similar to this toy example:

import pandas as pd

data = {
  'A' : [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
  'B' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'C' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
  'D' : [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
  'E' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

df = pd.DataFrame.from_dict(data)
df
    A  B  C  D  E
0   0  0  0  1  0
1   0  0  0  1  0
2   0  0  0  1  0
3   1  0  0  0  0
4   1  0  0  0  0
5   1  0  0  0  0
6   1  0  0  0  0
7   1  0  0  0  0
8   0  0  0  1  0
9   0  0  0  1  0
10  0  0  1  0  0
11  0  0  1  0  0
12  0  0  1  0  0
13  0  0  1  0  0

To make some extractions and searches more efficient, I would like to generate some additional index levels, in addition to the default integer index, like this:

          A  B  C  D  E
0   1  D  0  0  0  1  0
1   1  D  0  0  0  1  0
2   1  D  0  0  0  1  0
3   1  A  1  0  0  0  0
4   1  A  1  0  0  0  0
5   1  A  1  0  0  0  0
6   1  A  1  0  0  0  0
7   1  A  1  0  0  0  0
8   2  D  0  0  0  1  0
9   2  D  0  0  0  1  0
10  1  C  0  0  1  0  0
11  1  C  0  0  1  0  0
12  1  C  0  0  1  0  0
13  1  C  0  0  1  0  0

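The two new index levels show which column contains the 1 and which occurrence of that column's run it is (for example, rows 8 and 9 are the second run of D). Even better would be a third level that counts the position of each row within its run: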

             A  B  C  D  E
0   1  1  D  0  0  0  1  0
1   1  2  D  0  0  0  1  0
2   1  3  D  0  0  0  1  0
3   1  1  A  1  0  0  0  0
4   1  2  A  1  0  0  0  0
5   1  3  A  1  0  0  0  0
6   1  4  A  1  0  0  0  0
7   1  5  A  1  0  0  0  0
8   2  1  D  0  0  0  1  0
9   2  2  D  0  0  0  1  0
10  1  1  C  0  0  1  0  0
11  1  2  C  0  0  1  0  0
12  1  3  C  0  0  1  0  0
13  1  4  C  0  0  1  0  0

What would be the most efficient way to generate these indexes?

Chrysophylaxs · Accepted Answer · 2022-12-10 21:20:49Z

Pandas 1.5 introduced pd.from_dummies to decode one-hot variables. I played around with groupbys and cumsums to come up with the following, but there could definitely be better ways to compute some of the levels (the "group_num" one in particular :) ):

import pandas as pd

data = {
  'A' : [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
  'B' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'C' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
  'D' : [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
  'E' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

df = pd.DataFrame.from_dict(data)

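# Decode the one-hot rows back into a single column of labels (requires pandas >= 1.5)
categorical = pd.from_dummies(df)[""].rename("category")
# Number consecutive runs of the same label: 1, 1, 1, 2, 2, ...
groups = (categorical != categorical.shift()).cumsum().rename("group_num")

index = pd.MultiIndex.from_arrays([
    df.index,
    # Occurrence number of each run within its category (keeps the name "group_num")
    pd.concat([categorical, groups], axis=1).groupby("category")["group_num"].transform(lambda x: (x != x.shift()).cumsum()),
    # Position of each row within its run, starting at 1
    groups.groupby(groups).cumcount().add(1).rename("group_id"),
    categorical
])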

print(df.set_index(index))

Output:

                                A  B  C  D  E
   group_num group_id category               
0  1         1        D         0  0  0  1  0
1  1         2        D         0  0  0  1  0
2  1         3        D         0  0  0  1  0
3  1         1        A         1  0  0  0  0
4  1         2        A         1  0  0  0  0
5  1         3        A         1  0  0  0  0
6  1         4        A         1  0  0  0  0
7  1         5        A         1  0  0  0  0
8  2         1        D         0  0  0  1  0
9  2         2        D         0  0  0  1  0
10 1         1        C         0  0  1  0  0
11 1         2        C         0  0  1  0  0
12 1         3        C         0  0  1  0  0
13 1         4        C         0  0  1  0  0
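
For comparison, here is a minimal sketch of one possible alternative, not the method used above. It swaps pd.from_dummies for DataFrame.idxmax, so it should also run on pandas versions older than 1.5, and it replaces the transform with a lambda by a grouped cumsum over "first row of a run" flags. It assumes every row contains exactly one 1. The level names "occurrence" and "position" are my own choices for illustration, not names from the question or the answer.

```python
import pandas as pd

data = {
    'A': [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'B': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'C': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    'D': [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    'E': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
df = pd.DataFrame(data)

# Label of the column holding the 1 in each row
# (assumption: exactly one 1 per row, otherwise idxmax picks the first maximum).
category = df.idxmax(axis=1).rename("category")

# Run id: increases by one every time the label changes.
run = (category != category.shift()).cumsum()

# Position of each row inside its run, starting at 1.
position = run.groupby(run).cumcount().add(1).rename("position")

# Flag the first row of every run, then count those flags per category:
# this gives the occurrence number of the run within its category.
first_in_run = (run != run.shift()).astype(int)
occurrence = first_in_run.groupby(category).cumsum().rename("occurrence")

# Append the three new levels to the existing integer index.
indexed = df.set_index([occurrence, position, category], append=True)

# Usage: all rows of category D (the "category" level is dropped from the result).
all_d = indexed.xs("D", level="category")

# Usage: only the second run of D, selected with a boolean mask on index levels.
levels = indexed.index
mask = (levels.get_level_values("category") == "D") & (
    levels.get_level_values("occurrence") == 2
)
second_d_run = indexed[mask]  # original rows 8 and 9
```

Whether this is actually faster than the groupby/transform version on a large frame is something to measure on your own data; the sketch only avoids the Python-level lambda.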
