I am working on a rather large binary (n-hot encoded) dataframe whose structure is similar to this toy example:

import pandas as pd

data = {
  'A' : [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
  'B' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'C' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
  'D' : [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
  'E' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

df = pd.DataFrame.from_dict(data)
df
    A  B  C  D  E
0   0  0  0  1  0
1   0  0  0  1  0
2   0  0  0  1  0
3   1  0  0  0  0
4   1  0  0  0  0
5   1  0  0  0  0
6   1  0  0  0  0
7   1  0  0  0  0
8   0  0  0  1  0
9   0  0  0  1  0
10  0  0  1  0  0
11  0  0  1  0  0
12  0  0  1  0  0
13  0  0  1  0  0

To make some extractions and searches more efficient, I would like to generate additional index levels alongside the default one, like this:

          A  B  C  D  E
0   1  D  0  0  0  1  0
1   1  D  0  0  0  1  0
2   1  D  0  0  0  1  0
3   1  A  1  0  0  0  0
4   1  A  1  0  0  0  0
5   1  A  1  0  0  0  0
6   1  A  1  0  0  0  0
7   1  A  1  0  0  0  0
8   2  D  0  0  0  1  0
9   2  D  0  0  0  1  0
10  1  C  0  0  1  0  0
11  1  C  0  0  1  0  0
12  1  C  0  0  1  0  0
13  1  C  0  0  1  0  0

The two new columns show which column contains the 1 and which appearance (consecutive run) of that column it is. Even better would be a third one, giving the row's position within that run:

             A  B  C  D  E
0   1  1  D  0  0  0  1  0
1   1  2  D  0  0  0  1  0
2   1  3  D  0  0  0  1  0
3   1  1  A  1  0  0  0  0
4   1  2  A  1  0  0  0  0
5   1  3  A  1  0  0  0  0
6   1  4  A  1  0  0  0  0
7   1  5  A  1  0  0  0  0
8   2  1  D  0  0  0  1  0
9   2  2  D  0  0  0  1  0
10  1  1  C  0  0  1  0  0
11  1  2  C  0  0  1  0  0
12  1  3  C  0  0  1  0  0
13  1  4  C  0  0  1  0  0

What would be the most efficient way to generate these indexes?

1 Answer

Pandas 1.5 introduced pd.from_dummies, which decodes one-hot columns back into a single categorical column. Using a few groupbys and cumsums I came up with the following, but there may well be better ways to compute some of the levels (the "group_num" column in particular):

import pandas as pd

data = {
  'A' : [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
  'B' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'C' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
  'D' : [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
  'E' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

df = pd.DataFrame.from_dict(data)

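# Decode the one-hot columns into a single Series of column labels.
categorical = pd.from_dummies(df)[""].rename("category")
# Number each run of consecutive identical labels: 1, 2, 3, ...
groups = (categorical != categorical.shift()).cumsum().rename("group_num")

index = pd.MultiIndex.from_arrays([
    df.index,
    # Per category, renumber its runs from 1 (which appearance of that label this is).
    pd.concat([categorical, groups], axis=1).groupby("category")["group_num"].transform(lambda x: (x != x.shift()).cumsum()),
    # Position of each row within its run, starting at 1.
    groups.groupby(groups).cumcount().add(1).rename("group_id"),
    categorical
])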

print(df.set_index(index))

Output:

                                A  B  C  D  E
   group_num group_id category               
0  1         1        D         0  0  0  1  0
1  1         2        D         0  0  0  1  0
2  1         3        D         0  0  0  1  0
3  1         1        A         1  0  0  0  0
4  1         2        A         1  0  0  0  0
5  1         3        A         1  0  0  0  0
6  1         4        A         1  0  0  0  0
7  1         5        A         1  0  0  0  0
8  2         1        D         0  0  0  1  0
9  2         2        D         0  0  0  1  0
10 1         1        C         0  0  1  0  0
11 1         2        C         0  0  1  0  0
12 1         3        C         0  0  1  0  0
13 1         4        C         0  0  1  0  0
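
As one possible simplification of the appearance level, here is a minimal sketch that avoids the lambda by taking a per-category cumulative sum of the run-start flags. It assumes every row contains exactly one 1 (true for the toy data), so df.idxmax(axis=1) can stand in for pd.from_dummies on older pandas versions. The names run_start, run_id, appearance, position and indexed are mine, chosen for illustration, and I have not benchmarked this on a large frame.

```python
import pandas as pd

data = {
    "A": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "B": [0] * 14,
    "C": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "D": [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    "E": [0] * 14,
}
df = pd.DataFrame(data)

# Assumption: exactly one 1 per row, so idxmax returns that column's label.
category = df.idxmax(axis=1).rename("category")

# True on the first row of every run of consecutive identical labels.
run_start = category != category.shift()

# Global run counter (1, 2, 3, ...), used only as a grouping key below.
run_id = run_start.cumsum()

# Appearance number: how many runs of this category have started so far.
appearance = run_start.astype(int).groupby(category).cumsum().rename("appearance")

# Position of each row inside its run, starting at 1.
position = category.groupby(run_id).cumcount().add(1).rename("position")

# Append the three new levels to the existing row index.
indexed = df.set_index([appearance, position, category], append=True)

# Example lookups on the new levels.
all_d = indexed.xs("D", level="category")
second_d = indexed.xs(("D", 2), level=["category", "appearance"])
print(second_d)
```

The xs calls show the kind of extraction the extra levels are meant for: all_d holds every row where D is set, and second_d only the rows from the second run of D.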