0

I have the following dataframe.

data = [[0, 0, 0, 0, 1], [0, 1, 0, 0, 1], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0]]
labels = ['cat', 'dog', 'duck', 'fish', 'horse']
df = pd.DataFrame(data, columns = labels)
df: 
    cat dog duck fish horse
0    0   0   0    0    1
1    0   1   0    0    1
2    1   1   0    0    0
3    0   1   0    0    0

I have got the 0's and 1's from another dataframe based on a certain condition. I want to combine the column names for values corresponding to true values. i.e 1's into a list and put it into a new column at the end of the dataframe.

I want my result to look like this.

   cat  dog  duck  fish horse   result
0   0    0    0     0    1     [horse]
1   0    1    0     0    1     [dog, horse]
2   1    1    0     0    0     [cat, dog]
3   0    1    0     0    0     [dog] 

I have about 108 columns in my original dataframe and around 3500 rows. What is the best way to do this?

PS: I haven't been successful in finding a way to do this.

Thanks

2 Answers 2

1

Filter columns names if 1 in values per rows:

c = df.columns.to_numpy()
df['result'] = df.apply(lambda x: list(c[x == 1]), axis=1)

Alternative is faster:

c = df.columns.to_numpy()
df['result'] = [c[x == 1] for x in df.to_numpy()]

print (df)
   cat  dog  duck  fish  horse        result
0    0    0     0     0      1       [horse]
1    0    1     0     0      1  [dog, horse]
2    1    1     0     0      0    [cat, dog]
3    0    1     0     0      0         [dog]

#[4000 rows x 100 columns]
df = pd.concat([df] * 1000, ignore_index=True)
df = pd.concat([df] * 20, ignore_index=True, axis=1).add_prefix('test')
    

In [38]: %%timeit
    ...: c = df.columns.to_numpy()
    ...: df.apply(lambda x: list(c[x == 1]), axis=1)
    ...: 
    ...: 
526 ms ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [62]: %%timeit
    ...: c = df.columns.to_numpy()
    ...: [c[x == 1] for x in df.to_numpy()]
    ...: 
    ...: 
12.1 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Anothter solutions:

In [58]: %%timeit
    ...: df.dot(df.columns + ',').str.strip(',').str.split(',')
    ...: 
50.7 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [63]: %%timeit
    ...: df.mask(df.eq(0)).stack().reset_index(-1).groupby(level=0).agg({'level_1' : list }).values
    ...: 
    ...: 
162 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Sign up to request clarification or add additional context in comments.

1 Comment

dot solution is also working without timeit. Maybe it's because df['result'] = as it'll add the new column not so sure. Can you tell me the exact reason if you've some idea?
0

You can try:

df['result'] = df.dot(df.columns + ',').str.strip(',').str.split(',')

Alternative:

df['result'] = df.mask(df.eq(0)).stack().reset_index(-1).groupby(level=0).agg({'level_1' : list }).values

OUTPUT:

   cat  dog  duck  fish  horse        result
0    0    0     0     0      1       [horse]
1    0    1     0     0      1  [dog, horse]
2    1    1     0     0      0    [cat, dog]
3    0    1     0     0      0         [dog]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.