Combine values from multiple columns into a list in each row using python

Question

I have the following dataframe.

import pandas as pd

data = [[0, 0, 0, 0, 1], [0, 1, 0, 0, 1], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0]]
labels = ['cat', 'dog', 'duck', 'fish', 'horse']
df = pd.DataFrame(data, columns=labels)

df: 
    cat dog duck fish horse
0    0   0   0    0    1
1    0   1   0    0    1
2    1   1   0    0    0
3    0   1   0    0    0

The 0's and 1's come from another dataframe, based on a certain condition. For each row, I want to collect the names of the columns whose value is true (i.e. 1) into a list, and put that list into a new column at the end of the dataframe.

I want my result to look like this.

   cat  dog  duck  fish horse   result
0   0    0    0     0    1     [horse]
1   0    1    0     0    1     [dog, horse]
2   1    1    0     0    0     [cat, dog]
3   0    1    0     0    0     [dog]

I have about 108 columns in my original dataframe and around 3500 rows. What is the best way to do this?

PS: I haven't been successful in finding a way to do this.

Thanks

jezrael · Accepted Answer · 2021-06-18 07:01:11Z

1

Filter the column names, keeping those where the row's value is 1:

c = df.columns.to_numpy()
df['result'] = df.apply(lambda x: list(c[x == 1]), axis=1)

A faster alternative is a list comprehension over the underlying NumPy array:

c = df.columns.to_numpy()
df['result'] = [c[x == 1] for x in df.to_numpy()]

print(df)
   cat  dog  duck  fish  horse        result
0    0    0     0     0      1       [horse]
1    0    1     0     0      1  [dog, horse]
2    1    1     0     0      0    [cat, dog]
3    0    1     0     0      0         [dog]
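One detail worth knowing about the list comprehension above: `c[x == 1]` is a NumPy array, so the new column holds arrays that merely print like lists. The following is a minimal sketch of the same boolean-mask idea that stores plain Python lists instead. The helper name `names_where_one` is hypothetical, introduced only for this illustration.

```python
import pandas as pd


def names_where_one(frame):
    """Return one Python list of column names per row, keeping columns equal to 1.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    names = frame.columns.to_numpy()
    # Mask the array of names with each row's boolean pattern;
    # .tolist() converts the resulting NumPy array to a plain list.
    return [names[row == 1].tolist() for row in frame.to_numpy()]


data = [[0, 0, 0, 0, 1], [0, 1, 0, 0, 1], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0]]
labels = ['cat', 'dog', 'duck', 'fish', 'horse']
df = pd.DataFrame(data, columns=labels)

# Compute from the indicator columns only, then attach the lists as a new column.
df['result'] = pd.Series(names_where_one(df[labels]), index=df.index)
print(df)
```

With this approach a row that contains no 1 at all simply gets an empty list.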

Timings on a larger DataFrame:

# 4000 rows x 100 columns, built from the original 5-column df (before 'result' is added)
df = pd.concat([df] * 1000, ignore_index=True)
df = pd.concat([df] * 20, ignore_index=True, axis=1).add_prefix('test')

In [38]: %%timeit
    ...: c = df.columns.to_numpy()
    ...: df.apply(lambda x: list(c[x == 1]), axis=1)
    ...: 
    ...: 
526 ms ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [62]: %%timeit
    ...: c = df.columns.to_numpy()
    ...: [c[x == 1] for x in df.to_numpy()]
    ...: 
    ...: 
12.1 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Other solutions:

In [58]: %%timeit
    ...: df.dot(df.columns + ',').str.strip(',').str.split(',')
    ...: 
50.7 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [63]: %%timeit
    ...: df.mask(df.eq(0)).stack().reset_index(-1).groupby(level=0).agg({'level_1' : list }).values
    ...: 
    ...: 
162 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
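The timings above come from IPython's `%%timeit`. As a rough sketch for reproducing the comparison of the first two approaches in a plain script, the standard-library `timeit` module can be used. The function names `with_apply` and `with_comprehension` are made up for this example, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

data = [[0, 0, 0, 0, 1], [0, 1, 0, 0, 1], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0]]
labels = ['cat', 'dog', 'duck', 'fish', 'horse']
base = pd.DataFrame(data, columns=labels)

# Same construction as in the answer: 4000 rows x 100 columns.
big = pd.concat([base] * 1000, ignore_index=True)
big = pd.concat([big] * 20, ignore_index=True, axis=1).add_prefix('test')

names = big.columns.to_numpy()


def with_apply():
    # Row-wise apply: builds a Series object for every row.
    return big.apply(lambda x: list(names[x == 1]), axis=1)


def with_comprehension():
    # Iterates over the raw NumPy rows instead.
    return [names[row == 1] for row in big.to_numpy()]


# Take the best of a few single runs for each approach.
t_apply = min(timeit.repeat(with_apply, number=1, repeat=3))
t_comp = min(timeit.repeat(with_comprehension, number=1, repeat=3))
print(f"apply:         {t_apply * 1000:.1f} ms")
print(f"comprehension: {t_comp * 1000:.1f} ms")
```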

edited Jun 18, 2021 at 7:01

answered Jun 18, 2021 at 6:08

jezrael


1 Comment

Nk03 Over a year ago

The dot solution also works outside of timeit. Maybe the difference is because of the df['result'] = assignment, as it adds the new column; I'm not so sure. Can you tell me the exact reason if you have some idea?

Nk03 · Accepted Answer · 2021-06-18 06:07:10Z

0

You can try:

df['result'] = df.dot(df.columns + ',').str.strip(',').str.split(',')
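Why the dot trick works may not be obvious, so here is a small pure-Python sketch of what happens for a single row. It only mimics the arithmetic and is not how pandas implements `dot` internally. Each 0/1 value is multiplied by `name + ','`, and the products are concatenated; the variable names are chosen for this illustration.

```python
labels = ['cat', 'dog', 'duck', 'fish', 'horse']
row = [0, 1, 0, 0, 1]

# int * str repeats the string: 0 * 'cat,' is '' and 1 * 'dog,' is 'dog,'.
# Summing the products therefore concatenates the names of the "1" columns.
joined = ''.join(value * (name + ',') for value, name in zip(row, labels))
print(joined)  # dog,horse,

# Drop the trailing comma, then split back into a list of names.
result = joined.strip(',').split(',')
print(result)  # ['dog', 'horse']

# Edge case: a row with no 1 at all gives an empty string,
# and ''.split(',') is [''] (one empty string), not an empty list.
all_zero = ''.join(value * (name + ',') for value, name in zip([0] * 5, labels))
print(all_zero.strip(',').split(','))  # ['']
```

The same arithmetic shows the limits of the approach: a value of 2 would repeat the name twice, and a column name that itself contains a comma would be split into two entries.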

Alternative:

df['result'] = df.mask(df.eq(0)).stack().reset_index(-1).groupby(level=0).agg({'level_1' : list }).values

OUTPUT:

   cat  dog  duck  fish  horse        result
0    0    0     0     0      1       [horse]
1    0    1     0     0      1  [dog, horse]
2    1    1     0     0      0    [cat, dog]
3    0    1     0     0      0         [dog]
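The mask/stack/groupby chain is easier to follow when broken into steps. The sketch below makes two deliberate changes, both assumptions of this example rather than part of the answer: it calls `dropna()` explicitly after `stack()` so it does not rely on `stack()` discarding missing values, and it assigns the grouped Series directly (aligned on the index) instead of using `.values`.

```python
import pandas as pd

data = [[0, 0, 0, 0, 1], [0, 1, 0, 0, 1], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0]]
labels = ['cat', 'dog', 'duck', 'fish', 'horse']
df = pd.DataFrame(data, columns=labels)

# 1. Replace every 0 with NaN, so only the 1's remain as real values.
masked = df.mask(df.eq(0))

# 2. Reshape to long form: one entry per (row label, column name) pair,
#    then drop the NaN entries explicitly.
long_form = masked.stack().dropna()

# 3. Move the column-name level out of the index; since that level is
#    unnamed, pandas calls the new column 'level_1'.
names = long_form.reset_index(level=-1)['level_1']

# 4. Collect the column names of each original row into a list.
per_row = names.groupby(level=0).agg(list)

# 5. Assign by index alignment rather than by position.
df['result'] = per_row
print(df)
```

Rows without any 1 disappear in step 2, so the grouped result can be shorter than the DataFrame. Assigning `.values` in that case raises a length-mismatch error, while the index-aligned assignment leaves those rows as NaN.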

edited Jun 18, 2021 at 6:07

answered Jun 18, 2021 at 6:01

Nk03

