How to create multiple columns using pandas apply function

Question

Assume I have the following pandas DataFrame, whose n columns are named u0 to u(n-1) (here n = 3).

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=["u0","u1","u2"])
print(df)
         u0        u1        u2
0 -0.254454 -0.227589 -0.208454
1 -0.071567 -2.878662 -0.094863
2 -0.100024 -2.295788 -0.103415
3  0.091116 -0.143777  0.874170
4 -1.398530 -1.248449 -0.707336

Now I want to compute n new columns named p0 to p(n-1), where each cell holds the original value divided by the sum of its row. For example, for cell (0,0): p(0,0) = u(0,0) / (u(0,0) + u(0,1) + u(0,2)).

At the moment I do this by applying a function p to each row. The result is a new DataFrame, whose columns I rename before concatenating it with the original.

def p(row):
    u = row.loc["u0":"u2"]
    return u / u.sum()

df2 = df.apply(p, axis=1)
df2.columns = ["p0","p1","p2"]
df = pd.concat([df, df2], axis=1)

print(df)

         u0        u1        u2         p0          p1          p2
0 -0.254454 -0.227589 -0.208454 0.36850848 0.329601722 0.301889798
1 -0.071567 -2.878662 -0.094863 0.02350241 0.945344837 0.031152753
...

I'm not sure whether this is the Pythonic way or whether it is fast enough. Later I will have many thousands of rows and about 100 columns (the number of columns is not fixed at 3 as in this example).

Thank you very much for any ideas, comments, or suggestions.

KRKirov · Accepted Answer · 2021-02-06 14:36:28Z

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=["u0","u1","u2"])

# Divide every row by its row sum in one vectorized step, then rename the columns.
# axis=0 aligns the Series of row sums on the row index.
df_1 = df.div(df.sum(axis=1), axis=0)
df_1.columns = df_1.columns.str.replace('u', 'p')

df = df.join(df_1)

Output:

df
Out[37]: 
         u0        u1        u2
0 -0.899546 -0.069913 -0.668208
1  0.554489 -2.039013 -0.823227
2 -1.338628  0.668411  0.170418
3 -0.616199  0.738712 -0.471407
4 -0.559914  0.856356  0.178957

df_1
Out[39]: 
         p0        p1        p2
0  0.549285  0.042690  0.408024
1 -0.240272  0.883550  0.356723
2  2.678337 -1.337362 -0.340974
3  1.766148 -2.117293  1.351145
4 -1.177774  1.801339  0.376435

df
Out[41]: 
         u0        u1        u2        p0        p1        p2
0 -0.899546 -0.069913 -0.668208  0.549285  0.042690  0.408024
1  0.554489 -2.039013 -0.823227 -0.240272  0.883550  0.356723
2 -1.338628  0.668411  0.170418  2.678337 -1.337362 -0.340974
3 -0.616199  0.738712 -0.471407  1.766148 -2.117293  1.351145
4 -0.559914  0.856356  0.178957 -1.177774  1.801339  0.376435
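
When the number of columns is not fixed, the same three stages can be wrapped in one function. The following is only a sketch and not part of the original answer: the helper name `add_row_shares` and its `src_prefix` / `dst_prefix` parameters are hypothetical, and it assumes the source columns are exactly those whose names start with `u`. The demo data uses uniform positive values (rather than `randn`) so that no row sum is close to zero.

```python
import numpy as np
import pandas as pd


def add_row_shares(df, src_prefix="u", dst_prefix="p"):
    """Hypothetical helper: append one share column per source column."""
    # Select every column whose name starts with src_prefix (assumed naming scheme).
    src = df.loc[:, df.columns.str.startswith(src_prefix)]
    # Divide each row by its own sum; axis=0 aligns the row sums on the index.
    shares = src.div(src.sum(axis=1), axis=0)
    # Rename u0 -> p0, u1 -> p1, ... by swapping only the leading prefix.
    shares = shares.rename(columns=lambda c: dst_prefix + c[len(src_prefix):])
    # join returns a new DataFrame; the input is left unchanged.
    return df.join(shares)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 4)) + 0.1, columns=[f"u{i}" for i in range(4)])
out = add_row_shares(df)
print(out.columns.tolist())
```

Because each share is a value divided by its row sum, the new `p` columns of every row should add up to 1, which makes for a quick sanity check.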

2 Comments

Bastian Over a year ago

Doing the calculation on all rows in one step is really useful here, but I still need three steps (1: calculation, 2: rename, 3: join). Is there a more generic way to apply any function that creates a variable number of columns?

KRKirov Over a year ago

I am not sure I understand your question. For the specific problem you posted, there is no room to simplify the process further. It is true that there are three stages, but they perform well because each one operates on the whole data at once. Your approach with apply also has three steps, but .apply() works row by row and is therefore far less efficient than the vectorized approach.
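
The performance claim can be checked on your own machine. The following is a rough timing sketch, not a rigorous benchmark: the frame size (2,000 rows by 100 columns), the function names `row_wise` and `vectorized`, and the number of repetitions are all arbitrary choices made for illustration. Actual timings depend on hardware and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary illustrative size: 2,000 rows and 100 columns of positive values.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((2000, 100)) + 0.1,
    columns=[f"u{i}" for i in range(100)],
)


def row_wise(frame):
    # Row-by-row: the lambda is called once per row.
    return frame.apply(lambda row: row / row.sum(), axis=1)


def vectorized(frame):
    # Whole-frame: one division, with the row sums aligned on the index.
    return frame.div(frame.sum(axis=1), axis=0)


# Both approaches should produce the same numbers.
same = np.allclose(row_wise(df).to_numpy(), vectorized(df).to_numpy())

t_apply = timeit.timeit(lambda: row_wise(df), number=3)
t_vec = timeit.timeit(lambda: vectorized(df), number=3)
print(f"same result: {same}")
print(f"apply: {t_apply:.3f}s  vectorized: {t_vec:.3f}s (3 runs each)")
```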
