
Suppose I have the following pandas DataFrame, whose n columns are named u0 to u(n-1) (here n = 3):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=["u0","u1","u2"])
print(df)
         u0        u1        u2
0 -0.254454 -0.227589 -0.208454
1 -0.071567 -2.878662 -0.094863
2 -0.100024 -2.295788 -0.103415
3  0.091116 -0.143777  0.874170
4 -1.398530 -1.248449 -0.707336

Now I want to compute n new columns named p0 to p(n-1), where each cell holds the original value divided by the sum of its row. For example, for cell (0,0): p(0,0) = u(0,0) / (u(0,0) + u(0,1) + u(0,2))

At the moment I do this by applying a function p to each row. The result is a new DataFrame, whose columns I rename before concatenating it with the original one.

def p(row):
    u = row.loc["u0":"u2"]
    return u / u.sum()

df2 = df.apply(p, axis=1)
df2.columns = ["p0","p1","p2"]
df = pd.concat([df, df2], axis=1)

print(df)

         u0        u1        u2         p0          p1          p2
0 -0.254454 -0.227589 -0.208454 0.36850848 0.329601722 0.301889798
1 -0.071567 -2.878662 -0.094863 0.02350241 0.945344837 0.031152753
...

I'm not sure whether this is the Pythonic way or whether it is fast enough. Later I will have many thousands of rows and about 100 columns (the column count is not fixed at 3 as in this example code).

Thank you very much for any ideas, comments, or suggestions.

1 Answer

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=["u0","u1","u2"])

# Do the calculation on all rows in one step, then rename the columns
df_1 = df.div(df.sum(axis=1).values.reshape(-1, 1), axis=1)
df_1.columns = df_1.columns.str.replace('u', 'p')

df = df.join(df_1)

Output:

df
Out[37]: 
         u0        u1        u2
0 -0.899546 -0.069913 -0.668208
1  0.554489 -2.039013 -0.823227
2 -1.338628  0.668411  0.170418
3 -0.616199  0.738712 -0.471407
4 -0.559914  0.856356  0.178957

df_1
Out[39]: 
         p0        p1        p2
0  0.549285  0.042690  0.408024
1 -0.240272  0.883550  0.356723
2  2.678337 -1.337362 -0.340974
3  1.766148 -2.117293  1.351145
4 -1.177774  1.801339  0.376435

df
Out[41]: 
         u0        u1        u2        p0        p1        p2
0 -0.899546 -0.069913 -0.668208  0.549285  0.042690  0.408024
1  0.554489 -2.039013 -0.823227 -0.240272  0.883550  0.356723
2 -1.338628  0.668411  0.170418  2.678337 -1.337362 -0.340974
3 -0.616199  0.738712 -0.471407  1.766148 -2.117293  1.351145
4 -0.559914  0.856356  0.178957 -1.177774  1.801339  0.376435
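
A slightly shorter variant of the same idea, offered as a sketch rather than a replacement: `DataFrame.div` accepts the row sums as a Series when `axis=0` is passed, so the reshape is not needed, and `rename` with a function handles any number of `u` columns. The seed and the names `shares` and `result` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
df = pd.DataFrame(rng.standard_normal((5, 3)), columns=["u0", "u1", "u2"])

# Divide every row by its own sum; axis=0 aligns the row sums with the index
shares = df.div(df.sum(axis=1), axis=0)

# Rename u0..u2 to p0..p2, then append the new columns to the original ones
shares = shares.rename(columns=lambda c: "p" + c[1:])
result = df.join(shares)
```

By construction, the p columns of each row sum to 1 (as long as the row sum is not zero).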

2 Comments

Doing the calculation on all rows in one step is really useful in this case, but I still need three steps (1: calculation, 2: rename, 3: join). Is there a more generic way to apply an arbitrary function that creates a variable number of columns?
I am not sure I understand your question. For the specific problem you have posted, there is no room to simplify the process further. There are indeed three stages, but they are efficient because each one operates on the whole DataFrame at once. The approach you have taken with apply also has three steps, but .apply() with axis=1 calls the function once per row and is therefore far less efficient than the vectorized approach.
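
One possible reading of the question in the first comment is that the three stages should be wrapped so that any whole-frame function can be plugged in. The sketch below does that under stated assumptions: `with_derived` and `row_share` are hypothetical names, not pandas API, and the function passed in is assumed to return a DataFrame with the same index as its input. It also cross-checks the vectorized result against the row-wise `apply` from the question.

```python
import numpy as np
import pandas as pd

def with_derived(df, func, rename):
    """Hypothetical helper: compute func(df), rename its columns, join to df."""
    derived = func(df).rename(columns=rename)
    return df.join(derived)

def row_share(frame):
    # Vectorized: each value divided by the sum of its row
    return frame.div(frame.sum(axis=1), axis=0)

n = 4  # the column count is not fixed; 4 is an arbitrary choice here
rng = np.random.default_rng(1)  # arbitrary seed for this sketch
df = pd.DataFrame(rng.standard_normal((6, n)),
                  columns=[f"u{i}" for i in range(n)])

# u0..u3 become p0..p3 in the derived columns
out = with_derived(df, row_share, lambda c: c.replace("u", "p", 1))

# Row-wise version from the question, kept only as a correctness check
by_row = df.apply(lambda row: row / row.sum(), axis=1)
```

The helper still performs the same three steps internally; it only hides them behind one call, so the cost stays that of the vectorized computation.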
