Update multiple columns per row with loop through pandas dataframe

Question

I've reviewed several posts on here about better ways to loop through dataframes, but can't seem to figure out how to apply them to my specific situation.

I have a dataframe of about 2M rows, and for each row I need to calculate six statistics for each of 3 columns, so 18 values per row. The catch is that each row's statistics must come from a fresh random sample of the dataframe, so that the mean, median, etc. differ from row to row.

Here's what I have so far:

# iterrows() yields (index label, row); use the label so each row is updated,
# not only row 0 (r was never incremented in the first version)
for r, _ in imputed_df.iterrows():
    t = imputed_df.sample(n=10)  # fresh random sample of 10 rows
    for columnName in cols:
        imputed_df.loc[r, columnName + '_mean'] = t[columnName].mean()
        imputed_df.loc[r, columnName + '_var'] = t[columnName].var()
        imputed_df.loc[r, columnName + '_std'] = t[columnName].std()
        imputed_df.loc[r, columnName + '_skew'] = t[columnName].skew()
        imputed_df.loc[r, columnName + '_kurt'] = t[columnName].kurt()
        imputed_df.loc[r, columnName + '_med'] = t[columnName].median()

But this has been running for two days without finishing. I tried to take a subset of 2000 rows from the original dataframe and even that one has been running for hours.

Is there a better way to do this?

EDIT: Added a sample of the dataset. Each suffixed column should hold the value calculated from the 10-row sample.

   timestamp  activityID       w2        w3       w4
0      41.21         1.0 -1.34587   9.57245  2.83571
1      41.22         1.0 -1.76211  10.63590  2.59496
2      41.23         1.0 -2.45116  11.09340  2.23671
3      41.24         1.0 -2.42381  11.88590  1.77260
4      41.25         1.0 -2.31581  12.45170  1.50289

A double for loop on a large dataframe will take forever. Can you provide a sample of your data that the code above will run on? It will then be easier to suggest a more efficient way to do this.
Nathaniel, commented Oct 27, 2020 at 19:47

ansev · Accepted Answer · 2020-10-27 19:56:00Z

The problem is that the statistics are computed one column at a time in unnecessary loops. DataFrame.agg can compute all of them for every column in a single call; DataFrame.unstack then flattens the result into a Series, and Series.set_axis gives it the correct column names.

Setup

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (10, 100))).add_prefix('col')

new_serie = df.agg(['sum', 'mean', 
                    'var', 'std', 
                    'skew', 'kurt', 'median']).unstack()
new_df = pd.concat([df, new_serie.set_axis([f'{x}_{y}'
                                            for x, y in new_serie.index])
                                  .to_frame().T], axis=1)

# If new_df already exists:
# new_df.loc[0, :] = new_serie.set_axis([f'{x}_{y}' for x, y in new_serie.index])

   col0  col1  col2  col3  col4  col5  col6  col7  col8  col9  ...  \
0     8     7     6     7     6     5     8     7     8     4  ...   
1     8     1     8     7     0     8     8     4     6     1  ...   
2     5     6     3     5     4     9     3     0     2     5  ...   
3     3     3     3     3     5     4     5     1     3     5  ...   
4     7     9     4     5     6     7     0     3     4     6  ...   
5     0     5     2     0     8     0     3     7     6     5  ...   
6     7     0     1     4     8     9     4     9     2     9  ...   
7     0     6     1     0     6     1     3     0     3     4  ...   
8     3     6     1     8     3     0     7     6     8     6  ...   
9     2     5     8     5     8     4     9     1     9     9  ...   

   col98_skew  col98_kurt  col98_median  col99_sum  col99_mean  col99_var  \
0    0.456435   -0.939607           3.0       39.0         3.9   6.322222   
1         NaN         NaN           NaN        NaN         NaN        NaN   
2         NaN         NaN           NaN        NaN         NaN        NaN   
3         NaN         NaN           NaN        NaN         NaN        NaN   
4         NaN         NaN           NaN        NaN         NaN        NaN   
5         NaN         NaN           NaN        NaN         NaN        NaN   
6         NaN         NaN           NaN        NaN         NaN        NaN   
7         NaN         NaN           NaN        NaN         NaN        NaN   
8         NaN         NaN           NaN        NaN         NaN        NaN   
9         NaN         NaN           NaN        NaN         NaN        NaN   

   col99_std  col99_skew  col99_kurt  col99_median  
0   2.514403    0.402601    1.099343           4.0  
1        NaN         NaN         NaN           NaN  
2        NaN         NaN         NaN           NaN  
3        NaN         NaN         NaN           NaN  
4        NaN         NaN         NaN           NaN  
5        NaN         NaN         NaN           NaN  
6        NaN         NaN         NaN           NaN  
7        NaN         NaN         NaN           NaN  
8        NaN         NaN         NaN           NaN  
9        NaN         NaN         NaN           NaN
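
To make the agg / unstack / set_axis chain easier to follow, here is a minimal sketch on a tiny frame. The frame `small` and its values are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Hypothetical two-column frame, small enough to check by hand
small = pd.DataFrame({"a": [1, 2, 3], "b": [4, 6, 8]})

# agg returns a frame with one row per statistic and one column per input column
table = small.agg(["mean", "median"])

# unstack flattens it into a Series indexed by (column, statistic) pairs
stats = table.unstack()

# set_axis replaces the pairs with flat labels such as "a_mean"
flat = stats.set_axis([f"{col}_{stat}" for col, stat in stats.index])
print(flat)
```

The resulting Series has the labels a_mean, a_median, b_mean and b_median, which is the single row of statistics that the answer concatenates next to the original frame.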

edited Oct 27, 2020 at 19:56

answered Oct 27, 2020 at 19:48


4 Comments

seve Over a year ago

How would I then loop through the remainder? for i in imputed_df.iterrows(): ...

ansev Over a year ago

I don't know exactly what you are looking for. Try to provide the expected output. The main thing is to avoid using iterrows.

seve Over a year ago

Exactly what you provided was perfect; I just need to do it for all of the rows in the frame. So calculate those metrics for each row, then concat them to the original DF, rather than just the first series.

ansev Over a year ago

But you can do this for all rows and columns of the DataFrame.
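
The thread never shows per-row statistics from a separate random sample for each row, so here is one possible sketch of that, not the answerer's method. It uses NumPy indexing instead of iterrows. Several things in it are assumptions: the stand-in data generated with `rng.normal`, the sample size `k = 10`, and above all that drawing positions with replacement via `rng.integers` is acceptable (`DataFrame.sample(n=10)` draws without replacement; with about 2M rows, repeats within a sample of 10 should be rare). Skew and kurtosis are computed with the bias-corrected formulas so that they agree with pandas' `skew()` and `kurt()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ["w2", "w3", "w4"]
# Stand-in data (assumption); the real imputed_df would be used instead
imputed_df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=cols)

n_rows, k = len(imputed_df), 10
values = imputed_df[cols].to_numpy()

# One set of k random row positions per output row (drawn WITH replacement)
idx = rng.integers(0, n_rows, size=(n_rows, k))
samples = values[idx]  # shape: (n_rows, k, n_cols)

mean = samples.mean(axis=1)
var = samples.var(axis=1, ddof=1)  # ddof=1 matches pandas' default
std = np.sqrt(var)
med = np.median(samples, axis=1)

# Biased central moments, then the bias-corrected skew and excess kurtosis
dev = samples - mean[:, None, :]
m2 = (dev**2).mean(axis=1)
m3 = (dev**3).mean(axis=1)
m4 = (dev**4).mean(axis=1)
skew = np.sqrt(k * (k - 1)) / (k - 2) * m3 / m2**1.5
kurt = (k - 1) / ((k - 2) * (k - 3)) * ((k + 1) * m4 / m2**2 - 3 * (k - 1))

# Write each statistic back as <column>_<stat>, mirroring the question's names
stats = {"mean": mean, "var": var, "std": std,
         "skew": skew, "kurt": kurt, "med": med}
for name, arr in stats.items():
    for j, col in enumerate(cols):
        imputed_df[f"{col}_{name}"] = arr[:, j]
```

The `samples` array holds n_rows × k × 3 floats, which for 2M rows is roughly 480 MB, so on the full frame it may be necessary to process the rows in chunks and concatenate the results.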
