28

Is there a way to apply a list of functions to each column in a DataFrame like the DataFrameGroupBy.agg function does? I found an ugly way to do it like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(one=np.random.uniform(0, 10, 100), two=np.random.uniform(0, 10, 100)))
df.groupby(np.ones(len(df))).agg(['mean', 'std'])

        one                 two
       mean       std      mean       std
1  4.802849  2.729528  5.487576  2.890371

4 Answers

33

For Pandas 0.20.0 or newer, use df.agg (thanks to ayhan for pointing this out):

In [11]: df.agg(['mean', 'std'])
Out[11]: 
           one       two
mean  5.147471  4.964100
std   2.971106  2.753578

For older versions, you could use

In [61]: df.groupby(lambda idx: 0).agg(['mean','std'])
Out[61]: 
        one               two          
       mean       std    mean       std
0  5.147471  2.971106  4.9641  2.753578

Another way would be:

In [68]: pd.DataFrame({col: [getattr(df[col], func)() for func in ('mean', 'std')] for col in df}, index=('mean', 'std'))
Out[68]: 
           one       two
mean  5.147471  4.964100
std   2.971106  2.753578

3 Comments

agg is now available as a DataFrame method so this works without the trick too: df.agg(['mean', 'std']).
I have noticed that using agg is a lot slower than calling the reductions directly on the DataFrame, i.e. df.sum() and df.mean() instead of df.agg(['sum', 'mean']). Is there a reason for that, or am I doing something wrong?
@saias: It might be worth asking this as a new question. My guess is that df.agg(['sum','mean']) ultimately calls pandas.core.base.SelectionMixin._aggregate which handles many different cases for input and output. All that extra case handling slows down the performance of df.agg. In this case, you can bypass a lot of that code by building the desired DataFrame yourself with something like pd.DataFrame({'sum':df.sum(), 'mean':df.mean()}).T.
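To illustrate the workaround from the comment above, here is a minimal sketch that builds the same table both ways and checks that they agree. The seeded sample data and the variable names via_agg and via_reductions are made up for the example; any speed difference would still have to be measured on your own data, for instance with timeit.

```python
import numpy as np
import pandas as pd

# Illustrative data only: two columns of uniform random numbers.
rng = np.random.default_rng(0)
df = pd.DataFrame({"one": rng.uniform(0, 10, 100), "two": rng.uniform(0, 10, 100)})

# Generic route: one call, many code paths inside pandas.
via_agg = df.agg(["sum", "mean"])

# Direct route: call each reduction yourself, then assemble the result.
# Each reduction returns a Series indexed by column name, so the dict
# produces one column per function; .T puts the functions on the rows.
via_reductions = pd.DataFrame({"sum": df.sum(), "mean": df.mean()}).T

print(via_agg)
print(via_reductions)
```

Both frames have the function names as the index and the original columns as columns, so one can stand in for the other.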
17

In the general case where you have arbitrary functions and column names, you could do this:

df.apply(lambda r: pd.Series({'mean': r.mean(), 'std': r.std()})).transpose()

         mean       std
one  5.366303  2.612738
two  4.858691  2.986567
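The same idea can be wrapped in a small helper so that the function list is not hard-coded. This is only a sketch: summarize is a hypothetical name, not a pandas function, and the tiny DataFrame is invented so that the numbers are easy to check by hand.

```python
import pandas as pd


def summarize(df, funcs):
    """Hypothetical helper: apply every function in `funcs` to every column.

    `funcs` maps a result label to a callable that takes a Series and
    returns a scalar. The result has one row per column of `df` and one
    column per label.
    """
    return df.apply(
        lambda col: pd.Series({name: func(col) for name, func in funcs.items()})
    ).T


df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0], "two": [10.0, 20.0, 30.0, 50.0]})

summary = summarize(
    df,
    {
        "mean": pd.Series.mean,
        "std": pd.Series.std,
        "range": lambda s: s.max() - s.min(),  # arbitrary custom function
    },
)
print(summary)
```

Because the labels come from the dict keys, a lambda gets a readable column name instead of "<lambda>".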

2

I applied three functions to a single column by chaining them, and it works:

import re

# replace newline characters with spaces, then strip surrounding whitespace
rem_newline = lambda x : re.sub('\n',' ',x).strip()

# lowercase the text and strip surrounding whitespace
lower_strip = lambda x : x.lower().strip()

df = df['users_name'].apply(lower_strip).apply(rem_newline).str.split('(',n=1,expand=True)
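For string columns, the same chain can probably be written with the vectorized .str accessor instead of element-wise apply. The sketch below compares the two on invented sample values (the two names in users_name are made up) and assumes every value is a string containing a "(".

```python
import re

import pandas as pd

# Invented sample data; the real 'users_name' column is not shown in the answer.
df = pd.DataFrame({"users_name": ["  Alice\nSmith (admin) ", "BOB (guest)"]})

rem_newline = lambda x: re.sub("\n", " ", x).strip()
lower_strip = lambda x: x.lower().strip()

# Chain from the answer: two element-wise applies, then a split.
via_apply = (
    df["users_name"]
    .apply(lower_strip)
    .apply(rem_newline)
    .str.split("(", n=1, expand=True)
)

# Same steps expressed with .str methods only.
via_str = (
    df["users_name"]
    .str.lower()
    .str.strip()
    .str.replace("\n", " ", regex=False)
    .str.strip()
    .str.split("(", n=1, expand=True)
)

print(via_apply)
print(via_str)
```

Note that the split keeps the trailing space before "(" and the closing ")" in the second part, in both variants.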

1

I am using pandas to analyze Chilean legislation drafts. In my dataframe, the list of authors is stored as a string. The answer above did not work for me (using pandas 0.20.3), so I used my own logic and came up with this:

df.authors.apply(eval).apply(len).sum()

Chained applies: a pipeline! The first apply transforms

"['Barros Montero: Ramón', 'Bellolio Avaria: Jaime', 'Gahona Salazar: Sergio']"

into the corresponding list, and the second apply counts the lawmakers involved in each project. The final sum gives the total number of (lawmaker, project number) pairs, which I need so that I can presize an array for studying which parties work on what.

Interestingly, this works! Even more interestingly, that last call fails if one gets too ambitious and does this instead:

df.authors.apply(eval).apply(len).apply(sum)

with an error:

TypeError: 'int' object is not iterable

coming from deep within /site-packages/pandas/core/series.py in apply
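The error makes sense once you remember that Series.apply calls the function on each element: after apply(len) every element is an int, so sum is handed a single int rather than the whole Series. Here is a small sketch of the working pipeline on made-up rows (the second row is invented). It also swaps eval for ast.literal_eval, which is my substitution rather than part of the original answer; it parses Python literals only, which is safer when the strings come from outside.

```python
import ast

import pandas as pd

# Made-up rows in the same shape as the stored strings.
df = pd.DataFrame(
    {
        "authors": [
            "['Barros Montero: Ramón', 'Bellolio Avaria: Jaime', 'Gahona Salazar: Sergio']",
            "['Example Author: Only']",
        ]
    }
)

# Step 1: string -> list. Step 2: list -> number of authors per project.
counts = df["authors"].apply(ast.literal_eval).apply(len)

# Series.sum reduces the whole Series to one number.
total = counts.sum()
print(total)

# Series.apply(sum) instead calls sum() on each int, which fails as described above.
try:
    counts.apply(sum)
except TypeError as exc:
    print("apply(sum) failed:", exc)
```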
