Python: doing multiple column aggregation in pandas

Question

I have a dataframe where I want to do multiple column aggregations in pandas.

import pandas as pd
import numpy as np
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})

With this code, I get the mean for lat. I would also like to find the mean for long.

I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces

AttributeError: 'DataFrame' object has no attribute 'long'

If I compute only avg_long, the code works as well.

df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})

In[2]: df2
Out[42]: 
                avg_long
ser_no CTRY_NM          
1      a            21.5
       b            23.0
2      a            26.0
       b            27.0
       e            24.5
3      b            28.5
       d            30.0

Is there a way to do this in one step or is this something I have to do separately and join back later?

jezrael · Accepted Answer · 2016-04-01 05:46:06Z

I think it is simpler to use GroupBy.mean:

print(df.groupby(['ser_no', 'CTRY_NM']).mean())
                 lat  long
ser_no CTRY_NM            
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0

If you need to define the columns to aggregate:

print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean'}))
                 lat  long
ser_no CTRY_NM            
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0

More info in docs.

EDIT:

If you need to rename the columns and remove the MultiIndex from the column labels, you can use a list comprehension:

import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                'date':pd.date_range(pd.to_datetime('2016-02-24'),
                                     pd.to_datetime('2016-02-28'), freq='10H')})

print(df)
  CTRY_NM                date  lat  long  ser_no
0       a 2016-02-24 00:00:00    1    21       1
1       a 2016-02-24 10:00:00    2    22       1
2       b 2016-02-24 20:00:00    3    23       1
3       e 2016-02-25 06:00:00    4    24       2
4       e 2016-02-25 16:00:00    5    25       2
5       a 2016-02-26 02:00:00    6    26       2
6       b 2016-02-26 12:00:00    7    27       2
7       b 2016-02-26 22:00:00    8    28       3
8       b 2016-02-27 08:00:00    9    29       3
9       d 2016-02-27 18:00:00   10    30       3              

df2 = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean', 'date': ['min', 'max', 'count']})
# flatten the two-level column labels, e.g. ('date', 'min') -> 'date_min'
df2.columns = ['_'.join(col) for col in df2.columns]

print(df2)
                lat_mean            date_min            date_max  date_count  \
ser_no CTRY_NM                                                                 
1      a             1.5 2016-02-24 00:00:00 2016-02-24 10:00:00           2   
       b             3.0 2016-02-24 20:00:00 2016-02-24 20:00:00           1   
2      a             6.0 2016-02-26 02:00:00 2016-02-26 02:00:00           1   
       b             7.0 2016-02-26 12:00:00 2016-02-26 12:00:00           1   
       e             4.5 2016-02-25 06:00:00 2016-02-25 16:00:00           2   
3      b             8.5 2016-02-26 22:00:00 2016-02-27 08:00:00           2   
       d            10.0 2016-02-27 18:00:00 2016-02-27 18:00:00           1   

                long_mean  
ser_no CTRY_NM             
1      a             21.5  
       b             23.0  
2      a             26.0  
       b             27.0  
       e             24.5  
3      b             28.5  
       d             30.0

I appreciate the answer, but this may cause issues since in the real data set I have columns I don't want to average. I just made a toy problem for here.
Well, if you have more columns, simply exclude them by subsetting the dataframe.
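
A minimal sketch of the subsetting suggested in the comment above, assuming a hypothetical extra column named note that should not be averaged. Selecting the wanted columns with a list on the groupby object restricts the aggregation to them:

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   # hypothetical column that should be left out of the mean
                   'note': list('abcdefghij')})

# Select only the columns to aggregate, then take the mean per group
subset_means = df.groupby(['ser_no', 'CTRY_NM'])[['lat', 'long']].mean()
print(subset_means)
```

The note column never reaches the aggregation, so the result has only lat and long.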

score 1 · Accepted Answer · 2016-03-31 17:59:50Z

You are getting the error because you first select the lat column of the dataframe and aggregate only that column. The long column cannot be reached from that result; you need to work on the dataframe.

df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean)

would do the same operation for both columns. If you want the column names changed, you can rename the columns afterwards:

df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns = {"lat": "avg_lat", "long": "avg_long"})

In [22]:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns = {"lat": "avg_lat", "long": "avg_long"})
df2
Out[22]:
                    avg_lat avg_long
ser_no  CTRY_NM     
1       a           1.5     21.5
        b           3.0     23.0
2       a           6.0     26.0
        b           7.0     27.0
        e           4.5     24.5
3       b           8.5     28.5
        d           10.0    30.0
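
The failure explained at the top of this answer can be reproduced in a few lines. This is a sketch against current pandas, where aggregating the selected lat column returns a Series (in the question it was a DataFrame, per the error message); either way the result holds only lat values, so there is no long column left to select:

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

# Aggregate only the lat column, as the first half of the chained call does
lat_means = df.groupby(['ser_no', 'CTRY_NM'])['lat'].mean()

# The result is indexed by the group keys and carries only lat values
print(type(lat_means).__name__)
print(hasattr(lat_means, 'long'))
```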

edited Mar 31, 2016 at 17:59

answered Mar 31, 2016 at 17:46

user2285236

5 Comments

dustin Over a year ago

Can I string together a dataframe agg and series agg? That is, then do a separate agg on a date column all in one?

user2285236 Over a year ago

If I understood you correctly, yes. .agg accepts a dictionary, but it works differently from the way you tried to use it. Each key is a column and each value is a function that you want to apply to that column. For example, df.groupby(['ser_no', 'CTRY_NM']).agg({"lat": np.mean, "long": np.mean, "date": np.max}) would take the averages of lat and long but return the maximum date for each group.

dustin Over a year ago

You understood correctly, but what if for date I wanted date.agg({ 'start_dt': min, 'end_dt': max, 'number_of_dt': 'count'})? Would that take multiple arguments, or is it limited to one?

dustin Over a year ago

Would it look like:

df2 = df.groupby(['ser_no', 'CTRY_NM'])["lat", "long"].agg({'lat':np.mean, 'long':np.mean, 'date': {'start_dt': min, 'end_dt': max, 'number_of_dt': 'count'}}).rename(columns = {"lat": "avg_lat", "long": "avg_long"})

?

user2285236 Over a year ago

You can specify different functions for the same column in a list: df.groupby(['ser_no', 'CTRY_NM']).agg({"lat": np.mean, "long": np.mean, "date": ['max', 'min', 'count']}). This would take the max, min, and count of the dates for each group. But again, as far as I know, you need to change the column names afterwards.
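
For readers on pandas 0.25 or later, named aggregation lets a single agg call pick the column, the function, and the output name, so no renaming step is needed. A minimal sketch on the toy data from this thread; the output names start_dt, end_dt and number_of_dt are taken from the comment above, and this feature was not available in the pandas releases current when the thread was written:

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   # ten timestamps, 10 hours apart, starting 2016-02-24
                   'date': pd.date_range('2016-02-24', periods=10,
                                         freq=pd.Timedelta(hours=10))})

# Each keyword is an output column name; each value is (input column, function)
summary = df.groupby(['ser_no', 'CTRY_NM']).agg(
    avg_lat=('lat', 'mean'),
    avg_long=('long', 'mean'),
    start_dt=('date', 'min'),
    end_dt=('date', 'max'),
    number_of_dt=('date', 'count'),
)
print(summary)
```

The result has a flat set of column labels, so the join-based flattening step is not required with this form.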
