I have a DataFrame on which I want to do multiple column aggregations in pandas.

import pandas as pd
import numpy as np
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})

With this code, I get the mean for lat. I would also like to find the mean for long.

I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces

AttributeError: 'DataFrame' object has no attribute 'long'

If I just do avg_long, the code works as well.

df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})

In[2]: df2
Out[42]: 
                avg_long
ser_no CTRY_NM          
1      a            21.5
       b            23.0
2      a            26.0
       b            27.0
       e            24.5
3      b            28.5
       d            30.0

Is there a way to do this in one step or is this something I have to do separately and join back later?

2 Answers

2

I think the simpler approach is to use GroupBy.mean:

print(df.groupby(['ser_no', 'CTRY_NM']).mean())
                 lat  long
ser_no CTRY_NM            
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0

If you need to specify which columns to aggregate:

print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean'}))
                 lat  long
ser_no CTRY_NM            
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0

More info in docs.
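
If you also want the avg_lat / avg_long names from the question in the same step, one option is named aggregation. This is a minimal sketch that assumes pandas 0.25 or later (where the keyword form of agg is available); the output column names are taken from the question.

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

# Named aggregation: new_column_name=(source_column, aggregation_function).
# Columns that are not listed are simply left out of the result.
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(
    avg_lat=('lat', 'mean'),
    avg_long=('long', 'mean'),
)
print(df2)
```

Because only the listed columns are aggregated, this also avoids averaging columns you do not want to average.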

EDIT:

If you need to rename the columns, that is, flatten the MultiIndex in the columns, you can use a list comprehension:

import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                'date':pd.date_range(pd.to_datetime('2016-02-24'),
                                     pd.to_datetime('2016-02-28'), freq='10H')})

print(df)
  CTRY_NM                date  lat  long  ser_no
0       a 2016-02-24 00:00:00    1    21       1
1       a 2016-02-24 10:00:00    2    22       1
2       b 2016-02-24 20:00:00    3    23       1
3       e 2016-02-25 06:00:00    4    24       2
4       e 2016-02-25 16:00:00    5    25       2
5       a 2016-02-26 02:00:00    6    26       2
6       b 2016-02-26 12:00:00    7    27       2
7       b 2016-02-26 22:00:00    8    28       3
8       b 2016-02-27 08:00:00    9    29       3
9       d 2016-02-27 18:00:00   10    30       3              

df2 = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean', 'date': ['min', 'max', 'count']})
df2.columns = ['_'.join(col) for col in df2.columns]
print(df2)
                lat_mean            date_min            date_max  date_count  \
ser_no CTRY_NM                                                                 
1      a             1.5 2016-02-24 00:00:00 2016-02-24 10:00:00           2   
       b             3.0 2016-02-24 20:00:00 2016-02-24 20:00:00           1   
2      a             6.0 2016-02-26 02:00:00 2016-02-26 02:00:00           1   
       b             7.0 2016-02-26 12:00:00 2016-02-26 12:00:00           1   
       e             4.5 2016-02-25 06:00:00 2016-02-25 16:00:00           2   
3      b             8.5 2016-02-26 22:00:00 2016-02-27 08:00:00           2   
       d            10.0 2016-02-27 18:00:00 2016-02-27 18:00:00           1   

                long_mean  
ser_no CTRY_NM             
1      a             21.5  
       b             23.0  
2      a             26.0  
       b             27.0  
       e             24.5  
3      b             28.5  
       d             30.0  

2 Comments

I appreciate the answer but this may cause issues since in the real data set I have columns I don't want to mean. I just made a toy problem for here.
Well if you have more columns, simply exclude them by subsetting the dataframe.
1

You are getting the error because you first select the lat column of the grouped DataFrame and aggregate only that column. The result contains no long column, so it cannot be reached from there; you need to select both columns from the grouped DataFrame.
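
To see the cause concretely, here is a small sketch, assuming a recent pandas version. The names grouped and only_lat are only for illustration and are not part of the question's code.

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

grouped = df.groupby(['ser_no', 'CTRY_NM'])

# Selecting .lat narrows the groupby to a single column, so the
# aggregated result only knows about lat.
only_lat = grouped.lat.agg('mean')

# The long values were never carried along, so there is nothing to chain to.
print(type(only_lat).__name__)
print(hasattr(only_lat, 'long'))
```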

df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean)

would do the same operation for both columns. If you want the column names changed, you can rename the columns afterwards:

df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})

In [22]:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})
df2
Out[22]:
                avg_lat  avg_long
ser_no CTRY_NM                   
1      a            1.5      21.5
       b            3.0      23.0
2      a            6.0      26.0
       b            7.0      27.0
       e            4.5      24.5
3      b            8.5      28.5
       d           10.0      30.0

5 Comments

Can I string together a dataframe agg and series agg? That is, then do a separate agg on a date column all in one?
If I understood you correctly, yes. .agg accepts a dictionary, but it works differently from the way you tried to use it. Each key is a column and each value is a function that you want to apply to that column. df.groupby(['ser_no', 'CTRY_NM']).agg({"lat": np.mean, "long": np.mean, "date": np.max}) would take the averages of lat and long but return the maximum date for each group, for example.
You understood correctly, but what if for date I wanted date.agg({'start_dt': min, 'end_dt': max, 'number_of_dt': 'count'})? Would it accept multiple arguments, or is it limited to one?
Would it look like: df2 = df.groupby(['ser_no', 'CTRY_NM'])["lat", "long"].agg({'lat':np.mean, 'long':np.mean, 'date': {'start_dt': min, 'end_dt': max, 'number_of_dt': 'count'}}).rename(columns = {"lat": "avg_lat", "long": "avg_long"})?
You can specify several functions for the same column in a list: df.groupby(['ser_no', 'CTRY_NM']).agg({"lat": np.mean, "long": np.mean, "date": ['max', 'min', 'count']}). This would take the max, min, and count of the dates for each group. But again, as far as I know, you need to change the column names afterwards.
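
For the renaming discussed in these comments, here is a hedged sketch that uses named aggregation, assuming pandas 0.25 or later. The names start_dt, end_dt and number_of_dt come from the comment above; the date column is built here only to make the example self-contained.

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   # Ten timestamps, 10 hours apart, starting 2016-02-24.
                   'date': pd.date_range('2016-02-24', periods=10,
                                         freq=pd.Timedelta(hours=10))})

# One pass over the groups: means for lat/long, and three differently
# named aggregations of the same date column.
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(
    avg_lat=('lat', 'mean'),
    avg_long=('long', 'mean'),
    start_dt=('date', 'min'),
    end_dt=('date', 'max'),
    number_of_dt=('date', 'count'),
)
print(df2)
```

The result has flat column names, so no rename or MultiIndex flattening step is needed afterwards.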
