returning aggregated dataframe from pandas groupby

Question

I'm trying to wrap my head around Pandas groupby methods. I'd like to write a function that performs some aggregations and then returns a Pandas DataFrame. Here's a grossly simplified example using sum(). I know there are easier ways to do simple sums, but in real life my function is more complex:

import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'], 'col2':[1.0, 2, 3, 4]})

In [3]: df
Out[3]: 
  col1  col2
0    A     1
1    A     2
2    B     3
3    B     4

def func2(df):
    dfout = pd.DataFrame({ 'col1' : df['col1'].unique() ,
                           'someData': sum(df['col2']) })
    return  dfout

t = df.groupby('col1').apply(func2)

In [6]: t
Out[6]: 
       col1  someData
col1                 
A    0    A         3
B    0    B         7

I did not expect to have col1 in there twice nor did I expect that mystery index looking thing. I really thought I would just get col1 & someData.

In my real life application I'm grouping by more than one column and really would like to get back a DataFrame and not a Series object.
Any ideas for a solution, or an explanation of what Pandas is doing in my example above?

----- added info -----

I should have started with this example, I think:

In [13]: import pandas as pd

In [14]: df = pd.DataFrame({'col1':['A','A','A','B','B','B'], 'col2':['C','D','D','D','C','C'], 'col3':[.1,.2,.4,.6,.8,1]})

In [15]: df
Out[15]: 
  col1 col2  col3
0    A    C   0.1
1    A    D   0.2
2    A    D   0.4
3    B    D   0.6
4    B    C   0.8
5    B    C   1.0

In [16]: def func3(df):
   ....:         dfout =  sum(df['col3']**2)
   ....:         return  dfout
   ....: 

In [17]: t = df.groupby(['col1', 'col2']).apply(func3)

In [18]: t
Out[18]: 
col1  col2
A     C       0.01
      D       0.20
B     C       1.64
      D       0.36

In the above illustration the result of the apply() function is a Pandas Series. And it lacks the groupby columns from the df.groupby. The essence of what I'm struggling with is how do I create a function which I apply to a groupby which returns both the result of the function AND the columns on which it was grouped?

----- yet another update ------

It appears that if I then do this:

 pd.DataFrame(t).reset_index()

I get back a dataframe which is really close to what I was after.
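For reference, a minimal sketch of that approach on the second example. It assumes a reasonably recent pandas; the helper name sum_of_squares and the output column name sumSq are made up for illustration. Selecting 'col3' before apply means the function only ever sees the values being aggregated, and Series.reset_index(name=...) turns the MultiIndex back into ordinary columns in one step:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col2': ['C', 'D', 'D', 'D', 'C', 'C'],
    'col3': [.1, .2, .4, .6, .8, 1],
})

def sum_of_squares(values):
    # values is the 'col3' Series for one (col1, col2) group
    return (values ** 2).sum()

# The result is a Series indexed by a (col1, col2) MultiIndex
t = df.groupby(['col1', 'col2'])['col3'].apply(sum_of_squares)

# Move the group keys back into columns and name the value column
out = t.reset_index(name='sumSq')
print(out)
```

The result has one row per (col1, col2) pair, with the grouping keys as regular columns next to the computed value.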

btw, this tutorial by one of the pandas programmers helped me understand the groupby and aggregation mechanics of pandas: youtube.com/watch?v=MxRMXhjXZos
– Zelazny7, Commented Feb 21, 2013 at 14:28
In the example you've appended, what's the purpose of the groupby (it'll just find dupes)? You can just do an apply to df itself and add that as a column: df['func3'] = df.apply(lambda row: row['col2'] ** 2, axis=1).
– Andy Hayden, Commented Feb 21, 2013 at 15:19
The data is a bit too simple for the example, I'm afraid. I'll update the example.
– JD Long, Commented Feb 21, 2013 at 15:23
I can't see an example where it makes sense to groupby all columns and apply, rather than just apply (DataFrame's apply can be very non-trivial and save to multiple columns). (Also, you don't need to create a dfout return variable; you can just return the calculation, e.g. return df['col3']**2 :) )
– Andy Hayden, Commented Feb 21, 2013 at 15:35
example updated... and now it works! Geesh. It appears that when the apply is on every row it does not return the keys, but if the apply results in aggregation it does return the keys.
– JD Long, Commented Feb 21, 2013 at 15:39

Andy Hayden · Accepted Answer · 2013-02-21 14:27:46Z

8

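The reason you are seeing the extra index level of 0s is that the output of .unique() is an array, so func2 builds a one-row DataFrame (with its own index 0) for each group, and apply stacks those under the group labels.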

The best way to understand how your apply is going to work is to inspect each action group-wise:

In [11]: g = df.groupby('col1')

In [12]: g.get_group('A')
Out[12]: 
  col1  col2
0    A     1
1    A     2

In [13]: g.get_group('A')['col1'].unique()
Out[13]: array([A], dtype=object)

In [14]: sum(g.get_group('A')['col2'])
Out[14]: 3.0

The majority of the time you want this to be an aggregated value.

The output of grouped.apply will always have the group labels as an index (the unique values of 'col1'), so your example construction of col1 seems a little obtuse to me.
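As a rough sketch of that behaviour with a custom function (assuming a reasonably recent pandas; the function name summarize and the 'spread' column are invented for illustration), returning a Series of named values from the applied function gives one row per group, with the group labels as the index and no need to rebuild 'col1' by hand:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'], 'col2': [1.0, 2, 3, 4]})

def summarize(group):
    # group is the sub-DataFrame for one value of 'col1'
    # (only 'col2' is passed in, because of the column selection below)
    return pd.Series({
        'someData': group['col2'].sum(),
        'spread': group['col2'].max() - group['col2'].min(),
    })

# Each returned Series becomes one row; the index holds the group labels
result = df.groupby('col1')[['col2']].apply(summarize)
print(result)
```

The keys of the returned Series become the column names, so several aggregated values can come back from a single function.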

Note: To move 'col1' (the index) back to a column you can call reset_index. In this case:

In [15]: g.sum().reset_index()
Out[15]: 
  col1  col2
0    A     3
1    B     7
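Another option, sketched here on the assumption that pandas 0.25 or later is available (named aggregation was added in that release), is to pass as_index=False to groupby so the keys stay as columns from the start. The output name someData is taken from the question; everything else is the built-in API:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'], 'col2': [1.0, 2, 3, 4]})

# as_index=False keeps 'col1' as a regular column in the result;
# someData=('col2', 'sum') means "sum col2 and call the result someData"
out = df.groupby('col1', as_index=False).agg(someData=('col2', 'sum'))
print(out)
```

This only covers functions that reduce each column to a single value. For a function that needs several columns at once, apply followed by reset_index is still the more general route.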

4 Comments

JD Long Over a year ago

for an arbitrary function which I am applying, the groupby seems to drop my grouping columns from the result and returns only a Series of answers. Clearly using the sum() method gets around that, but it's not helpful for custom functions which are not implemented as groupby methods. I added an example to my question to illustrate better.

Andy Hayden Over a year ago

@JDLong are you grouping on every column? (To me, this seems a strange thing to do, but I agree the output is a little weird: not having the MultiIndex of the columns.) :s

JD Long Over a year ago

nope, in real life I might group on 3 columns and then have 10 more which I do calculations on. But when I output I want to keep the groupby keys in the result.

Andy Hayden Over a year ago

@JDLong I see, that's strange! I thought reset_index works in that case. (Could you give an example where it doesn't?)
