Pandas: return a Dataframe with multiple aggregate values conditioned on another value [closed]

Question

Closed. This question needs details or clarity. It is not currently accepting answers.

Want to improve this question? As written, this question is lacking some of the information it needs to be answered. If the author adds details in comments, consider editing them into the question. Once there's sufficient detail to answer, vote to reopen the question.

Closed 9 years ago.

Improve this question

I'm trying to do something well beyond my Pandas level and have spent far too much time getting this wrong. In this example I need to return individual Dataframes for each of the teams. The dataframes would show the mean cost, mean area, and sum of size, for each grade.

Because I need to produce separate tables, I probably need to pass single team names into a function over and over. To be clear, I'm happy to pass the team names into a function (or similar) manually to produce each table.

    team      grade   cost   area   size
0   man utd   1       52300  5      1045
1   chelsea   3       52000  42     957
2   arsenal   2       25000  20     1099
3   man utd   1       61600  20     1400
4   man utd   2       43000  43     1592
5   arsenal   2       23400  78     1006
6   man utd   2       52300  89     987
7   chelsea   4       62000  30     849
8   arsenal   1       62000  46     973
9   arsenal   2       73000  78     1005

The man utd dataframe would look like this for example:

grade  mean_cost    mean_area   size
1      56590        12.5        2445
2      47650        66          2579

And...? What is your question? Surely you have code, but you're stuck somewhere, so give us something specific to help you with. — Henry Keiter
– Henry Keiter, Commented Apr 18, 2016 at 21:03

unutbu · Accepted Answer · 2016-04-18 22:10:13Z

1

Use groupby/agg to group by both the team and grade, and the aggregate the cost, area and size columns. Note that agg can accept a dict whose keys are column names and whose values are aggregation functions (such as mean or sum). Thus you can specify aggregation functions on a per-column basis.

In [120]: df.groupby(['team', 'grade']).agg({'cost':'mean', 'area':'mean', 'size':'sum'}).rename(columns={'cost':'mean_cost', 'area':'mean_area'})
Out[120]: 
               size     mean_cost  mean_area
team    grade                               
arsenal 1       973  62000.000000  46.000000
        2      3110  40466.666667  58.666667
chelsea 3       957  52000.000000  42.000000
        4       849  62000.000000  30.000000
man utd 1      2445  56950.000000  12.500000
        2      2579  47650.000000  66.000000

groupby returns an iterable. Therefore, to make a dict mapping team names to DataFrames you could use:

dfs = {team:grp for team, grp in result.reset_index().groupby('team')}

For example,

import pandas as pd

df = pd.DataFrame(
    {'area': [5, 42, 20, 20, 43, 78, 89, 30, 46, 78],
     'cost': [52300, 52000, 25000, 61600, 43000, 23400, 52300, 62000, 62000, 73000], 
     'grade': [1, 3, 2, 1, 2, 2, 2, 4, 1, 2], 'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005], 
     'team': ['man utd', 'chelsea', 'arsenal', 'man utd', 'man utd', 'arsenal', 'man utd', 'chelsea', 'arsenal', 'arsenal']})

result =  df.groupby(['team', 'grade']).agg({'cost':'mean', 'area':'mean', 'size':'sum'}).rename(columns={'cost':'mean_cost', 'area':'mean_area'})

dfs = {team:grp.drop('team', axis=1) 
       for team, grp in result.reset_index().groupby('team')}

for team, grp in dfs.items():
    print('{}:\n{}\n'.format(team, grp))

yields

chelsea:
   grade  mean_cost  mean_area  size
2      3      52000         42   957
3      4      62000         30   849

arsenal:
   grade     mean_cost  mean_area  size
0      1  62000.000000  46.000000   973
1      2  40466.666667  58.666667  3110

man utd:
   grade  mean_cost  mean_area  size
4      1      56950       12.5  2445
5      2      47650       66.0  2579

Beware that for better performance try to avoid breaking up DataFrames into smaller DataFrames, because once you use a dict or a list, you are forced to use Python loops instead of the faster implicit C-compiled loops used by Pandas/NumPy methods.

So for computation try to stick with the result DataFrame. Use the dfs dict only if you have to do something like print the DataFrames separately.

edited Apr 18, 2016 at 22:10

answered Apr 18, 2016 at 21:04

unutbu

886k197 gold badges1.9k silver badges1.7k bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

RDJ Over a year ago

Most insightful and informative answer I've ever had on Stack Overflow. Thanks.

RDJ Over a year ago

Is there also a way to print the dataframes individually in the pandas table style. Ideally need to screenshot them for presentation purposes.

unutbu Over a year ago

To help build Stackoverflow's stable of good questions paired with good answers, please ask this as a new question.

RDJ Over a year ago

Agreed: question is here

Collectives™ on Stack Overflow

Pandas: return a Dataframe with multiple aggregate values conditioned on another value [closed]

1 Answer 1

4 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Linked

Related