3

Suppose I have the following dataframe :

import pandas as pd
df = pd.DataFrame(
    {
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        'sales': [pd.np.random.randint(100000, 999999) for _ in range(12)]
    }
)

Here it is :

    office_id   sales state
0           1  903325    CA
1           2  364594    WA
2           3  737728    CO
3           4  239378    AZ
4           5  833003    CA
5           6  501536    WA
6           1  920821    CO
7           2  879602    AZ
8           3  661818    CA
9           4  548888    WA
10          5  842459    CO
11          6  906791    AZ

Now I do a groupby operation on office_id and states :

df.groupby(["office_id", "state"]).aggregate({"sales": "sum"})

This lead to :

                  sales
office_id state
1         CA     903325
          CO     920821
2         AZ     879602
          WA     364594
3         CA     661818
          CO     737728
4         AZ     239378
          WA     548888
5         CA     833003
          CO     842459
6         AZ     906791
          WA     501536

Is it possible to add a row, for each office_id, with a new index total for example which is the sum over each state of the sales column ?

I can compute it by grouping by "office_id" and sum but I obtain a new DataFrame and I do not succeed in merging it.

3 Answers 3

3

Pandas has built-in functionality to do this with pivot_table by setting the margins parameter to True. And it only sorts correctly because 'total' is lowercase and uppercase comes first.

df.pivot_table(index='office_id', columns='state', margins=True,
               margins_name='total', aggfunc='sum').stack()

                     sales
office_id state           
1         CA      415727.0
          CO      240142.0
          total   655869.0
2         AZ      126350.0
          WA      385698.0
          total   512048.0
3         CA      387320.0
          CO      487075.0
          total   874395.0
4         AZ      978018.0
          WA      878368.0
          total  1856386.0
5         CA      105057.0
          CO      852025.0
          total   957082.0
6         AZ      130853.0
          WA      435940.0
          total   566793.0
total     AZ     1235221.0
          CA      908104.0
          CO     1579242.0
          WA     1700006.0
          total  5422573.0
Sign up to request clarification or add additional context in comments.

Comments

2

You can reshape by Series.unstack, add new column total and then reshape back by DataFrame.stack, if need MultiIndex use Series.to_frame:

df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
print (df1)
                     sales
office_id state           
1         CA      505047.0
          CO      724412.0
          total  1229459.0
2         AZ      402775.0
          WA      339803.0
          total   742578.0
3         CA      343655.0
          CO      833474.0
          total  1177129.0
4         AZ      574130.0
          WA      656577.0
          total  1230707.0
5         CA      122260.0
          CO      207717.0
          total   329977.0
6         AZ      262568.0
          WA      504491.0
          total   767059.0

df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
df1['total'] = df1.sum(axis=1)
df1 = df1.stack().to_frame('sales')
#cast if sales are always integers
df1.sales = df1.sales.astype(int)
print (df1)
                   sales
office_id state         
1         CA      323107
          CO      658336
          total   981443
2         AZ      273728
          WA      942249
          total  1215977
3         CA      773390
          CO      692275
          total  1465665
4         AZ      669435
          WA      735141
          total  1404576
5         CA      530182
          CO      232104
          total   762286
6         AZ      532248
          WA      951481
          total  1483729

Timings:

def jez(df):
    df1 = df.groupby(["office_id", "state"])['sales'].sum().unstack()
    df1['total'] = df1.sum(axis=1)
    df1 = df1.stack().to_frame('sales')
    df1.sales = df1.sales
    return (df1)

print (jez(df))

In [339]: %timeit (df.pivot_table(index='office_id', columns='state', margins=True, margins_name='total', aggfunc='sum').stack())
100 loops, best of 3: 14.6 ms per loop

In [340]: %timeit (jez(df))
100 loops, best of 3: 2.78 ms per loop

Comments

0

you can also use concat to append the aggregated totals as follows.

pd.concat([df.groupby(["office_id", "state"]).aggregate({"sales": "sum"}),
           df.groupby(["state"]).aggregate({"sales": "sum"})
            .set_index([['Total', 'Total', 'Total', 'Total']], append=True).swaplevel(0, 1)])

which returns

                   sales
office_id state         
1         CA      914776
          CO      902173
2         AZ      605783
          WA      865189
3         CA      280203
          CO      556867
4         AZ      958747
          WA      643333
5         CA      703606
          CO      644399
6         AZ      768268
          WA      834051
Total     AZ     2332798
          CA     1898585
          CO     2103439
          WA     2342573

Here, the Data.frame is aggregated at the office-state and state levels. These are concatenated with .concat. The DataFrame aggregated to the state level must be given an additional index prior to concatnating. This is done with set_index. In addition, the indices must be swapped to conform with the office-state level DataFrame.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.