Input code:

import pandas as pd
import numpy as np

# Dummy df:
df = pd.DataFrame({
    'Name': ['John', 'Boby', 'Mina', 'Peter',
             'Nicky', 'Peter', 'Mina', 'Peter'],
    'City': ['London', 'NY', 'LA', 'London', 'NY', 'HK', 'NY', 'HK'],
    'Stage': ['Masters', 'Graduate', 'Graduate', 'Masters',
              'Graduate', 'Masters', 'Graduate', 'Graduate'],
    'Year': [2020, 2019, 2020, 2019, 2020, 2019, 2020, 2020],
    'Month': [202001, 201902, 202003, 201904, 202005, 201902, 202007, 202012],
    'Earnings': [27, 23, 21, 66, 24, 22, 34, 65]})

df_pivot = pd.pivot_table(df, values='Earnings',
                          index=['Name', 'City', 'Stage'],
                          columns=['Year', 'Month'], aggfunc=np.sum,
                          fill_value=0, margins=True).sort_values('All', ascending=False)
print(df_pivot)

Output pivot table:

Year                    2019          2020                              All
Month                 201902 201904 202001 202003 202005 202007 202012     
Name  City   Stage                                                         
All                       45     66     27     21     24     34     65  282
Peter London Masters       0     66      0      0      0      0      0   66
      HK     Graduate      0      0      0      0      0      0     65   65
Mina  NY     Graduate      0      0      0      0      0     34      0   34
John  London Masters       0      0     27      0      0      0      0   27
Nicky NY     Graduate      0      0      0      0     24      0      0   24
Boby  NY     Graduate     23      0      0      0      0      0      0   23
Peter HK     Masters      22      0      0      0      0      0      0   22
Mina  LA     Graduate      0      0      0     21      0      0      0   21

Desired output, sorted first by the first column, then within each group by the second column, and finally within each group by the third column:

Year                    2019          2020                              All
Month                 201902 201904 202001 202003 202005 202007 202012     
Name  City   Stage                                                         
All                       45     66     27     21     24     34     65  282
Peter HK     Graduate      0      0      0      0      0      0     65   65
             Masters      22      0      0      0      0      0      0   22
      London Masters       0     66      0      0      0      0      0   66
Mina  NY     Graduate      0      0      0      0      0     34      0   34
      LA     Graduate      0      0      0     21      0      0      0   21
John  London Masters       0      0     27      0      0      0      0   27
Nicky NY     Graduate      0      0      0      0     24      0      0   24
Boby  NY     Graduate     23      0      0      0      0      0      0   23

Please note that Peter-HK ranks above Peter-London because the total for Peter-HK (65 + 22 = 87) is greater than the total for Peter-London (66).

In other words: first give me the Name with the biggest total, then within that Name the City with the biggest total, then within that Name and that City the Stage with the biggest total.
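To make the rule concrete, the group totals behind the desired order can be checked directly on the dummy data. This is only a minimal sketch of the arithmetic, not a solution; the variable names `name_totals` and `city_totals` are illustrative and do not appear in the original code.

```python
import pandas as pd

# Same dummy data as above (Year and Month are not needed for the totals).
df = pd.DataFrame({
    'Name': ['John', 'Boby', 'Mina', 'Peter', 'Nicky', 'Peter', 'Mina', 'Peter'],
    'City': ['London', 'NY', 'LA', 'London', 'NY', 'HK', 'NY', 'HK'],
    'Stage': ['Masters', 'Graduate', 'Graduate', 'Masters',
              'Graduate', 'Masters', 'Graduate', 'Graduate'],
    'Earnings': [27, 23, 21, 66, 24, 22, 34, 65],
})

# Total per Name, largest first: this is the order of the outer groups.
name_totals = df.groupby('Name')['Earnings'].sum().sort_values(ascending=False)

# Total per (Name, City): this decides the order of cities inside a name.
city_totals = df.groupby(['Name', 'City'])['Earnings'].sum()

print(name_totals)
print(city_totals.loc['Peter'].sort_values(ascending=False))
```

Peter comes first with 153 in total, and within Peter the HK group (87) comes before London (66).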

Thank you pawel

  • Not certain what the final result should look like. Have you tried sorting again after sorting by 'All'? Like this: df_pivot.sort_values('All', ascending=False).sort_index() Commented Feb 24, 2021 at 15:17
  • Hi Pawel, could you show what the output should look like? Commented Feb 24, 2021 at 15:24
  • Hello, and thank you for the quick response! The end result is attached as a screenshot from Excel. In short, I want to sort the first column by "All", then the second column by "All", and then the third column by "All". That means "Peter" ends up on top because his All is (60+23); then, for Peter, I want HK first in the City column because its value is 60, and then London with value 23. Does that make sense? Can you look at the attached screenshot? I am unable to paste text, no idea why. Thank you! Commented Feb 24, 2021 at 15:25
  • Chain a sort_index? df_pivot.sort_values(by="All", ascending=False).sort_index() ? Commented Feb 24, 2021 at 15:27
  • I have updated the post; the expected result is at the bottom. Commented Feb 24, 2021 at 15:30

2 Answers

Edit after understanding the question even better.

You want to sort on the maximal score obtained by a person (defined by Name), and then, within that person, on the individual scores obtained by that person.

In your example, I can get the list with the desired sequence of Name in this way:

import pandas as pd
import numpy as np

# Dummy df:
df = pd.DataFrame({
    'Name': ['John', 'Boby', 'Mina', 'Peter',
             'Nicky', 'Peter', 'Mina', 'Peter'],
    'City': ['London', 'NY', 'LA', 'London', 'NY', 'HK', 'NY', 'HK'],
    'Stage': ['Masters', 'Graduate', 'Graduate', 'Masters',
              'Graduate', 'Masters', 'Graduate', 'Graduate'],
    'Year': [2020, 2019, 2020, 2019, 2020, 2019, 2020, 2020],
    'Month': [202001, 201902, 202003, 201904, 202005, 201902, 202007, 202012],
    'Earnings': [27, 23, 21, 23, 24, 22, 34, 65]})

# Make the pivot table
df_pivot = pd.pivot_table(df, values='Earnings',
                          index=['Name', 'City', 'Stage'],
                          columns=['Year', 'Month'], aggfunc=np.sum,
                          fill_value=0, margins=True).sort_values('All', ascending=False)
print('Original table')
print(df_pivot)

def sort_groups(df, group_by_col, sort_by_col, F_asc):
    """Sort a dataframe by a certain level of the MultiIndex

    Args:
        df (pd.DataFrame): Dataframe to sort
        group_by_col (str): name of the index level to sort by
        sort_by_col (str): name of the value column to sort by
        F_asc (bool): Ascending sort - True/False

    Returns:
        pd.DataFrame: Dataframe sorted on given multiindex level
    """

    # Make a list of the desired index sequence based on the max value found in each group
    ind = df.groupby(by=group_by_col).max().sort_values(sort_by_col, ascending=F_asc).index.to_list()

    # Return re-indexed dataframe
    return df.reindex(ind, level=df.index.names.index(group_by_col))

# First level sorting: Name
df_pivot_1 = sort_groups(df_pivot, 'Name', 'All', False)
print('\nSort groups at name level:')
print(df_pivot_1)

# Second level sorting : City
#df_pivot_2 = df_pivot_1.groupby(by='Name').apply(lambda x : sort_groups(x, 'City', 'All', False))
df_pivot_2 =pd.concat([sort_groups(group, 'City', 'All', False) for index, group in df_pivot_1.groupby(by=['Name'])])
print('\nSort groups at city level:')
print(df_pivot_2)

# Third level sorting : Stage
df_pivot_3 = df_pivot_2.groupby(by = ['Name', 'City']).apply(lambda x : sort_groups(x, 'Stage', 'All', False))
print('\nSort groups at stage level:')
print(df_pivot_3)

This solution does not place the All row where you indicate it, though. Is that a strict requirement for you?

regards,

Jan
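One possible way to get the nested ordering while keeping the All row on top is to sort on group sums of the 'All' column instead of group maxima. The following is a sketch under stated assumptions, not the code from the answer above: it uses the Earnings values from the question, it assumes the margins column of the pivot table is addressable as `('All', '')`, and the names `keys`, `order` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Boby', 'Mina', 'Peter', 'Nicky', 'Peter', 'Mina', 'Peter'],
    'City': ['London', 'NY', 'LA', 'London', 'NY', 'HK', 'NY', 'HK'],
    'Stage': ['Masters', 'Graduate', 'Graduate', 'Masters',
              'Graduate', 'Masters', 'Graduate', 'Graduate'],
    'Year': [2020, 2019, 2020, 2019, 2020, 2019, 2020, 2020],
    'Month': [202001, 201902, 202003, 201904, 202005, 201902, 202007, 202012],
    'Earnings': [27, 23, 21, 66, 24, 22, 34, 65],
})

pivot = pd.pivot_table(df, values='Earnings', index=['Name', 'City', 'Stage'],
                       columns=['Year', 'Month'], aggfunc='sum',
                       fill_value=0, margins=True)

# Split off the grand-total row so it does not take part in the sort.
is_total = pivot.index.get_level_values('Name') == 'All'
total_row, body = pivot[is_total], pivot[~is_total]

# Build one sort key per index level from the row totals.
all_col = body[('All', '')]  # assumption: margins column key is ('All', '')
keys = body.index.to_frame(index=False)
keys['name_total'] = all_col.groupby(level='Name').transform('sum').to_numpy()
keys['city_total'] = all_col.groupby(level=['Name', 'City']).transform('sum').to_numpy()
keys['stage_total'] = all_col.to_numpy()

# Name and City are included as tie-breakers so that groups with equal
# totals stay together instead of interleaving.
order = keys.sort_values(
    ['name_total', 'Name', 'city_total', 'City', 'stage_total'],
    ascending=[False, True, False, True, False],
).index.to_numpy()

result = pd.concat([total_row, body.iloc[order]])
print(result)
```

Because the rows are only reordered by position, no index levels are added or duplicated along the way.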

11 Comments

Hi Jan, I tried your code, but it only made my "All" disappear; no sorting was applied. Can you check the bottom of my post, where I posted what I am expecting in the end? Thank you.
Thanks for that, now it is clear to me what you intended to obtain. I think the edit of my original post reflects this.
Thank you for your response. I have edited my post because your proposal does not work with different values. I will be grateful for help!
In the solution I posted yesterday, you could just move the All column manually, I guess.
Thank you for another response. It works better, but why do I get duplicated columns in this case? When I run the same code as you posted, the column "Name" appears twice in my table, and if I expand the sorting further (my df has more indexed columns), all of them are duplicated. Can you run your code to see that the Name column is duplicated? Thank you.

Here is a clean way to combine a groupby with a pivot:

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Boby', 'Mina', 'Peter',
             'Nicky', 'Peter', 'Mina', 'Peter'],
    'City': ['London', 'NY', 'LA', 'London', 'NY', 'HK', 'NY', 'HK'],
    'Stage': ['Masters', 'Graduate', 'Graduate', 'Masters',
              'Graduate', 'Masters', 'Graduate', 'Graduate'],
    'Year': [2020, 2019, 2020, 2019, 2020, 2019, 2020, 2020],
    'Month': [202001, 201902, 202003, 201904, 202005, 201902, 202007, 202012],
    'Earnings': [27, 23, 21, 23, 24, 22, 34, 65]})

# Sum earnings per Name/City/Stage/Year/Month, then pivot Year/Month into columns
grouped = df.groupby(['Name', 'City', 'Stage', 'Year', 'Month'])['Earnings'].sum()
#print(grouped)
grouped = grouped.reset_index(name='Sum')
fp = grouped.pivot(index=['Name', 'City', 'Stage'], columns=['Year', 'Month'], values='Sum').fillna(0)
# Row total, and a rank based on the total per Name/City group
fp['Totals'] = fp.sum(axis='columns')
fp["Rank"] = fp.groupby(['Name', 'City'])['Totals'].sum()

fp = fp.sort_values(by=['Name', 'Rank', 'City', 'Totals'], ascending=[False, False, False, False])

print(fp)
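A note on the rank step: `groupby(...).sum()` returns one value per group, whereas `groupby(...).transform('sum')` returns one value per original row, which is usually what a per-row sort key needs. The sketch below illustrates the difference on a small flat table of row totals taken from the question's data (only the Peter and Mina rows); the frame `totals` and the columns `CityTotal` and `NameTotal` are hypothetical names, not part of the answer above.

```python
import pandas as pd

# Row totals for Peter and Mina, using the Earnings values from the question.
totals = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Mina', 'Mina'],
    'City': ['HK', 'HK', 'London', 'NY', 'LA'],
    'Stage': ['Graduate', 'Masters', 'Masters', 'Graduate', 'Graduate'],
    'Totals': [65, 22, 66, 34, 21],
})

# One value per (Name, City) group: 4 groups for 5 rows.
per_group = totals.groupby(['Name', 'City'])['Totals'].sum()

# One value per row, aligned with the original rows: usable as a sort key.
totals['CityTotal'] = totals.groupby(['Name', 'City'])['Totals'].transform('sum')
totals['NameTotal'] = totals.groupby('Name')['Totals'].transform('sum')

# Sort by the outer total first, then the inner total, then the row total.
ranked = totals.sort_values(['NameTotal', 'CityTotal', 'Totals'], ascending=False)
print(ranked)
```

With these keys, Mina-NY (34) is placed above Mina-LA (21), and Peter-HK (87 in total) above Peter-London (66). Groups with exactly equal totals would need the group labels as additional tie-breaking sort keys.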

28 Comments

OP specifically requested that the Name column and City column stay grouped.
If you sort by totals, then that rule conflicts. How would you resolve the issue?
@Golden Lion, I can see in your output that for Peter the City "London" is placed above HK, but the sum for HK is higher than the one for London, so HK should come first. So first I want to sort by Name, then for each Name I want to sort City, and then for each City under a specific Name I want to sort Stage.
I added a sort Boolean list. You can change the sort order for each column. This should give you the results you want.
@Golden Lion, I tried your code but it still doesn't work. Sorting for the whole "Peter" group works, but if you look at "Mina", Mina-LA with value 21 is placed above Mina-NY with value 34. I will appreciate your help.