Aggregating multiple string values in Pandas pivot table

Question

I'm trying to create a calendar that rolls up information across a catalog of projects and organizes it chronologically and by project type. I've been using Pandas and have been unable to get the basic structure right. For example, given this dataset:

           Type      Name   Health Month  Year
0     Marketing  ProjectA       OK   Jan  2018
1       Science  ProjectB  Warning   Apr  2018
2     Marketing  ProjectC       OK   Mar  2018
3   Development  ProjectD       OK   Feb  2018
4     Marketing  ProjectE       OK   Jan  2018
5   Development  ProjectF  Warning   Feb  2018
6   Development  ProjectG  Trouble   May  2018
7     Marketing  ProjectH  Trouble   May  2018
8   Development  ProjectI  Warning   Feb  2018
9     Marketing  ProjectJ       OK   May  2018
10      Science  ProjectK  Warning   Apr  2018

Using the trick shown at Remove none values from dataframe, I can create a field to track the rank order of each item within the final table:

df['aggval'] = df['Year'].map(str) + df['Month'] + df['Type']
df['index'] = df.groupby(['aggval']).cumcount()

produces 2 extra columns:

           Type      Name   Health Month  Year              aggval  index
0     Marketing  ProjectA       OK   Jan  2018    2018JanMarketing      0
1       Science  ProjectB  Warning   Apr  2018      2018AprScience      0
2     Marketing  ProjectC       OK   Mar  2018    2018MarMarketing      0
3   Development  ProjectD       OK   Feb  2018  2018FebDevelopment      0
4     Marketing  ProjectE       OK   Jan  2018    2018JanMarketing      1
5   Development  ProjectF  Warning   Feb  2018  2018FebDevelopment      1
6   Development  ProjectG  Trouble   May  2018  2018MayDevelopment      0
7     Marketing  ProjectH  Trouble   May  2018    2018MayMarketing      0
8   Development  ProjectI  Warning   Feb  2018  2018FebDevelopment      2
9     Marketing  ProjectJ       OK   May  2018    2018MayMarketing      1
10      Science  ProjectK  Warning   Apr  2018      2018AprScience      1

With these extra columns, we can now pivot to create an initial version of our project roll-up table:

pv1 = pd.pivot_table(df, values='Name', index=['Type', 'index'], columns=['Year', 'Month'], aggfunc=lambda x: "".join(x)).fillna('')
pv1 = pv1.reindex(columns = zip(12 * [2018], ['Jan', 'Feb', 'Mar', 'Apr', 'May']))

to produce the report below. This is basically correct: it collects and lists projects, shows their Names, and organizes them by Type (swimlanes) and chronologically by year and month:

Year                 2018                                          
Month                Jan       Feb       Mar       Apr       May   
Type        index                                                  
Development 0                ProjectD                      ProjectG
            1                ProjectF                              
            2                ProjectI                              
Marketing   0      ProjectA            ProjectC            ProjectH
            1      ProjectE                                ProjectJ
Science     0                                    ProjectB          
            1                                    ProjectK

I'm now stumped in trying to extend this model to display the Name and Health for each project together.

I can add in the Health field as a second pivot table value:

pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')
# pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))

to produce:

                   Health                               Name                                          
Year                2018                                2018                                          
Month               Apr      Feb    Jan Mar   May       Apr       Feb       Jan       Mar       May   
Type        index                                                                                     
Development 0                    OK          Trouble            ProjectD                      ProjectG
            1               Warning                             ProjectF                              
            2               Warning                             ProjectI                              
Marketing   0                        OK  OK  Trouble                      ProjectA  ProjectC  ProjectH
            1                        OK           OK                      ProjectE            ProjectJ
Science     0      Warning                            ProjectB                                        
            1      Warning                            ProjectK

This is the right idea: both the project Health and Name show up for each project, in the right Month and the right Type swimlane, but I'd like them side by side for each project. Reindexing the columns produces the right result at the header level, but replaces every cell with NaN:

pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')
pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))

produces:

                   2018                                                      
Year               Jan         Feb         Mar         Apr         May       
Month             Health Name Health Name Health Name Health Name Health Name
Type        index                                                            
Development 0      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 
            1      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 
            2      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 
Marketing   0      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 
            1      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 
Science     0      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN 
            1      NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN

Again, the structure is now correct, but the cell values are no longer showing the project-specific data. What am I missing?

unutbu · Accepted Answer · 2018-03-06 22:47:14Z

pv2 starts out having columns in this order:

In [35]: pv2.columns.tolist()
Out[35]: 
[('Health', 2018, 'Apr'),
 ('Health', 2018, 'Feb'),
 ('Health', 2018, 'Jan'),
 ('Health', 2018, 'Mar'),
 ('Health', 2018, 'May'),
 ('Name', 2018, 'Apr'),
 ('Name', 2018, 'Feb'),
 ('Name', 2018, 'Jan'),
 ('Name', 2018, 'Mar'),
 ('Name', 2018, 'May')]

and we want to rearrange the columns to have this order:

In [36]: list(zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))
Out[36]: 
[(2018, 'Jan', 'Health'),
 (2018, 'Jan', 'Name'),
 (2018, 'Feb', 'Health'),
 (2018, 'Feb', 'Name'),
 (2018, 'Mar', 'Health'),
 (2018, 'Mar', 'Name'),
 (2018, 'Apr', 'Health'),
 (2018, 'Apr', 'Name'),
 (2018, 'May', 'Health'),
 (2018, 'May', 'Name')]

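Each column is represented by a 3-tuple. reindex can reorder the list of columns, but it cannot change the internal order of the items within the 3-tuples. Because none of the requested tuples matches an existing column label, reindex fills every new column with NaN. To change the order within the tuples, use reorder_levels: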

In [37]: pv2 = pv2.reorder_levels(['Year','Month',0], axis=1)
In [38]: pv2.columns.tolist()
Out[38]: 
[(2018, 'Apr', 'Health'),
 (2018, 'Feb', 'Health'),
 (2018, 'Jan', 'Health'),
 (2018, 'Mar', 'Health'),
 (2018, 'May', 'Health'),
 (2018, 'Apr', 'Name'),
 (2018, 'Feb', 'Name'),
 (2018, 'Jan', 'Name'),
 (2018, 'Mar', 'Name'),
 (2018, 'May', 'Name')]
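
To see the mismatch in isolation, here is a minimal sketch on a made-up one-row frame (the names `toy`, `target`, `mismatched` and `fixed` are hypothetical and not part of the original post). It assumes the column tuples start out ordered (value, Year, Month), as they do in pv2:

```python
import pandas as pd

# Toy frame whose column tuples are ordered (value, Year, Month), like pv2.
cols = pd.MultiIndex.from_tuples(
    [("Health", 2018, "Feb"), ("Health", 2018, "Jan"),
     ("Name", 2018, "Feb"), ("Name", 2018, "Jan")],
    names=[None, "Year", "Month"],
)
toy = pd.DataFrame([["OK", "Warning", "ProjectD", "ProjectA"]], columns=cols)

# Desired labels, ordered (Year, Month, value).
target = pd.MultiIndex.from_tuples(
    [(2018, "Jan", "Health"), (2018, "Jan", "Name"),
     (2018, "Feb", "Health"), (2018, "Feb", "Name")]
)

# No target tuple exists as a column label yet, so every cell becomes NaN.
mismatched = toy.reindex(columns=target)

# Reordering the levels first makes the labels match, so the data survives.
fixed = toy.reorder_levels(["Year", "Month", 0], axis=1).reindex(columns=target)

print(mismatched.isna().all().all())
print(fixed)
```

The first print should report that `mismatched` holds only NaN, while `fixed` keeps the original strings in the requested column order.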

Once you have the levels in the desired order, you can call reindex to reorder the columns (to get the months in order).

import sys
import pandas as pd
pd.options.display.width = sys.maxsize

df = pd.DataFrame({'Health': ['OK', 'Warning', 'OK', 'OK', 'OK', 'Warning', 'Trouble', 'Trouble', 'Warning', 'OK', 'Warning'], 'Month': ['Jan', 'Apr', 'Mar', 'Feb', 'Jan', 'Feb', 'May', 'May', 'Feb', 'May', 'Apr'], 'Name': ['ProjectA', 'ProjectB', 'ProjectC', 'ProjectD', 'ProjectE', 'ProjectF', 'ProjectG', 'ProjectH', 'ProjectI', 'ProjectJ', 'ProjectK'], 'Type': ['Marketing', 'Science', 'Marketing', 'Development', 'Marketing', 'Development', 'Development', 'Marketing', 'Development', 'Marketing', 'Science'], 'Year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018]})

df['index'] = df.groupby(['Year','Month','Type']).cumcount()

pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], 
                     columns=['Year', 'Month'], 
                     aggfunc={'Name':lambda x: "|".join(x), 
                              'Health':lambda x: ":".join(x), }).fillna('')
pv2 = pv2.reorder_levels(['Year','Month',0], axis=1)
pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))

print(pv2)

yields

Year                2018                                                                                    
Month                Jan                Feb              Mar                Apr                May          
                  Health      Name   Health      Name Health      Name   Health      Name   Health      Name
Type        index                                                                                           
Development 0                            OK  ProjectD                                      Trouble  ProjectG
            1                       Warning  ProjectF                                                       
            2                       Warning  ProjectI                                                       
Marketing   0         OK  ProjectA                        OK  ProjectC                     Trouble  ProjectH
            1         OK  ProjectE                                                              OK  ProjectJ
Science     0                                                           Warning  ProjectB                   
            1                                                           Warning  ProjectK

Although sometimes you may need to manually specify the desired order of the columns, this is not (necessarily) one of those cases. The order you desire is the natural date order. So it would be to our advantage to parse the Year and Month labels into actual dates (of dtype datetime64[ns]). This unlocks Pandas' intelligent datetime-handling behavior.

For example, pivot_table will sort the dates for us automatically if we pivot on a date column (i.e. a column of dtype datetime64[ns]).
Moreover, we can conveniently generate all the calendar months in order without any manual typing of dates:
```
dates = pd.date_range('2018-01-01', '2018-12-31', freq='MS')
```
And we can convert a DatetimeIndex to a 2-level MultiIndex Year/Month format (for presentation purposes) easily as well:
```
pv2.index = pd.Index(pv2.index.strftime('%Y-%b')).str.split('-', expand=True)
```
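
As a small illustration of why real dates help, the sketch below parses made-up Year/Month values (the three-row frame and the variable names are assumptions for this example only) and compares date order with plain string order. It assumes an English locale for the `%b` month abbreviations:

```python
import pandas as pd

# Hypothetical three-row sample, deliberately out of calendar order.
sample = pd.DataFrame({"Year": [2018, 2018, 2018],
                       "Month": ["Mar", "Jan", "Feb"]})

# Same parsing idea as in the answer: "2018" + "Mar" -> 2018-03-01.
sample["Date"] = pd.to_datetime(
    sample["Year"].astype(str) + sample["Month"], format="%Y%b"
)

# Sorting on the datetime column gives calendar order ...
chronological = sample.sort_values("Date")["Month"].tolist()
# ... whereas sorting the month strings gives alphabetical order.
alphabetical = sorted(sample["Month"])

print(chronological)
print(alphabetical)
```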

For example,

import sys
import pandas as pd
pd.options.display.width = sys.maxsize

df = pd.DataFrame({'Health': ['OK', 'Warning', 'OK', 'OK', 'OK', 'Warning', 'Trouble', 'Trouble', 'Warning', 'OK', 'Warning'], 'Month': ['Jan', 'Apr', 'Mar', 'Feb', 'Jan', 'Feb', 'May', 'May', 'Feb', 'May', 'Apr'], 'Name': ['ProjectA', 'ProjectB', 'ProjectC', 'ProjectD', 'ProjectE', 'ProjectF', 'ProjectG', 'ProjectH', 'ProjectI', 'ProjectJ', 'ProjectK'], 'Type': ['Marketing', 'Science', 'Marketing', 'Development', 'Marketing', 'Development', 'Development', 'Marketing', 'Development', 'Marketing', 'Science'], 'Year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018]})

df['Date'] = pd.to_datetime(df['Year'].astype('str')+df['Month'], format='%Y%b')
df['index'] = df.groupby(['Date','Type']).cumcount()

pv2 = pd.pivot_table(df, values=['Name', 'Health'], columns=['Type', 'index'], 
                     index=['Date'], 
                     aggfunc={'Name':lambda x: "|".join(x), 
                              'Health':lambda x: ":".join(x), }).fillna('')

dates = pd.date_range('2018-01-01', '2018-12-31', freq='MS')
pv2 = pv2.reindex(dates, fill_value='')
pv2.index = pd.Index(pv2.index.strftime('%Y-%b')).str.split('-', expand=True)
pv2 = pv2.stack(0)
pv2 = pv2.T
print(pv2)

yields

                    2018                                                                                     ...                                                             
                     Jan                Feb              Mar                Apr                May           ...     Aug         Sep         Oct         Nov         Dec     
                  Health      Name   Health      Name Health      Name   Health      Name   Health      Name ...  Health Name Health Name Health Name Health Name Health Name
Type        index                                                                                            ...                                                             
Development 0                            OK  ProjectD                                      Trouble  ProjectG ...                                                             
            1                       Warning  ProjectF                                                        ...                                                             
            2                       Warning  ProjectI                                                        ...                                                             
Marketing   0         OK  ProjectA                        OK  ProjectC                     Trouble  ProjectH ...                                                             
            1         OK  ProjectE                                                              OK  ProjectJ ...                                                             
Science     0                                                           Warning  ProjectB                    ...                                                             
            1                                                           Warning  ProjectK                    ...

Thanks @unutbu, this is close, but it still has all the "Health" columns together and all the "Name" columns together. I'm looking to see a Health and a Name column under each month, so the project name and status are shown next to each other. I imagine this would require each month heading to span 2 columns (one for Name, one for Health).
Sorry about that; I fell in love with MultiIndex.from_product and forgot what order you were really asking for. The code above now uses your desired order.
Regarding your second explanation on column sorting, would your approach still guarantee column presence? That is, would it ensure that there is a Jan, Feb, Mar, etc. column even if there are no values there? Recall that the objective is to create a sort of calendar view, so I thought manually specifying the columns was the best way to ensure each month is represented, irrespective of the actual record contents.
In that case, I think I would convert the Year and Month columns to a single datetime column first. Do all the calculations on actual datetimes, and then convert back to the Year/Month format for presentation purposes only at the end. I've edited the post to show what I mean.
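
On the question of guaranteed month presence, the following sketch (with a made-up two-row frame; `small`, `months` and `calendar` are hypothetical names) suggests how reindexing against a full date_range yields one row per month regardless of which months appear in the data:

```python
import pandas as pd

# Hypothetical mini dataset: only January and March have a project.
small = pd.DataFrame(
    {"Name": ["ProjectA", "ProjectC"]},
    index=pd.to_datetime(["2018-01-01", "2018-03-01"]),
)

# One entry per month start in 2018, whether or not the data contains it.
months = pd.date_range("2018-01-01", "2018-12-31", freq="MS")

# Months missing from the data are added and filled with empty strings.
calendar = small.reindex(months, fill_value="")

print(calendar)
```

Because the month list comes from date_range rather than from the records, an empty month still gets its own (blank) row, which becomes a column after the final transpose in the answer's code.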

BENY · Accepted Answer · 2018-03-05 04:27:23Z


IIUC, you just need swaplevel and sort_index:

#pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')

pv2.swaplevel(0,1,axis=1).swaplevel(1,2,axis=1).sort_index(axis=1)

Out[220]: 
Year                  2018                                                \
Month                  Apr                Feb              Jan             
                    Health      Name   Health      Name Health      Name   
Type        index                                                          
Development 0                              OK  ProjectD                    
            1                         Warning  ProjectF                    
            2                         Warning  ProjectI                    
Marketing   0                                               OK  ProjectA   
            1                                               OK  ProjectE   
Science     0      Warning  ProjectB                                       
            1      Warning  ProjectK                                       
Year                                                   
Month                Mar                May            
                  Health      Name   Health      Name  
Type        index                                      
Development 0                       Trouble  ProjectG  
            1                                          
            2                                          
Marketing   0         OK  ProjectC  Trouble  ProjectH  
            1                            OK  ProjectJ  
Science     0                                          
            1                                          

#pv2.swaplevel(0,1,axis=1).swaplevel(1,2,axis=1).sort_index(axis=1).to_excel('aaaaaa.xlsx')

edited Mar 5, 2018 at 4:27

answered Mar 5, 2018 at 2:13


6 Comments

Ramon Over a year ago

Thanks @Wen. Your output is what I'm looking for, but I'm unable to reproduce it just with the swaplevel code you produced. Could you please edit your answer to include the upstream steps that include the pivot table creation and anything in between?

Ramon Over a year ago

Thanks that fixed it for me. What does the sort_index do at the end? Can that be used to reorder the columns (e.g. make it Name,Health instead of Health,Name)?

Ramon Over a year ago

Also can you confirm that you see this same formatting if you export to Excel? Weirdly, when I call to_excel I then see the Health and Name columns ungrouped again in the resulting Excel file

BENY Over a year ago

@Ramon sort_index sorts the index; it does not reorder it arbitrarily. It rearranges the labels into sorted order within each level of the index.

BENY Over a year ago

@Ramon I will update a picture after I write into excel
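
Regarding the sort_index question in the comments above, this sketch on a made-up one-row frame (all variable names are hypothetical) shows that sort_index orders labels lexicographically, and that one possible way to get a custom order such as Name before Health is to reindex with an explicit list of column tuples. This is offered as one option, not as the answerer's recommendation:

```python
import pandas as pd

# Toy frame with (value, Year, Month) column tuples, like the original pv2.
cols = pd.MultiIndex.from_tuples(
    [("Health", 2018, "Jan"), ("Health", 2018, "Feb"),
     ("Name", 2018, "Jan"), ("Name", 2018, "Feb")],
    names=[None, "Year", "Month"],
)
toy = pd.DataFrame([["OK", "Warning", "ProjectA", "ProjectD"]], columns=cols)

# Two swaps move the levels into (Year, Month, value) order.
swapped = toy.swaplevel(0, 1, axis=1).swaplevel(1, 2, axis=1)

# sort_index sorts labels lexicographically, so "Feb" lands before "Jan".
by_label = swapped.sort_index(axis=1)
month_order = by_label.columns.get_level_values("Month").unique().tolist()

# Explicit order instead: calendar months, with Name ahead of Health.
wanted = pd.MultiIndex.from_tuples(
    [(2018, m, v) for m in ["Jan", "Feb"] for v in ["Name", "Health"]]
)
explicit = swapped.reindex(columns=wanted)

print(month_order)
print(explicit)
```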
