2

I am relatively new to Python (Pandas) which I would like to use for automating Excel tasks and be more efficient at my work :)

Currently I am sitting in front of below Excel sales report where the "year" is a merged cell.

           |               2018                          |              2019                      |
| Product  |  January  |  February  |  March  |  April   |  January  |  February |  March | April |
| A        |        8  |        10  |     65  |     50   |     8     |     10    |   65   |    50 |
| B        |        9  |        10  |     65  |     50   |     8     |     63    |   65   |    50 |     
| C        |        7  |        10  |     65  |     50   |     8     |     10    |   65   |    50 |
| D        |        8  |        10  |     65  |     50   |     8     |     10    |   65   |    50 |

Now I would like to reshape the report into a stacked format, which I can then write back to Excel, and use for further analysis:

Product  |  Year  |  Month  |  Values
A        |   2018 | January |       8    
B        |   2018 | February|       9

My idea was to create a dataframe and use pd.melt()

Unfortunately I fail already at the very first step when trying to create the dataframe.

The "Year" is only written in 2 cells while the rest shows "unnamed x".

import pandas as pd

// change console output
desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

//Read Excel file and create dataframe

df = pd.read_excel("Stackoverflow_example.xlsx")

print(df)




  Unnamed: 0     2018 Unnamed: 2 Unnamed: 3 Unnamed: 4     2019 Unnamed: 6 Unnamed: 7 Unnamed: 8
0    Product  January   February      March      April  January   February      March      April
1          A        8         10         65         50        8         10         65         50
2          B        9         10         65         50        8         63         65         50
3          C        7         10         65         50        8         10         65         50
4          D        8         10         65         50        8         10         65         50

It would be great if someone could help me on this problem.

Many thanks in advance.

Edit:

Adding header=[0,1], index_col=[0] worked, but I am still struggling to find a way to convert it into a stacked format.....

import pandas as pd

desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])

print(df)

----------------------------------------------------------------------

           2018                         2019                     
Product January February March April January February March April
A             8       10    65    50       8       10    65    50
B             9       10    65    50       8       63    65    50
C             7       10    65    50       8       10    65    50
D             8       10    65    50       8       10    65    50

It worked, but messed up the column header names at the same time (level_0, "Product" is in the "month" column...


import pandas as pd

desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])
df = df.stack().reset_index()

print(df)

-----------------------------------------------------------------------------
   level_0   Product  2018  2019
0        A     April    50    50
1        A  February    10    10
2        A   January     8     8
3        A     March    65    65
4        B     April    50    50
5        B  February    10    63
6        B   January     9     8
7        B     March    65    65
8        C     April    50    50
9        C  February    10    10
10       C   January     7     8
11       C     March    65    65
12       D     April    50    50
13       D  February    10    10
14       D   January     8     8
15       D     March    65    65

I tried to rename the columns and set the index to "Product", resulting in empty "cells" below "Month 2018 2019"

import pandas as pd

desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])
df = df.stack().reset_index()

df.columns = ["Product", "Month", "2018", "2019"]
df = df.set_index("Product")

print(df)

----------------------------------------------------------

           Month  2018  2019
Product                      
A           April    50    50
A        February    10    10
A         January     8     8
A           March    65    65
B           April    50    50
B        February    10    63
B         January     9     8
B           March    65    65
C           April    50    50
C        February    10    10
C         January     7     8
C           March    65    65
D           April    50    50
D        February    10    10
D         January     8     8
D           March    65    65

5
  • Thanks, jezrael - that worked, but I am still struggling to convert it into stacked format :/ Commented Nov 18, 2019 at 13:55
  • Can you check answer? Commented Nov 18, 2019 at 13:59
  • worked with this sort of data before (mainly SAP BW!!) let me know if my answer helped at all. Commented Nov 18, 2019 at 14:44
  • @SebK - oops, there was necessary unstack, answer edited. Commented Nov 18, 2019 at 14:57
  • Thanks a lot, guys! Both solutions work just fine :) Commented Nov 19, 2019 at 13:40

2 Answers 2

1

First for MultiIndex in columns add parameter header=[0,1] and for avoid MultiIndex by first column add index_col=[0] for convert firt column to index:

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])

Then reshape by DataFrame.unstack, change index names by Series.rename_axis and last convert Series to columns by Series.reset_index:

df = df.unstack().rename_axis(('Year','Month','Product')).reset_index(name='Value')

#if order of columns is impiortant change it by subset
df = df[['Product','Year','Month','Value']]
print(df.head())

  Product  Year     Month  Value
0       A  2018   January      8
1       B  2018   January      9
2       C  2018   January      7
3       D  2018   January      8
4       A  2018  February     10
Sign up to request clarification or add additional context in comments.

Comments

0

One way is using pd.MultiIndex, stack and melt

print(df)
    Unnamed:_0     2018 Unnamed:_2 Unnamed:_3 Unnamed:_4     2019 Unnamed:_6  \
0    Product  January   February      March      April  January   February   
1          A        8         10         65         50        8         10   
2          B        9         10         65         50        8         63   
3          C        7         10         65         50        8         10   
4          D        8         10         65         50        8         10   

  Unnamed:_7 Unnamed:_8  
0      March      April  
1         65         50  
2         65         50  
3         65         50  
4         65         50  

first we need to rename the columns from Unnamed and set the column Product name

df.columns = pd.Series([np.nan if 'Unnamed:' in x else x for x in df.columns.values]).ffill().values.flatten()

as we used ffill the first column will be nan lets call this Product and set it as the index.

df.rename(columns={np.nan : 'Product'},inplace=True)
df.set_index('Product',inplace=True)

and lets create our multi index from the new columns :

print(df)
        2018      2018   2018   2018     2019      2019   2019   2019
Product                                                                  
Product  January  February  March  April  January  February  March  April
A              8        10     65     50        8        10     65     50
B              9        10     65     50        8        63     65     50
C              7        10     65     50        8        10     65     50
D              8        10     65     50        8        10     65     50


df.columns = pd.MultiIndex.from_arrays([df.columns,df.iloc[0].values])
df_new = df.iloc[1:].stack().reset_index().melt(id_vars=['Product','level_1'])
print(df_new)
    Product   level_1 variable value
0        A     April     2018    50
1        A  February     2018    10
2        A   January     2018     8
3        A     March     2018    65
4        B     April     2018    50
5        B  February     2018    10
6        B   January     2018     9
7        B     March     2018    65
8        C     April     2018    50
9        C  February     2018    10
10       C   January     2018     7
11       C     March     2018    65
12       D     April     2018    50
13       D  February     2018    10
14       D   January     2018     8
15       D     March     2018    65
16       A     April     2019    50
17       A  February     2019    10
18       A   January     2019     8
19       A     March     2019    65
20       B     April     2019    50
21       B  February     2019    63
22       B   January     2019     8
23       B     March     2019    65
24       C     April     2019    50
25       C  February     2019    10
26       C   January     2019     8
27       C     March     2019    65
28       D     April     2019    50
29       D  February     2019    10
30       D   January     2019     8
31       D     March     2019    65

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.