Python Pandas: Create dataframe from Excel file with multi (merged cell) headers

Question

I am relatively new to Python (Pandas) which I would like to use for automating Excel tasks and be more efficient at my work :)

Currently I am sitting in front of below Excel sales report where the "year" is a merged cell.

           |               2018                          |              2019                      |
| Product  |  January  |  February  |  March  |  April   |  January  |  February |  March | April |
| A        |        8  |        10  |     65  |     50   |     8     |     10    |   65   |    50 |
| B        |        9  |        10  |     65  |     50   |     8     |     63    |   65   |    50 |     
| C        |        7  |        10  |     65  |     50   |     8     |     10    |   65   |    50 |
| D        |        8  |        10  |     65  |     50   |     8     |     10    |   65   |    50 |

Now I would like to reshape the report into a stacked format, which I can then write back to Excel, and use for further analysis:

Product  |  Year  |  Month  |  Values
A        |   2018 | January |       8    
B        |   2018 | February|       9

My idea was to create a dataframe and use pd.melt()

Unfortunately I fail already at the very first step when trying to create the dataframe.

The "Year" is only written in 2 cells while the rest shows "unnamed x".

import pandas as pd

// change console output
desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

//Read Excel file and create dataframe

df = pd.read_excel("Stackoverflow_example.xlsx")

print(df)




  Unnamed: 0     2018 Unnamed: 2 Unnamed: 3 Unnamed: 4     2019 Unnamed: 6 Unnamed: 7 Unnamed: 8
0    Product  January   February      March      April  January   February      March      April
1          A        8         10         65         50        8         10         65         50
2          B        9         10         65         50        8         63         65         50
3          C        7         10         65         50        8         10         65         50
4          D        8         10         65         50        8         10         65         50

It would be great if someone could help me on this problem.

Many thanks in advance.

Edit:

Adding header=[0,1], index_col=[0] worked, but I am still struggling to find a way to convert it into a stacked format.....

import pandas as pd

desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])

print(df)

----------------------------------------------------------------------

           2018                         2019                     
Product January February March April January February March April
A             8       10    65    50       8       10    65    50
B             9       10    65    50       8       63    65    50
C             7       10    65    50       8       10    65    50
D             8       10    65    50       8       10    65    50

It worked, but messed up the column header names at the same time (level_0, "Product" is in the "month" column...


import pandas as pd

desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])
df = df.stack().reset_index()

print(df)

-----------------------------------------------------------------------------
   level_0   Product  2018  2019
0        A     April    50    50
1        A  February    10    10
2        A   January     8     8
3        A     March    65    65
4        B     April    50    50
5        B  February    10    63
6        B   January     9     8
7        B     March    65    65
8        C     April    50    50
9        C  February    10    10
10       C   January     7     8
11       C     March    65    65
12       D     April    50    50
13       D  February    10    10
14       D   January     8     8
15       D     March    65    65

I tried to rename the columns and set the index to "Product", resulting in empty "cells" below "Month 2018 2019"

import pandas as pd

desired_width = 320
pd.set_option("display.width", desired_width)
pd.set_option("display.max_columns", 30)

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])
df = df.stack().reset_index()

df.columns = ["Product", "Month", "2018", "2019"]
df = df.set_index("Product")

print(df)

----------------------------------------------------------

           Month  2018  2019
Product                      
A           April    50    50
A        February    10    10
A         January     8     8
A           March    65    65
B           April    50    50
B        February    10    63
B         January     9     8
B           March    65    65
C           April    50    50
C        February    10    10
C         January     7     8
C           March    65    65
D           April    50    50
D        February    10    10
D         January     8     8
D           March    65    65

Thanks, jezrael - that worked, but I am still struggling to convert it into stacked format :/ — SebK
– SebK, Commented Nov 18, 2019 at 13:55
worked with this sort of data before (mainly SAP BW!!) let me know if my answer helped at all. — Umar.H
– Umar.H, Commented Nov 18, 2019 at 14:44

jezrael · Accepted Answer · 2019-11-18 15:02:39Z

1

First for MultiIndex in columns add parameter header=[0,1] and for avoid MultiIndex by first column add index_col=[0] for convert firt column to index:

df = pd.read_excel("Stackoverflow_example.xlsx", header=[0,1], index_col=[0])

Then reshape by DataFrame.unstack, change index names by Series.rename_axis and last convert Series to columns by Series.reset_index:

df = df.unstack().rename_axis(('Year','Month','Product')).reset_index(name='Value')

#if order of columns is impiortant change it by subset
df = df[['Product','Year','Month','Value']]
print(df.head())

  Product  Year     Month  Value
0       A  2018   January      8
1       B  2018   January      9
2       C  2018   January      7
3       D  2018   January      8
4       A  2018  February     10

edited Nov 18, 2019 at 15:02

answered Nov 18, 2019 at 13:54

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Umar.H · Accepted Answer · 2019-11-18 15:11:58Z

One way is using pd.MultiIndex, stack and melt

print(df)
    Unnamed:_0     2018 Unnamed:_2 Unnamed:_3 Unnamed:_4     2019 Unnamed:_6  \
0    Product  January   February      March      April  January   February   
1          A        8         10         65         50        8         10   
2          B        9         10         65         50        8         63   
3          C        7         10         65         50        8         10   
4          D        8         10         65         50        8         10   

  Unnamed:_7 Unnamed:_8  
0      March      April  
1         65         50  
2         65         50  
3         65         50  
4         65         50

first we need to rename the columns from Unnamed and set the column Product name

df.columns = pd.Series([np.nan if 'Unnamed:' in x else x for x in df.columns.values]).ffill().values.flatten()

as we used ffill the first column will be nan lets call this Product and set it as the index.

df.rename(columns={np.nan : 'Product'},inplace=True)
df.set_index('Product',inplace=True)

and lets create our multi index from the new columns :

print(df)
        2018      2018   2018   2018     2019      2019   2019   2019
Product                                                                  
Product  January  February  March  April  January  February  March  April
A              8        10     65     50        8        10     65     50
B              9        10     65     50        8        63     65     50
C              7        10     65     50        8        10     65     50
D              8        10     65     50        8        10     65     50


df.columns = pd.MultiIndex.from_arrays([df.columns,df.iloc[0].values])
df_new = df.iloc[1:].stack().reset_index().melt(id_vars=['Product','level_1'])
print(df_new)
    Product   level_1 variable value
0        A     April     2018    50
1        A  February     2018    10
2        A   January     2018     8
3        A     March     2018    65
4        B     April     2018    50
5        B  February     2018    10
6        B   January     2018     9
7        B     March     2018    65
8        C     April     2018    50
9        C  February     2018    10
10       C   January     2018     7
11       C     March     2018    65
12       D     April     2018    50
13       D  February     2018    10
14       D   January     2018     8
15       D     March     2018    65
16       A     April     2019    50
17       A  February     2019    10
18       A   January     2019     8
19       A     March     2019    65
20       B     April     2019    50
21       B  February     2019    63
22       B   January     2019     8
23       B     March     2019    65
24       C     April     2019    50
25       C  February     2019    10
26       C   January     2019     8
27       C     March     2019    65
28       D     April     2019    50
29       D  February     2019    10
30       D   January     2019     8
31       D     March     2019    65

Collectives™ on Stack Overflow

Python Pandas: Create dataframe from Excel file with multi (merged cell) headers

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related