Reading partially multi-indexed Excel files in Python

Question

I currently have an Excel file that looks like this, where the first 4 rows are part of the header:

What I am looking for is to read it using Python while maintaining the multi-indexes on the columns.

Currently, I use the following code to read the Excel file using pandas:

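from typing import Tuple

import pandas as pd

# Read the first four rows as a four-level column header.
df = pd.read_excel("./Demo Xls.xlsx", header=[0, 1, 2, 3])

def clean_multi_index(column: Tuple) -> Tuple:
    # Drop the "Unnamed: ..." placeholders pandas generates for blank header cells.
    return tuple(x for x in column if "Unnamed:" not in x)

cleaned_cols = [clean_multi_index(x) for x in df.columns]
cleaned_cols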

However, the columns printed are:

[('Header', 'A', 'X1'),
 ('Header', 'Col1', 'X2'),
 ('Header', 'Col1', 'X3'),
 ('Header', 'Col2', 'C', 'X4'),
 ('Header', 'Col2', 'B', 'X5'),
 ('Header', 'Col2', 'B', 'X6')]

As you can see, it seems to make some assumptions about how to fill in the higher level columns when they are empty.

For example, X6 is not a part of the merged Col2 and B levels. The result I am expecting is something like:

[('Header', 'A', 'X1'),
 ('Header', 'Col1', 'X2'),
 ('Header', 'X3'),
 ('Header', 'Col2', 'C', 'X4'),
 ('Header', 'Col2', 'B', 'X5'),
 ('Header', 'X6')]

Almost all the examples I have seen online that talk about reading Excel files in Python cover either merged rows or cases where the higher-level rows are always merged (i.e., when one merge ends, another begins).

Unfortunately, I cannot modify the data to improve its structure and need to work with the xls file as-is.

jezrael · Accepted Answer · 2021-07-14 06:05:23Z


Neither of your outputs can be converted to a MultiIndex, because all the tuples have to be the same length.

So a possible solution is to replace the Unnamed labels with empty strings, so the tuples are:

def clean_multi_index(column: Tuple) -> Tuple:
    return tuple(['' if "Unnamed:" in x else x for x in column])


[('Header', '', 'A', 'X1'),
 ('Header', 'Col1', '', 'X2'),
 ('Header', 'Col1', '', 'X3'),
 ('Header', 'Col2', 'C', 'X4'),
 ('Header', 'Col2', 'B', 'X5'),
 ('Header', 'Col2', 'B', 'X6')]
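
To make the replacement step concrete, here is a minimal sketch that runs without the Excel file. The `raw_cols` list is an assumption: it imitates the kind of `Unnamed: ...` placeholders pandas generates for blank header cells, and the exact placeholder strings in a real file may differ. `blank_unnamed` is a hypothetical name for a variant of the function above that also tolerates non-string labels.

```python
from typing import Hashable, List, Tuple

def blank_unnamed(column: Tuple[Hashable, ...]) -> Tuple[Hashable, ...]:
    # Replace placeholder labels with '' so every tuple keeps all four levels.
    # str(x) guards against numeric header labels.
    return tuple("" if "Unnamed:" in str(x) else x for x in column)

# Assumed example input, imitating df.columns after read_excel(header=[0, 1, 2, 3]).
raw_cols: List[Tuple[str, ...]] = [
    ("Header", "Unnamed: 0_level_1", "A", "X1"),
    ("Header", "Col1", "Unnamed: 1_level_2", "X2"),
    ("Header", "Col1", "Unnamed: 2_level_2", "X3"),
    ("Header", "Col2", "C", "X4"),
    ("Header", "Col2", "B", "X5"),
    ("Header", "Col2", "B", "X6"),
]

cleaned = [blank_unnamed(c) for c in raw_cols]

# Every tuple still has the same number of levels.
same_length = len({len(c) for c in cleaned}) == 1

for c in cleaned:
    print(c)
```

Because the tuples all have four levels, a list like `cleaned` could be passed to `pd.MultiIndex.from_tuples` and assigned to `df.columns`.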

The idea is to read the header rows as a plain DataFrame, forward-fill only where needed (e.g. the first row), then read the file again while skipping the first 4 rows and assign the processed header to df.columns:

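# Read only the four header rows, as ordinary data.
df1 = pd.read_excel('Demo Xls.xlsx', header=None, nrows=4)
# Forward-fill the first row, which holds the top-level 'Header' label.
df1.iloc[0] = df1.iloc[0].ffill()
print(df1)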
        0       1       2       3       4       5
0  Header  Header  Header  Header  Header  Header
1     NaN    Col1     NaN    Col2     NaN     NaN
2       A     NaN     NaN       C       B     NaN
3      X1      X2      X3      X4      X5      X6

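# Read the data rows, skipping the four header rows.
df = pd.read_excel('Demo Xls.xlsx', header=None, skiprows=4)
# Each row of df1.T holds the four header labels of one column.
df.columns = pd.MultiIndex.from_frame(df1.T)
# Clear the level names that from_frame takes from the column labels of df1.T.
df = df.rename_axis([None, None, None, None], axis=1)
print(df)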
  Header                         
     NaN Col1  NaN Col2  NaN     
       A  NaN  NaN    C    B  NaN
      X1   X2   X3   X4   X5   X6
0    v11  v12  v13  v14  v15  v16
1    v21  v22  v23  v24  v25  v26
2    v31  v32  v33  v34  v35  v36
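
If the real merged ranges of the sheet are known, the header can be filled only inside those ranges instead of forward-filling a whole row. The following is a minimal sketch of that idea in plain Python. `fill_merged` and `to_column_tuples` are hypothetical helper names, and the two merged ranges in the demo are assumptions based on the question's description (the top `Header` cell spanning all six columns, `Col2` spanning the columns above C and B). A library such as openpyxl exposes merged ranges through `worksheet.merged_cells.ranges`, which could supply this list, though that wiring is not shown here.

```python
from typing import List, Optional, Tuple

Cell = Optional[str]
# (min_row, min_col, max_row, max_col), 0-based and inclusive.
MergedRange = Tuple[int, int, int, int]

def fill_merged(grid: List[List[Cell]], merged: List[MergedRange]) -> List[List[Cell]]:
    """Copy the top-left value of each merged range into every cell it covers."""
    filled = [list(row) for row in grid]
    for min_row, min_col, max_row, max_col in merged:
        value = grid[min_row][min_col]
        for r in range(min_row, max_row + 1):
            for c in range(min_col, max_col + 1):
                filled[r][c] = value
    return filled

def to_column_tuples(filled: List[List[Cell]], blank: str = "") -> List[Tuple[str, ...]]:
    """Transpose header rows into one tuple per column, blanking unfilled cells."""
    return [tuple(blank if v is None else v for v in col) for col in zip(*filled)]

# Header rows as they appear when read with header=None (None marks an empty cell).
header_grid: List[List[Cell]] = [
    ["Header", None, None, None, None, None],
    [None, "Col1", None, "Col2", None, None],
    ["A", None, None, "C", "B", None],
    ["X1", "X2", "X3", "X4", "X5", "X6"],
]

# Assumed merged ranges for this sheet: 'Header' over all columns, 'Col2' over C and B.
merged_ranges: List[MergedRange] = [(0, 0, 0, 5), (1, 3, 1, 4)]

columns = to_column_tuples(fill_merged(header_grid, merged_ranges))
for col in columns:
    print(col)
```

Under these assumptions the last column comes out as `('Header', '', '', 'X6')`, and since every tuple has four levels the list could be passed to `pd.MultiIndex.from_tuples`.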

edited Jul 14, 2021 at 6:05

answered Jul 14, 2021 at 4:57

jezrael


6 Comments

Pawan Bhandarkar Over a year ago

Thank you. But I am still having trouble reading the columns right. Even if I were to make them all the same length, I would have expected the last col to be ('Header', '', '', 'X6'). But pandas adds Col2 and B by default and I don't understand why.

jezrael Over a year ago

@PawanBhandarkar - Because the columns are merged in Excel, pandas reads the merged columns together. In your input the last 3 columns are merged under Col2, so Col2 appears 3 times in the second level of the output.

Pawan Bhandarkar Over a year ago

I am a little confused. From the above, you can see that only the cells above B and C are merged. So the cells above X6 are not merged, right? (second level, last column) I have added a link to the spreadsheet above the image, if it helps.

jezrael Over a year ago

@PawanBhandarkar - I see. Then the reason should be that pandas guesses the last column is 'merged' too. In other words, pandas reads a well-formed MultiIndex correctly; if it is not well formed, meaning some levels are omitted, it guesses how it should look.

Pawan Bhandarkar Over a year ago

Yeah... I think so too. It's a little annoying because it seems to default to something like the "ffill" that we see in the fillna() method. I cannot find anything in the docs to disable it.
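
The fill behaviour discussed in these comments can be reproduced with a small model: a blank header cell inherits the label to its left, and the running label resets at every column where a higher row introduced a new label. This is a simplified sketch written to match the output shown in the question, not pandas' actual source. `fill_header_row` is a hypothetical name, and pandas' real implementation may differ between versions.

```python
from typing import List, Tuple

def fill_header_row(row: List[str], control: List[bool]) -> Tuple[List[str], List[bool]]:
    """Forward-fill blanks in one header row, restarting at parent-label boundaries.

    control[i] is False once column i has carried an explicit label in this
    or an earlier row, which marks the start of a new group.
    """
    row = list(row)
    control = list(control)
    last = row[0]
    for i in range(1, len(row)):
        if not control[i]:
            # A new group starts here, so stop carrying the previous label.
            last = row[i]
        if row[i] == "":
            row[i] = last
        else:
            control[i] = False
            last = row[i]
    return row, control

# Header rows from the question, with '' for empty cells.
header_rows = [
    ["Header", "", "", "", "", ""],
    ["", "Col1", "", "Col2", "", ""],
    ["A", "", "", "C", "B", ""],
    ["X1", "X2", "X3", "X4", "X5", "X6"],
]

control = [True] * len(header_rows[0])
filled = []
for r in header_rows:
    new_row, control = fill_header_row(r, control)
    filled.append(new_row)

# One tuple per column, top level first.
model_columns = list(zip(*filled))
for col in model_columns:
    print(col)
```

In this model X6 ends up under Col2 and B because nothing to its right or above it starts a new group, which is consistent with the columns printed in the question once the remaining blanks are dropped.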
