I have an Excel file that looks like this, where the first 4 rows make up the header:

https://docs.google.com/spreadsheets/d/1t6GthmsBADTExk6LhQDR7nKRlZIdDiPYLY3_Ki21GTM/edit?usp=sharing (link to spreadsheet)

I want to read it with Python while preserving the multi-level index on the columns.

Currently, I use the following code to read the Excel file with pandas:

from typing import Tuple

import pandas as pd

df = pd.read_excel("./Demo Xls.xlsx", header=[0, 1, 2, 3])

def clean_multi_index(column: Tuple) -> Tuple:
    # Drop the placeholder labels pandas generates for empty header cells
    return tuple(x for x in column if "Unnamed:" not in x)

cleaned_cols = [clean_multi_index(x) for x in df.columns]
cleaned_cols

However, the columns printed are:

[('Header', 'A', 'X1'),
 ('Header', 'Col1', 'X2'),
 ('Header', 'Col1', 'X3'),
 ('Header', 'Col2', 'C', 'X4'),
 ('Header', 'Col2', 'B', 'X5'),
 ('Header', 'Col2', 'B', 'X6')]

As you can see, pandas seems to make some assumptions about how to fill in the higher-level columns when they are empty.

For example, X6 is not part of the merged Col2 and B levels. The result I am expecting is something like:

[('Header', 'A', 'X1'),
 ('Header', 'Col1', 'X2'),
 ('Header', 'X3'),
 ('Header', 'Col2', 'C', 'X4'),
 ('Header', 'Col2', 'B', 'X5'),
 ('Header', 'X6')]

Almost all the examples I have seen online about reading Excel files in Python cover either merged rows or cases where the higher-level rows are always merged (i.e., when one merge ends, another begins).

Unfortunately, I cannot modify the data to improve its structure and need to work with the xls file as-is.

1 Answer

Neither of your outputs can be converted to a MultiIndex, because all tuples have to be the same length.

One possible solution is to replace the Unnamed labels with empty strings, which gives these tuples:

def clean_multi_index(column: Tuple) -> Tuple:
    return tuple(['' if "Unnamed:" in x else x for x in column])


[('Header', '', 'A', 'X1'),
 ('Header', 'Col1', '', 'X2'),
 ('Header', 'Col1', '', 'X3'),
 ('Header', 'Col2', 'C', 'X4'),
 ('Header', 'Col2', 'B', 'X5'),
 ('Header', 'Col2', 'B', 'X6')]
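
As a quick check that the padded tuples do form a four-level index, here is a minimal sketch that does not need the Excel file. The `raw_cols` list is a hand-written stand-in for `df.columns`, and the exact `Unnamed: ...` labels in it are assumptions about what pandas generates, not values taken from the real file.

```python
from typing import Tuple

import pandas as pd

def clean_multi_index(column: Tuple) -> Tuple:
    # Replace pandas' placeholder labels with '' so every tuple keeps 4 levels
    return tuple("" if "Unnamed:" in x else x for x in column)

# Hypothetical stand-in for list(df.columns); the "Unnamed: ..." names are illustrative
raw_cols = [
    ("Header", "Unnamed: 0_level_1", "A", "X1"),
    ("Header", "Col1", "Unnamed: 1_level_2", "X2"),
    ("Header", "Col1", "Unnamed: 2_level_2", "X3"),
    ("Header", "Col2", "C", "X4"),
    ("Header", "Col2", "B", "X5"),
    ("Header", "Col2", "B", "X6"),
]

cleaned_cols = [clean_multi_index(c) for c in raw_cols]

# All tuples now have the same length, so they map onto a fixed number of levels
columns = pd.MultiIndex.from_tuples(cleaned_cols)
print(columns.nlevels)
```

Note that this only cleans the placeholder labels; the Col1, Col2 and B values that pandas filled in for X3 and X6 are still there.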

The idea is to read the header rows as a DataFrame, forward-fill only where needed (here, the first row), then read the file again while skipping the first 4 rows and assign the processed header to df.columns:

import pandas as pd

# Read only the 4 header rows as plain data
df1 = pd.read_excel('Demo Xls.xlsx', header=None, nrows=4)
# Forward-fill the first row only
df1.iloc[0] = df1.iloc[0].ffill()
print(df1)
        0       1       2       3       4       5
0  Header  Header  Header  Header  Header  Header
1     NaN    Col1     NaN    Col2     NaN     NaN
2       A     NaN     NaN       C       B     NaN
3      X1      X2      X3      X4      X5      X6

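# Read the data rows, skipping the 4 header rows
df = pd.read_excel('Demo Xls.xlsx', header=None, skiprows=4)
# Each row of df1 becomes one level of the column MultiIndex
df.columns = pd.MultiIndex.from_frame(df1.T)
# Remove the level names (0..3) inherited from df1.T
df = df.rename_axis([None, None, None, None], axis=1)
print(df)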
  Header                         
     NaN Col1  NaN Col2  NaN     
       A  NaN  NaN    C    B  NaN
      X1   X2   X3   X4   X5   X6
0    v11  v12  v13  v14  v15  v16
1    v21  v22  v23  v24  v25  v26
2    v31  v32  v33  v34  v35  v36
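
If the merged ranges are known, the same header-as-DataFrame idea can copy each merged label across its own range only, instead of forward-filling a whole row. The following is a sketch under stated assumptions rather than a tested solution for the real file: the header rows are typed in by hand as a stand-in for the `read_excel(..., header=None, nrows=4)` result, and `merged_ranges` is a hypothetical list of (row, first column, last column) triples based on the question's description (Header spans all six columns, Col2 spans only the columns above C and B). In practice such ranges could probably be read from the workbook itself, for example via openpyxl's `worksheet.merged_cells.ranges`, but that step is not shown here.

```python
import numpy as np
import pandas as pd

# Hand-written stand-in for pd.read_excel(..., header=None, nrows=4)
header = pd.DataFrame(
    [
        ["Header", np.nan, np.nan, np.nan, np.nan, np.nan],
        [np.nan, "Col1", np.nan, "Col2", np.nan, np.nan],
        ["A", np.nan, np.nan, "C", "B", np.nan],
        ["X1", "X2", "X3", "X4", "X5", "X6"],
    ]
)

# Assumed merged ranges as (row, first_col, last_col), inclusive and 0-based
merged_ranges = [(0, 0, 5), (1, 3, 4)]

# Copy the top-left value of each merged range across that range only
for row, first, last in merged_ranges:
    header.iloc[row, first:last + 1] = header.iloc[row, first]

# Cells that are still empty become '' so every column keeps 4 levels
header = header.fillna("")

# Each header row is one level of the column MultiIndex
columns = pd.MultiIndex.from_arrays(header.to_numpy().tolist())

# Hypothetical data row, standing in for pd.read_excel(..., header=None, skiprows=4)
df = pd.DataFrame([["v11", "v12", "v13", "v14", "v15", "v16"]], columns=columns)
print(df.columns.tolist())
```

With these assumed ranges, the last column comes out as ('Header', '', '', 'X6') and X5 keeps Col2 and B, which is the padded form of the result the question asks for.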
6 Comments

Thank you. But I am still having trouble reading the columns correctly. Even if I were to make them all the same length, I would have expected the last column to be ('Header', '', '', 'X6'). But pandas adds Col2 and B by default, and I don't understand why.
@PawanBhandarkar - Because the columns are merged in Excel, pandas reads the merged columns together. In your input the last 3 columns are merged under Col2, so Col2 appears 3 times in the second level of the output.
I am a little confused. From the above, you can see that only the cells above B and C are merged. So the cells above X6 are not merged, right? (second level, last column) I have added a link to the spreadsheet above the image, if it helps.
@PawanBhandarkar - I see. Then the reason should be that pandas guesses the last column is 'merged'. In other words, pandas reads a well-formed MultiIndex correctly; if the header is not well-formed, meaning some levels are omitted, it guesses what it should look like.
Yeah... I think so too. It's a little annoying because it seems to default to something like the "ffill" that we see in the fillna() method. I cannot find anything in the docs to disable it.