I currently have an Excel file that looks like this, where the first 4 rows are part of the header:
I want to read it with Python while preserving the MultiIndex on the columns.
Currently, I use the following code to read the file with pandas:
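from typing import Tuple
import pandas as pd

df = pd.read_excel("./Demo Xls.xlsx", header=[0, 1, 2, 3])

def clean_multi_index(column: Tuple) -> Tuple:
    # Drop the "Unnamed: ..." placeholders that pandas generates for empty header cells.
    return tuple(x for x in column if "Unnamed:" not in str(x))

cleaned_cols = [clean_multi_index(x) for x in df.columns]
cleaned_cols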
However, the columns printed are:
[('Header', 'A', 'X1'),
('Header', 'Col1', 'X2'),
('Header', 'Col1', 'X3'),
('Header', 'Col2', 'C', 'X4'),
('Header', 'Col2', 'B', 'X5'),
('Header', 'Col2', 'B', 'X6')]
As you can see, pandas seems to make assumptions about how to fill in the higher-level columns when they are empty: it carries the previous label forward even when the cells are not merged.
For example, X6 is not part of the merged Col2 cell or under B, yet it is labeled with both. The result I am expecting is something like:
[('Header', 'A', 'X1'),
('Header', 'Col1', 'X2'),
('Header', 'X3'),
('Header', 'Col2', 'C', 'X4'),
('Header', 'Col2', 'B', 'X5'),
('Header', 'X6')]
Almost all the examples I have seen online about reading Excel files in Python cover either merged rows or cases where the higher-level rows are always merged (i.e., when one merge ends, another begins).
Unfortunately, I cannot modify the data to improve its structure and need to work with the Excel file as-is.
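To make the expected behavior concrete, here is a minimal sketch of the logic I am after: a label is spread across columns only where the cells are actually merged, and empty cells are otherwise dropped. The function names `header_tuples` and `read_header` are hypothetical, and the `grid` and `merged` values are my assumption of the sheet layout, inferred from the expected output above. `read_header` assumes openpyxl is installed and uses its `ws.merged_cells.ranges`; I have not verified it against the real file.

```python
from typing import Any, List, Optional, Tuple

# (min_row, min_col, max_row, max_col), 1-based and inclusive.
MergedRange = Tuple[int, int, int, int]


def header_tuples(
    grid: List[List[Optional[Any]]],
    merged: List[MergedRange],
) -> List[Tuple[Any, ...]]:
    """Build one label tuple per column, spreading labels only over merged cells."""
    n_cols = len(grid[0])
    filled = [list(row) for row in grid]  # copy so the input grid is not modified
    for min_row, min_col, _max_row, max_col in merged:
        value = grid[min_row - 1][min_col - 1]  # the value lives in the top-left cell
        # Spread horizontally on the top row only, so a vertical merge
        # does not repeat the same label on several levels.
        for c in range(min_col - 1, min(max_col, n_cols)):
            filled[min_row - 1][c] = value
    # Collect the non-empty labels of each column, top to bottom.
    return [
        tuple(row[c] for row in filled if row[c] is not None)
        for c in range(n_cols)
    ]


def read_header(path: str, n_header_rows: int = 4):
    """Hypothetical helper: read the header grid and merged ranges with openpyxl."""
    from openpyxl import load_workbook  # third-party, imported only when needed

    ws = load_workbook(path).active
    grid = [
        [ws.cell(row=r, column=c).value for c in range(1, ws.max_column + 1)]
        for r in range(1, n_header_rows + 1)
    ]
    merged = [
        (rng.min_row, rng.min_col, rng.max_row, rng.max_col)
        for rng in ws.merged_cells.ranges
        if rng.min_row <= n_header_rows
    ]
    return grid, merged


# Assumed layout: "Header" merged over all six columns, "Col2" merged over
# columns 4-5, and every other empty cell left unmerged.
grid = [
    ["Header", None, None, None, None, None],
    ["A", "Col1", None, "Col2", None, None],
    [None, None, None, "C", "B", None],
    ["X1", "X2", "X3", "X4", "X5", "X6"],
]
merged = [(1, 1, 1, 6), (2, 4, 2, 5)]
columns = header_tuples(grid, merged)
print(columns)
```

With the real file, the inputs would come from `read_header("./Demo Xls.xlsx")` instead of the hard-coded values. Since the resulting tuples have different lengths, they would probably need to be padded to a common length before being used as a pandas MultiIndex.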
