
I have the following Pandas dataframe taken from an Excel file (link to the Excel file)

I would like to flatten the Excel table with Pandas by converting the current headers (the first two rows) to dataframe columns. This is where I want to get to:

segment unit    category    sub_category    value
seg1    kg      cat01       sub_cat_1.1     1
seg2    kg      cat01       sub_cat_1.1     2
seg1    kg      cat01       sub_cat_1.2     3
seg2    kg      cat01       sub_cat_1.2     
seg1    kg      cat02       sub_cat_2.1     4
seg2    kg      cat02       sub_cat_2.1     5

What I have done so far is the following, but it doesn't work as expected:

import pandas as pd

_file_name = "stackoverflow_excel_data_example.xlsx"
df = pd.read_excel(_file_name,  header=[0,1]).sort_index()
df = df.stack()
print(df)

Does anyone know how to convert a custom, pivot-like table to a flat dataframe?

2 Answers


No real magic here: you need to reorganize your column MultiIndex first:

df.columns = pd.MultiIndex.from_tuples([('segment', ''), ('unit', '')] +
                                       df.columns[2:].to_list(),
                                       names=df.columns[1])
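
For anyone without the Excel file, here is a minimal sketch of the same relabeling on a hand-built frame. The frame is only a stand-in for what `read_excel(..., header=[0, 1])` might return: the labels of the first two columns are placeholders I made up, and the level names are passed explicitly instead of being taken from `df.columns[1]`.

```python
import pandas as pd

# Hypothetical stand-in for the frame read from Excel with header=[0, 1].
# The labels of the first two columns are assumptions, not taken from the file.
raw_columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Unnamed: 0_level_1"),
    ("category", "sub_category"),
    ("cat01", "sub_cat_1.1"),
    ("cat01", "sub_cat_1.2"),
    ("cat02", "sub_cat_2.1"),
])
df = pd.DataFrame(
    [["seg1", "kg", 1, 3.0, 4],
     ["seg2", "kg", 2, float("nan"), 5]],
    columns=raw_columns,
)

# Relabel the two identifier columns and name the two header levels.
df.columns = pd.MultiIndex.from_tuples(
    [("segment", ""), ("unit", "")] + df.columns[2:].to_list(),
    names=["category", "sub_category"],
)

print(df.columns.names)
print(df.columns.to_list())
```

The identifier columns get an empty string on the second level, so they can later be addressed simply as "segment" and "unit".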

At this point, df looks like:

>>> df
category     segment unit       cat01                   cat02
sub_category              sub_cat_1.1 sub_cat_1.2 sub_cat_2.1 sub_cat_2.1.1 sub_cat_2.1.2 sub_cat_2.1.3 sub_cat_2.1.4
0               seg1   kg           1         3.0           4           NaN           NaN           NaN           NaN
1               seg2   kg           2         NaN           5           NaN           NaN           NaN           NaN

Now you can apply transformation:

>>> df.set_index(["segment", "unit"]) \
      .stack(level=[0, 1])\
      .rename("value") \
      .reset_index()

  segment unit category sub_category  value
0    seg1   kg    cat01  sub_cat_1.1    1.0
1    seg1   kg    cat01  sub_cat_1.2    3.0
2    seg1   kg    cat02  sub_cat_2.1    4.0
3    seg2   kg    cat01  sub_cat_1.1    2.0
4    seg2   kg    cat02  sub_cat_2.1    5.0
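
A self-contained sketch of the same chain, assuming a hand-built frame that already has the reorganized columns (the data below is typed in from the question, not read from the Excel file). The only change from the chain above is an explicit `.dropna()`: older pandas versions drop the empty cells inside `stack`, but newer versions may keep them, so dropping explicitly should give the same rows either way.

```python
import pandas as pd

# Hypothetical frame with the column MultiIndex already reorganized.
columns = pd.MultiIndex.from_tuples(
    [("segment", ""), ("unit", ""),
     ("cat01", "sub_cat_1.1"), ("cat01", "sub_cat_1.2"),
     ("cat02", "sub_cat_2.1")],
    names=["category", "sub_category"],
)
df = pd.DataFrame(
    [["seg1", "kg", 1, 3.0, 4],
     ["seg2", "kg", 2, float("nan"), 5]],
    columns=columns,
)

flat = (
    df.set_index(["segment", "unit"])  # keep the identifiers out of the stack
    .stack(level=[0, 1])               # move both header levels into the index
    .dropna()                          # drop empty cells regardless of pandas version
    .rename("value")
    .reset_index()
)
print(flat)
```

If you want to keep the empty cell (seg2 / sub_cat_1.2) as a row, as in the question's target table, leave out `.dropna()` and check what your pandas version does with missing values in `stack`.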

1 Comment

Exactly what I needed. Thank you so much!

df = pd.read_excel(..., header=[0, 1])
df = (
    df
    .iloc[:, 2:]
    .set_index(df.iloc[:, 0])
    .set_index(df.iloc[:, 1], append=True)
    .stack([0, 1])
    .rename_axis(["segment", "unit", "category", "sub_category"])
    .rename("value")
    .reset_index()
)

The result for the provided example input was attached as an image (not reproduced here).

2 Comments

Good answer really. +1. add .rename('value').reset_index() (and swap category and sub_category)
Thank you for your answer. As mentioned by @Corralien, I had to add .rename('value').reset_index().
