How to slice and create multiple pandas DataFrames from a single DataFrame

Question

I am reading an Excel file using pandas and want to create multiple DataFrames from the original DataFrame. Each DataFrame should be named after its heading in row 1. Also, how do I skip the single blank column between each transaction?

Expected result:

transaction_1:
name id available capacity completed all

transaction_2:
name id available capacity completed all

transaction_3:
name id available capacity completed all

What I tried:

import pandas as pd
import pprint as pp
pd.options.display.width = 0
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999
df = pd.read_excel(r'capacity.xlsx', sheet_name='Sprint Details', header=0)
df1 = df.iloc[:, 0:3]
print(df1)

Cameron Riddell · Accepted Answer · 2020-11-12 06:52:47Z

You can try this (works with pd.__version__ == 1.1.1):

df = (pd.read_excel(
          "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
       )
      .dropna(axis=1, how="all")
      .rename_axis(index=["name", "id"], columns=[None, None]))

transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()

Essentially, we need to read the sheet in as a dataframe with a MultiIndex. The first 2 rows hold the column names, hence header=[0, 1], and the first 2 columns form the index shared by each "subtable", hence index_col=[0, 1].

Because a blank column separates the tables, we will have columns that are entirely NaN, so we drop those with .dropna(axis=1, how="all").
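To see what that drop does without opening the workbook, here is a minimal sketch on a made-up frame. The data, the two table names, and the "Unnamed: ..." label of the spacer column are all assumptions standing in for whatever read_excel produces for your sheet.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: two tables separated by a blank column.
columns = pd.MultiIndex.from_tuples([
    ("transaction_1", "available"),
    ("transaction_1", "capacity"),
    ("Unnamed: 4_level_0", "Unnamed: 4_level_1"),  # assumed label of the spacer
    ("transaction_2", "available"),
    ("transaction_2", "capacity"),
])
raw = pd.DataFrame(
    [[1, 2, np.nan, 3, 4],
     [5, 6, np.nan, 7, 8]],
    columns=columns,
)

# how="all" drops a column only when every value in it is NaN,
# so data columns with a few missing cells would survive.
cleaned = raw.dropna(axis=1, how="all")
print(cleaned.columns.get_level_values(0).unique().tolist())
```

Only the spacer column is removed here; the four data columns are kept.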

Because pandas does not expect the index names and the column names to be in the same row, it will likely misparse your index column names ["name", "id"] as the name of the second level of the column index. To remedy this, we can manually assign the correct index names while also removing the column index names via rename_axis(index=["name", "id"], columns=[None, None]).
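A small sketch of that rename_axis call on an in-memory frame. The frame is hypothetical, and the stray column level name "name" is only my assumption about what the misparsed result looks like; the exact shape may differ between pandas versions.

```python
import pandas as pd

# Hypothetical frame mimicking the misparse: the row index has no names,
# and a stray name sits on the second column level (assumed for illustration).
index = pd.MultiIndex.from_tuples([("alice", 1), ("bob", 2)])
columns = pd.MultiIndex.from_product(
    [["transaction_1", "transaction_2"], ["available", "capacity"]],
    names=[None, "name"],
)
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=index, columns=columns)

# Name the two index levels and clear the names on both column levels.
fixed = df.rename_axis(index=["name", "id"], columns=[None, None])
print(list(fixed.index.names))
print(list(fixed.columns.names))
```

Note that both lists must match the number of levels on their axis, which is why the call assumes a 2-level index and 2-level columns.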

Now that we have a nicely formatted table with MultiIndex columns, we can simply slice out each table and call .reset_index() on it so that "name" and "id" become regular columns.
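If you would rather not write one assignment per transaction, a dict keyed by the row 1 heading is one possible variation. This is a sketch on made-up data, not something from your file; the names tables and label are mine.

```python
import pandas as pd

# Hypothetical, already-cleaned frame: 2-level row index, 2-level columns.
index = pd.MultiIndex.from_tuples(
    [("alice", 1), ("bob", 2)], names=["name", "id"]
)
columns = pd.MultiIndex.from_product(
    [["transaction_1", "transaction_2"], ["available", "capacity"]]
)
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=index, columns=columns)

# Selecting a top-level label returns that sub-table with the top level dropped;
# reset_index() then moves "name" and "id" back into ordinary columns.
tables = {
    label: df[label].reset_index()
    for label in df.columns.get_level_values(0).unique()
}
print(tables["transaction_2"])
```

You can then look up tables["transaction_1"], tables["transaction_2"], and so on, which scales better than separate variables when the number of transactions changes.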

Edit: Seems we have a parsing difference between our versions of pandas.

Option 1

If you can directly modify the Excel sheet, add another row to better separate the column names from the index names. This will provide the most robust results.

With that extra row in place, the following code works:

df = (pd.read_excel(
          "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
       )
      .dropna(axis=1, how="all"))

transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()

Option 2

If you cannot modify the Excel file, we'll unfortunately need a more roundabout method.

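# Read both header rows, but keep "name" and "id" as ordinary columns for now.
df = pd.read_excel(
    "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1]
).dropna(axis=1, how="all")
# Build the row index from the first two columns, dropping their top header level.
index = pd.MultiIndex.from_frame(df.iloc[:, :2].droplevel(0, axis=1))

df = df.iloc[:, 2:].set_axis(index)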

transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()
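To check the Option 2 steps without the workbook, here is the same idea on an in-memory frame. The data and the "Unnamed: ..." top-level labels are assumptions about what reading with header=[0, 1] and no index_col would give you.

```python
import pandas as pd

# Hypothetical result of reading with header=[0, 1] and no index_col:
# "name" and "id" sit on the second header level under placeholder labels.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "name"),
    ("Unnamed: 1_level_0", "id"),
    ("transaction_1", "available"),
    ("transaction_1", "capacity"),
    ("transaction_2", "available"),
    ("transaction_2", "capacity"),
])
df = pd.DataFrame(
    [["alice", 1, 10, 20, 30, 40],
     ["bob", 2, 50, 60, 70, 80]],
    columns=columns,
)

# Drop the top header level from the first two columns, leaving "name" and "id",
# then turn those two columns into a row MultiIndex.
index = pd.MultiIndex.from_frame(df.iloc[:, :2].droplevel(0, axis=1))

# Keep only the data columns and attach the new row index.
df = df.iloc[:, 2:].set_axis(index)

transaction_1 = df["transaction_1"].reset_index()
print(transaction_1)
```

The result has the same shape as in the first approach: "name" and "id" followed by that transaction's own columns.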

edited Nov 12, 2020 at 6:52

answered Nov 12, 2020 at 5:07


6 Comments

Mona Over a year ago

Thanks @Cameron Riddell. I am getting this error after implementing your code: raise ValueError(f"Length of new names must be 1, got {len(values)}") ValueError: Length of new names must be 1, got 2

Cameron Riddell Over a year ago

To double check: are you sure you have header=[0,1] and index_col=[0,1] in your code when reading the Excel file? The error is complaining that either your index or your columns have only 1 level, while rename_axis is being given names for 2 levels. You can try commenting out the rename_axis portion of the code and looking at the output of df.columns and df.index to ensure that both are pd.MultiIndex objects with 2 levels.

Mona Over a year ago

yep. it's the same. ok let me try commenting out and seeing

Cameron Riddell Over a year ago

So, even with deleting the .rename_axis(...) method you're still receiving an error stating ValueError: Length of new names must be 1, got 2?

Cameron Riddell Over a year ago

Just updated the answer with some potential workarounds. It seems we're working with different versions of pandas that parse the MultiIndex differently (yours leads to an error, and mine leads to a result that I overwrite).
