I am reading an Excel file using pandas and want to create multiple data frames from the original data frame. Each data frame should be named after its heading in row 1. Also, how do I skip the empty column between each transaction?

Expected result:

transaction_1:
name id available capacity completed all

transaction_2:
name id available capacity completed all

transaction_3:
name id available capacity completed all

What I tried:

import pandas as pd
import pprint as pp
pd.options.display.width = 0
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999
df = pd.read_excel(r'capacity.xlsx', sheet_name='Sprint Details', header=0)
df1 = df.iloc[:, 0:3]
print(df1)

1 Answer

You can try this (works with pd.__version__ == 1.1.1):

df = (pd.read_excel(
          "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
       )
      .dropna(axis=1, how="all")
      .rename_axis(index=["name", "id"], columns=[None, None]))

transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()

Essentially, we need to read the sheet in as a dataframe with a MultiIndex. The first 2 rows are our column names, hence header=[0, 1]. The first 2 columns are the index that will be shared by each "subtable", hence index_col=[0, 1].

Because there are spaces in each table, we will have columns that are entirely NaN so we drop those with .dropna(axis=1, how="all").
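As a small illustration of that step, here is a sketch on a made-up frame (the column names and values are assumptions, not taken from your sheet). It shows that how="all" removes only the columns that are empty in every row and keeps columns that are merely missing some values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: two data columns separated by a blank spacer column.
df = pd.DataFrame(
    {
        "available": [5, 3],
        "spacer": [np.nan, np.nan],  # entirely NaN, like the blank column between transactions
        "completed": [2, np.nan],    # only partially NaN, so it is kept
    }
)

# axis=1 works on columns; how="all" drops a column only if every value in it is NaN.
cleaned = df.dropna(axis=1, how="all")
print(list(cleaned.columns))
```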

Because pandas does not expect the index names and the column names to be in the same row, it incorrectly parses your index column names ["name", "id"] as the name of the second level of the column index. To remedy this, we can manually assign the correct index names while also removing the column index names via rename_axis(index=["name", "id"], columns=[None, None]).

Now that we have a nicely formatted table with MultiIndex columns, we can simply slice out each table by its top-level label and call .reset_index() on the result so that "name" and "id" become regular columns in each table.
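To make the slicing step concrete without needing the Excel file, here is a minimal sketch on a hand-built frame. The names and numbers are invented, and the `tables` dict is my own suggestion rather than something you asked for: since you want each frame named after its row 1 heading, a dict keyed by that heading may be easier to work with than one variable per transaction.

```python
import pandas as pd

# Hypothetical data shaped like the parsed sheet: two header levels, two index levels.
columns = pd.MultiIndex.from_product(
    [["transaction_1", "transaction_2"], ["available", "capacity"]]
)
index = pd.MultiIndex.from_tuples([("alice", 1), ("bob", 2)], names=["name", "id"])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=index, columns=columns)

# Selecting a top-level label returns that block with only the inner column level;
# reset_index() then turns the "name" and "id" index levels into ordinary columns.
transaction_1 = df["transaction_1"].reset_index()

# Optional: build every sub-table at once, keyed by its top-level heading.
tables = {
    label: df[label].reset_index()
    for label in df.columns.get_level_values(0).unique()
}
print(tables["transaction_2"])
```

With this layout, tables["transaction_2"] plays the role of the transaction_2 variable above.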


Edit: Seems we have a parsing difference between our versions of pandas.

Option 1. If you can directly modify the Excel sheet, add another row to better separate the column headers from the index names. This will provide the most robust results.

The following code works:

df = (pd.read_excel(
          "capacity.xlsx", sheet_name="Sprint Details", header=[0, 1], index_col=[0, 1]
       )
      .dropna(axis=1, how="all"))

transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()

Option 2

If you cannot modify the Excel file, we'll unfortunately need a more roundabout method.

df = pd.read_excel("capacity.xlsx", sheet_name="Sprint Details", header=[0,1]).dropna(axis=1, how="all")
index = pd.MultiIndex.from_frame(df.iloc[:, :2].droplevel(0, axis=1))

df = df.iloc[:, 2:].set_axis(index)

transaction_1 = df["transaction_1"].reset_index()
transaction_2 = df["transaction_2"].reset_index()
transaction_3 = df["transaction_3"].reset_index()
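The same Option 2 steps can be tried on an in-memory frame, which may help when checking what your pandas version produces. In this sketch the "Unnamed: ..." labels are an assumption about how the blank top-level header cells get named, and the data values are made up.

```python
import pandas as pd

# Hypothetical result of reading with header=[0, 1] and no index_col:
# the "name"/"id" columns sit under placeholder top-level labels.
columns = pd.MultiIndex.from_tuples(
    [
        ("Unnamed: 0_level_0", "name"),
        ("Unnamed: 1_level_0", "id"),
        ("transaction_1", "available"),
        ("transaction_1", "capacity"),
    ]
)
raw = pd.DataFrame([["alice", 1, 5, 8], ["bob", 2, 3, 6]], columns=columns)

# Take the first two columns, drop their placeholder top level, and use them as a MultiIndex.
index = pd.MultiIndex.from_frame(raw.iloc[:, :2].droplevel(0, axis=1))

# Keep only the data columns and attach the new row index.
df = raw.iloc[:, 2:].set_axis(index, axis=0)

transaction_1 = df["transaction_1"].reset_index()
print(transaction_1)
```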

6 Comments

Thanks @Cameron Riddell. I am getting this error after I implemented your code: raise ValueError(f"Length of new names must be 1, got {len(values)}") ValueError: Length of new names must be 1, got 2
To double check: are you sure you have header=[0,1] and index_col=[0,1] in your code when reading the Excel file? The error is complaining that either your index or your columns has only 1 level, while the rename_axis call provides names assuming there are 2 levels. You can try commenting out the rename_axis portion of the code and looking at the output of df.columns and df.index to ensure that both are pd.MultiIndex objects with 2 levels.
Yep, it's the same. OK, let me try commenting it out and see.
So, even after deleting the .rename_axis(...) call you're still receiving the error ValueError: Length of new names must be 1, got 2?
Just updated the answer with some potential workarounds. It seems we're working with different versions of pandas that parse the MultiIndex differently (yours leads to an error, and mine leads to a result that I overwrite).