Multiple columns as the index for each level in a MultiIndex dataframe

Question

I'm having trouble describing the way my data is structured, hopefully this image will help.

As you can see I have 3 columns: Name, Corp #, and Dissagregation Category, that together serve as a unique identifier for each row. There are two super headers, "English/Language Arts Scores", and "Mathematics Scores." I want to create two separate dataframes for each header. This is what I've managed so far:

df1 = pd.read_excel(file, header=None)
vals = df1.values
mux = pd.MultiIndex.from_arrays(df1.ffill(1).values[:2, 1:], names=[None, 'Name'])
df = pd.DataFrame(df1.values[2:, 1:], df1.values[2:, 0], mux)

This generates a dataframe with two levels: df['English/Language Arts Scores'] and df['Mathematics Scores'], that contain the columns under the header and "Name" as the index. I want these dataframes to include: Name, Copr #, and Dissagregation Category as columns or as the index.

What would be the best way to go about doing this?

EDIT

Here is a copy and pasteable snippet of my data:

                                          English        Math   
         A        B          C           X       Y     X       Y
 ADAMS CENTRAL  0015    All Students    83      590    83     579
 ADAMS CENTRAL  0015    General 1       0       ***    0      ***
 ADAMS CENTRAL  0015    Total Gene      71      590    71     579
 ADAMS West     0016    All Students    93      440    83     765
 ADAMS West     0016    General 1       1        33    0      660
 ADAMS West     0016    Total Gene      31      ***    46     572

The output that I want should look like this:

In [1]: df['English']
Out[1]:                                              
         A        B          C           X       Y     
 ADAMS CENTRAL  0015    All Students    83      590    
 ADAMS CENTRAL  0015    General 1       0       ***    
 ADAMS CENTRAL  0015    Total Gene      71      590    
 ADAMS West     0016    All Students    93      440    
 ADAMS West     0016    General 1       1        33    
 ADAMS West     0016    Total Gene      31      ***  

In [2]: df['Math']
Out[2]:                                                 
         A        B          C           X       Y
 ADAMS CENTRAL  0015    All Students    83      579
 ADAMS CENTRAL  0015    General 1       0       ***
 ADAMS CENTRAL  0015    Total Gene      71      579
 ADAMS West     0016    All Students    83      765
 ADAMS West     0016    General 1       0       660
 ADAMS West     0016    Total Gene      46      572

Please see how to make good, reproducible pandas examples. Posting images of a spreadsheet program is not very helpful. — juanpa.arrivillaga
– juanpa.arrivillaga, Commented Nov 29, 2017 at 17:57
Thanks. Let me see if I can make the data into runnable code. — otteheng
– otteheng, Commented Nov 29, 2017 at 18:01
could you maybe provide a sample expected output: the phrase ` I want to create two separate dataframes for each header` seems a little ambiguous to me and I want to make sure the viewers will interpret your need correctly — MattR
– MattR, Commented Nov 29, 2017 at 18:04

BENY · Accepted Answer · 2017-11-30 23:24:07Z

1

We can using read_excel, read the file, then using .loc, slice the columns level0 you need

df = pd.read_excel('yourexcel.xlsx',
                   header=[0,1],
                   index_col=[0,1,2],
                   sheetname="Sheet1")

df.loc[:,'English']
Out[837]: 
C                               X    Y
ADAMS CENTRAL 15 All Students  83  590
                 General 1      0  ***
                 Total Gene    71  590
ADAMS West    16 All Students  93  440
                 General 1      1   33
                 Total Gene    31  ***

To make it more neat add rename_axis

df.loc[:,'English'].rename_axis(None,1)
Out[840]: 
                                X    Y
ADAMS CENTRAL 15 All Students  83  590
                 General 1      0  ***
                 Total Gene    71  590
ADAMS West    16 All Students  93  440
                 General 1      1   33
                 Total Gene    31  ***

Our base on your method

vals = df.iloc[3:,:].values
df1 = pd.DataFrame(df.values[3:, 3:], df.values[3:, 0:3])
mux = pd.MultiIndex.from_arrays(df.ffill().ffill(1).values[1:3, 3:])
df1.columns=mux
df1.index = pd.MultiIndex.from_tuples(df1.index)

edited Nov 30, 2017 at 23:24

answered Nov 29, 2017 at 19:13

BENY

324k22 gold badges176 silver badges250 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

otteheng Over a year ago

I'm getting the error ValueError: Length of new names must be 1, got 2. When I read in the excel there are blank columns at the end of the file (in my example after Y under Math). If I manually delete them the error goes away. Is there a way to avoid the error without manually deleting the columns each time?

BENY Over a year ago

@otteheng base on your data, I can not reproduce the problem, Maybe you can show the the file ?

BENY Over a year ago

@otteheng do you know how many columns you want to read into python from excel ?

otteheng Over a year ago

The number of columns varies from excel to excel unfortunately. Is there a way to incrementally remove the last column until the command no longer errors?

BENY Over a year ago

@otteheng pandas.pydata.org/pandas-docs/stable/generated/…, you can check the link, I will guess for your case the use.col may help

|

Collectives™ on Stack Overflow

Multiple columns as the index for each level in a MultiIndex dataframe

1 Answer 1

9 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

9 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related