1

I'm having trouble describing the way my data is structured, hopefully this image will help. enter image description here

As you can see I have 3 columns: Name, Corp #, and Dissagregation Category, that together serve as a unique identifier for each row. There are two super headers, "English/Language Arts Scores", and "Mathematics Scores." I want to create two separate dataframes for each header. This is what I've managed so far:

df1 = pd.read_excel(file, header=None)
vals = df1.values
mux = pd.MultiIndex.from_arrays(df1.ffill(1).values[:2, 1:], names=[None, 'Name'])
df = pd.DataFrame(df1.values[2:, 1:], df1.values[2:, 0], mux)

This generates a dataframe with two levels: df['English/Language Arts Scores'] and df['Mathematics Scores'], that contain the columns under the header and "Name" as the index. I want these dataframes to include: Name, Copr #, and Dissagregation Category as columns or as the index.

What would be the best way to go about doing this?

EDIT

Here is a copy and pasteable snippet of my data:

                                          English        Math   
         A        B          C           X       Y     X       Y
 ADAMS CENTRAL  0015    All Students    83      590    83     579
 ADAMS CENTRAL  0015    General 1       0       ***    0      ***
 ADAMS CENTRAL  0015    Total Gene      71      590    71     579
 ADAMS West     0016    All Students    93      440    83     765
 ADAMS West     0016    General 1       1        33    0      660
 ADAMS West     0016    Total Gene      31      ***    46     572

The output that I want should look like this:

In [1]: df['English']
Out[1]:                                              
         A        B          C           X       Y     
 ADAMS CENTRAL  0015    All Students    83      590    
 ADAMS CENTRAL  0015    General 1       0       ***    
 ADAMS CENTRAL  0015    Total Gene      71      590    
 ADAMS West     0016    All Students    93      440    
 ADAMS West     0016    General 1       1        33    
 ADAMS West     0016    Total Gene      31      ***  

In [2]: df['Math']
Out[2]:                                                 
         A        B          C           X       Y
 ADAMS CENTRAL  0015    All Students    83      579
 ADAMS CENTRAL  0015    General 1       0       ***
 ADAMS CENTRAL  0015    Total Gene      71      579
 ADAMS West     0016    All Students    83      765
 ADAMS West     0016    General 1       0       660
 ADAMS West     0016    Total Gene      46      572
4
  • 1
    Please see how to make good, reproducible pandas examples. Posting images of a spreadsheet program is not very helpful. Commented Nov 29, 2017 at 17:57
  • Thanks. Let me see if I can make the data into runnable code. Commented Nov 29, 2017 at 18:01
  • could you maybe provide a sample expected output: the phrase ` I want to create two separate dataframes for each header` seems a little ambiguous to me and I want to make sure the viewers will interpret your need correctly Commented Nov 29, 2017 at 18:04
  • Of course. It is a little vague. Commented Nov 29, 2017 at 18:15

1 Answer 1

1

We can using read_excel, read the file, then using .loc, slice the columns level0 you need

df = pd.read_excel('yourexcel.xlsx',
                   header=[0,1],
                   index_col=[0,1,2],
                   sheetname="Sheet1")

df.loc[:,'English']
Out[837]: 
C                               X    Y
ADAMS CENTRAL 15 All Students  83  590
                 General 1      0  ***
                 Total Gene    71  590
ADAMS West    16 All Students  93  440
                 General 1      1   33
                 Total Gene    31  ***

To make it more neat add rename_axis

df.loc[:,'English'].rename_axis(None,1)
Out[840]: 
                                X    Y
ADAMS CENTRAL 15 All Students  83  590
                 General 1      0  ***
                 Total Gene    71  590
ADAMS West    16 All Students  93  440
                 General 1      1   33
                 Total Gene    31  ***

Our base on your method

vals = df.iloc[3:,:].values
df1 = pd.DataFrame(df.values[3:, 3:], df.values[3:, 0:3])
mux = pd.MultiIndex.from_arrays(df.ffill().ffill(1).values[1:3, 3:])
df1.columns=mux
df1.index = pd.MultiIndex.from_tuples(df1.index)
Sign up to request clarification or add additional context in comments.

9 Comments

I'm getting the error ValueError: Length of new names must be 1, got 2. When I read in the excel there are blank columns at the end of the file (in my example after Y under Math). If I manually delete them the error goes away. Is there a way to avoid the error without manually deleting the columns each time?
@otteheng base on your data, I can not reproduce the problem, Maybe you can show the the file ?
@otteheng do you know how many columns you want to read into python from excel ?
The number of columns varies from excel to excel unfortunately. Is there a way to incrementally remove the last column until the command no longer errors?
@otteheng pandas.pydata.org/pandas-docs/stable/generated/…, you can check the link, I will guess for your case the use.col may help
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.