How to remove footer and header while creating a pandas dataframe by loading an excel file

Question

I have loaded an Excel file using pd.read_excel(). The DataFrame has header data in the top 6 rows and footer data from the 213th row to the last row (the 232nd). I need to remove the header and footer and get a new DataFrame. How can I do this using boolean masking?

read_excel has skiprows and skipfooter parameters that let you specify how many rows to drop from the top and bottom respectively.
– sammywemmy, Commented May 6, 2020 at 5:57
Please convert your input (Excel) file to a CSV version and add the top and trailing lines from it (plus a few data rows) as text, not as an image. The top rows matter in particular, because they indicate whether some of them should be skipped or read as a MultiIndex on columns.
– Valdi_Bo, Commented May 6, 2020 at 7:57
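
A minimal sketch of the skiprows / skipfooter suggestion from the first comment. The function name and its defaults are hypothetical: 6 leading rows and 20 trailing rows (rows 213 through 232) are taken from the question, and whether 6 is the right count depends on whether the sheet's first row has already been consumed as column names.

```python
import pandas as pd


def load_data_rows(path, n_header=6, n_footer=20):
    """Read an Excel sheet, dropping non-data rows at the top and bottom.

    n_header and n_footer are assumptions based on the question;
    adjust them to the real sheet.
    """
    # After skipping n_header rows, the next row is used as column names
    # unless header=None is also passed.
    return pd.read_excel(path, skiprows=n_header, skipfooter=n_footer)


# Hypothetical usage (the file name is illustrative):
# df = load_data_rows("report.xlsx")
```

Doing the trimming at load time avoids masking the frame afterwards.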

Harsha · Accepted Answer · 2020-05-06 04:05:07Z

You can simply use the following code:

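# Keep only the rows whose index label is between 7 and 212 (inclusive)
df = df[(df.index > 6) & (df.index < 213)]
# Shift the remaining labels so that they start at 0
df.index -= 7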
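
The masking above can be tried on a stand-in frame. This is a sketch under assumptions: the 233-row frame is synthetic, the bounds mirror the ones used in this answer (adjust them if the header occupies a different number of rows), and reset_index(drop=True) is used here as an alternative way to renumber the rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_excel():
# 233 rows with the default index 0..232.
raw = pd.DataFrame({"value": np.arange(233)})

# Boolean mask that is True only for index labels 7..212.
mask = (raw.index > 6) & (raw.index < 213)

# Keep the data rows and renumber them from 0.
clean = raw[mask].reset_index(drop=True)

# Positional slicing selects the same rows without building a mask.
same = raw.iloc[7:213].reset_index(drop=True)

print(len(clean), clean.equals(same))
```

The mask works on index labels, so it relies on the frame still having its default integer index; iloc works on positions regardless of the labels.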


Valdi_Bo · Accepted Answer · 2020-05-06 08:52:50Z

Assume that the input file contains something like:

  The title of the whole data set

=============================================
Date,Amount,Aaaa,Aaaa,Bbbb,Bbbb
 , ,Start,Stop,Start,Stop
=============================================
2012-01-01,120.5,10,20,21,28
2012-01-02,130.9,12,24,25,30
2012-01-03,140.0,13,28,29,36
2012-01-04,150.7,15,32,31,44
2012-01-05,160.1,19,36,36,70
=============================================
Summary row - to be skipped
=============================================

Details:

Rows 0 through 2 (zero-based numbers) - skip entirely.
Rows 3 and 4 - the actual column titles (MultiIndex).
Row 5 - skip.
Data rows (5 in my example).
Last 3 rows - footer, to be skipped.

Note also that the second title row contains spaces as its first 2 names. If those fields were empty, read_csv would assign default names composed of Unnamed: plus a number, which is better avoided.

To read this content, you can call:

import pandas as pd

df = pd.read_csv('Input.csv', skiprows=[0, 1, 2, 5],
    header=[0, 1], skipfooter=3, engine='python')

How these parameters are interpreted:

skiprows - skip rows with these indices.
header - from these lines read column names.
skipfooter - skip 3 last lines.
engine - specify python explicitly; otherwise read_csv starts with the default c engine, which does not support skipfooter, and falls back to the python engine with a warning.

Note that line numbers passed as header indicate lines after skiprows (in the original input file their row indices are actually 3 and 4).

The result is:

         Date Amount  Aaaa       Bbbb     
                     Start Stop Start Stop
0  2012-01-01  120.5    10   20    21   28
1  2012-01-02  130.9    12   24    25   30
2  2012-01-03  140.0    13   28    29   36
3  2012-01-04  150.7    15   32    31   44
4  2012-01-05  160.1    19   36    36   70

i.e. a MultiIndex on columns (2 levels), 5 data rows and no source footer.
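
The read above can be reproduced without a file on disk. The following is a minimal sketch that feeds the sample text through io.StringIO instead of Input.csv; the variable names are illustrative, and the text is built without trailing blank lines so that the footer is exactly 3 lines.

```python
import io

import pandas as pd

rule = "=" * 45
lines = [
    "  The title of the whole data set",  # row 0 - skipped
    "",                                   # row 1 - skipped
    rule,                                 # row 2 - skipped
    "Date,Amount,Aaaa,Aaaa,Bbbb,Bbbb",    # row 3 - header level 0
    " , ,Start,Stop,Start,Stop",          # row 4 - header level 1
    rule,                                 # row 5 - skipped
    "2012-01-01,120.5,10,20,21,28",
    "2012-01-02,130.9,12,24,25,30",
    "2012-01-03,140.0,13,28,29,36",
    "2012-01-04,150.7,15,32,31,44",
    "2012-01-05,160.1,19,36,36,70",
    rule,                                 # footer, 3 lines
    "Summary row - to be skipped",
    rule,
]
sample = "\n".join(lines) + "\n"

df = pd.read_csv(io.StringIO(sample), skiprows=[0, 1, 2, 5],
                 header=[0, 1], skipfooter=3, engine="python")

print(df.shape)            # number of data rows and columns
print(df.columns.nlevels)  # number of levels in the column index
```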

Edit

Another option is to skip the 6 initial lines entirely. In this case:

pass skiprows=6 to skip the 6 initial lines,
pass header=None to prevent the first remaining line from being read as the column names row; column names are then set to consecutive numbers starting from 0,
or pass the names parameter with the column names to be used (but this time only a "plain" index on columns can be specified).
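
A minimal sketch of both variants, again using io.StringIO on the sample text rather than a file. The flat column names in the names list are hypothetical, chosen only for illustration.

```python
import io

import pandas as pd

rule = "=" * 45
sample = "\n".join([
    "  The title of the whole data set",
    "",
    rule,
    "Date,Amount,Aaaa,Aaaa,Bbbb,Bbbb",
    " , ,Start,Stop,Start,Stop",
    rule,
    "2012-01-01,120.5,10,20,21,28",
    "2012-01-02,130.9,12,24,25,30",
    "2012-01-03,140.0,13,28,29,36",
    "2012-01-04,150.7,15,32,31,44",
    "2012-01-05,160.1,19,36,36,70",
    rule,
    "Summary row - to be skipped",
    rule,
]) + "\n"

# Variant 1: no header row at all, columns are numbered from 0.
plain = pd.read_csv(io.StringIO(sample), skiprows=6, header=None,
                    skipfooter=3, engine="python")

# Variant 2: supply flat column names (hypothetical names).
names = ["Date", "Amount", "Aaaa_Start", "Aaaa_Stop",
         "Bbbb_Start", "Bbbb_Stop"]
named = pd.read_csv(io.StringIO(sample), skiprows=6, header=None,
                    names=names, skipfooter=3, engine="python")

print(list(plain.columns))
print(list(named.columns))
```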
