I have loaded an Excel file using pd.read_excel(). The dataframe has header data in the top 6 rows and footer data from the 213th row to the last row (the 232nd). I need to remove the header and footer and get a new dataframe. How can I do this using boolean masking?
2 Answers
Assume that the input file contains something like:
The title of the whole data set
=============================================
Date,Amount,Aaaa,Aaaa,Bbbb,Bbbb
, ,Start,Stop,Start,Stop
=============================================
2012-01-01,120.5,10,20,21,28
2012-01-02,130.9,12,24,25,30
2012-01-03,140.0,13,28,29,36
2012-01-04,150.7,15,32,31,44
2012-01-05,160.1,19,36,36,70
=============================================
Summary row - to be skipped
=============================================
Details:
- Rows 0 through 2 (zero-based numbers) - skip entirely.
- Rows 3 and 4 - actual column titles (MultiIndex).
- Row 5 - skip.
- Data rows (5 in my example).
- Last 3 rows - footer, to be skipped.
Note also that in the second title row the first 2 names are given as spaces. If they were left empty, read_csv would assign default names composed of Unnamed: plus a number, and it is better to avoid them.
To read this content, you can call:
import pandas as pd

df = pd.read_csv('Input.csv', skiprows=[0, 1, 2, 5],
                 header=[0, 1], skipfooter=3, engine='python')
How these parameters are interpreted:
- skiprows - skip rows with these indices.
- header - read column names from these lines.
- skipfooter - skip the 3 last lines.
- engine - you should specify python; otherwise read_csv attempts to use the default c engine, which does not support skipfooter, so it falls back to the python engine and issues a warning.
Note that line numbers passed as header indicate lines after skiprows (in the original input file their row indices are actually 3 and 4).
The result is:
         Date  Amount   Aaaa         Bbbb
                       Start  Stop  Start  Stop
0  2012-01-01   120.5     10    20     21    28
1  2012-01-02   130.9     12    24     25    30
2  2012-01-03   140.0     13    28     29    36
3  2012-01-04   150.7     15    32     31    44
4  2012-01-05   160.1     19    36     36    70
i.e. a MultiIndex on columns (2 levels), 5 data rows and no source footer.
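To try this without creating a file, here is a minimal sketch that feeds equivalent text to read_csv through io.StringIO. The in-memory text is my own stand-in for Input.csv, and I am assuming it has three lines before the column titles, as the row list above describes.

```python
import io
import pandas as pd

# Stand-in for Input.csv (assumed layout): 3 lines to skip, 2 title rows,
# 1 separator, 5 data rows and a 3-line footer.
sep = "=" * 45
text = "\n".join([
    sep,
    "The title of the whole data set",
    sep,
    "Date,Amount,Aaaa,Aaaa,Bbbb,Bbbb",
    " , ,Start,Stop,Start,Stop",
    sep,
    "2012-01-01,120.5,10,20,21,28",
    "2012-01-02,130.9,12,24,25,30",
    "2012-01-03,140.0,13,28,29,36",
    "2012-01-04,150.7,15,32,31,44",
    "2012-01-05,160.1,19,36,36,70",
    sep,
    "Summary row - to be skipped",
    sep,
])

# Same parameters as in the call above, reading from memory instead of a file.
df = pd.read_csv(io.StringIO(text), skiprows=[0, 1, 2, 5],
                 header=[0, 1], skipfooter=3, engine="python")

print(df.shape)            # rows x columns after dropping header and footer
print(df.columns.nlevels)  # number of levels in the column index
```

Selecting df[("Aaaa", "Start")] then gives the first Start column as a plain Series.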
Edit
Another option is to skip the 6 initial lines entirely. In this case:
- pass skiprows=6 to skip the 6 initial lines,
- pass header=None to prevent the first remaining line from being read as the column names row; the column names will then be set to consecutive numbers starting from 0,
- or pass the names parameter with the column names to be used (but this time only a "plain" index on columns can be specified).
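A rough sketch of both variants, again on an in-memory stand-in for the file (the text layout and the flat column names are my own assumptions, chosen only for illustration):

```python
import io
import pandas as pd

# Stand-in for the input file (assumed layout): 6 leading lines to skip,
# 5 data rows, 3 footer lines.
sep = "=" * 45
text = "\n".join([
    sep,
    "The title of the whole data set",
    sep,
    "Date,Amount,Aaaa,Aaaa,Bbbb,Bbbb",
    " , ,Start,Stop,Start,Stop",
    sep,
    "2012-01-01,120.5,10,20,21,28",
    "2012-01-02,130.9,12,24,25,30",
    "2012-01-03,140.0,13,28,29,36",
    "2012-01-04,150.7,15,32,31,44",
    "2012-01-05,160.1,19,36,36,70",
    sep,
    "Summary row - to be skipped",
    sep,
])

# Variant 1: no header row at all, so the columns are numbered 0..5.
df_numbered = pd.read_csv(io.StringIO(text), skiprows=6, header=None,
                          skipfooter=3, engine="python")

# Variant 2: supply flat column names (hypothetical names for this sketch).
names = ["Date", "Amount", "A_start", "A_stop", "B_start", "B_stop"]
df_named = pd.read_csv(io.StringIO(text), skiprows=6, names=names,
                       skipfooter=3, engine="python")

print(list(df_numbered.columns))
print(list(df_named.columns))
```

When names is passed and header is left out, read_csv does not treat any line as a title row, so the first data row is kept.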
Use the skiprows and skipfooter parameters, which let you specify the rows you want to remove from the top and bottom respectively.
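Since the question asks specifically about boolean masking, here is a minimal sketch for a frame that is already loaded. The frame below is a made-up stand-in with 232 rows, and I am assuming the row numbers in the question are 1-based positions in the dataframe, so the rows to keep are positions 6 through 211 (0-based).

```python
import numpy as np
import pandas as pd

# Stand-in for the frame returned by pd.read_excel(): 232 rows, as in the question.
df = pd.DataFrame({"value": range(232)})

pos = np.arange(len(df))          # 0-based row positions, independent of the index labels
mask = (pos >= 6) & (pos < 212)   # False for the first 6 rows and for rows 213..232 (1-based)

# Keep only the rows where the mask is True and renumber the index from 0.
body = df[mask].reset_index(drop=True)

print(len(body))
```

The same rows can be selected with df.iloc[6:212]; the mask is mainly useful if you want to combine it with other conditions.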