0

I have loaded an Excel file using pd.read_excel(). The dataframe has header data in the top 6 rows and footer data from the 213th row to the last row (the 232nd). I need to remove the header and footer and get a new dataframe. How can I do this using boolean masking?

2
  • read_excel has skiprows and skipfooter parameters that let you specify which rows to drop from the top and bottom, respectively. Commented May 6, 2020 at 5:57
  • Please convert your input (Excel) file to CSV version and add top / trailing lines from it (+ a few data rows), as text (not an image). The reason, especially for top rows, is that they may indicate whether to skip some of them or read as the MultiIndex on columns. Commented May 6, 2020 at 7:57

2 Answers

2

You can simply use the following code:

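# Boolean mask: keep rows with index labels 7..212 (adjust the bounds to your layout).
df = df[(df.index > 6) & (df.index < 213)]
# Shift the index so the remaining rows are numbered from 0.
df.index -= 7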
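
To make the masking idea easy to try, here is a minimal sketch on a synthetic frame. It assumes the rows in the question are counted from 1, so the header sits at index labels 0..5 and the footer at labels 212..231. The frame, the `value` column and the name `body` are made up for illustration, and the bounds differ from the snippet above because they follow that reading of the question; adjust them to your actual layout.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_excel():
# 232 rows with the default RangeIndex (labels 0..231).
df = pd.DataFrame({"value": np.arange(232)})

# Assumption: header rows are labels 0..5, footer rows are labels 212..231.
mask = (df.index >= 6) & (df.index < 212)

# Apply the boolean mask, then renumber the remaining rows from 0.
body = df[mask].reset_index(drop=True)

print(len(body))
print(body.head())
```

Using reset_index(drop=True) instead of subtracting a constant keeps the renumbering correct even if the bounds change later.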

1

Assume that the input file contains something like:

  The title of the whole data set

=============================================
Date,Amount,Aaaa,Aaaa,Bbbb,Bbbb
 , ,Start,Stop,Start,Stop
=============================================
2012-01-01,120.5,10,20,21,28
2012-01-02,130.9,12,24,25,30
2012-01-03,140.0,13,28,29,36
2012-01-04,150.7,15,32,31,44
2012-01-05,160.1,19,36,36,70
=============================================
Summary row - to be skipped
=============================================

Details:

  • Row 0 thru 2 (zero-based numbers) - skip entirely.
  • Row 3 and 4 - actual column titles (MultiIndex).
  • Row 5 - skip.
  • Data rows (in my example 5).
  • Last 3 rows - footer, to be skipped.

Note also that the first two names in the second title row are single spaces. If they were left empty, read_csv would assign default names composed of Unnamed: + a number, which are better avoided.

To read this content, you can call:

df = pd.read_csv('Input.csv', skiprows=[0, 1, 2, 5],
    header=[0, 1], skipfooter=3, engine='python')

How these parameters are interpreted:

  • skiprows - skip rows with these indices.
  • header - from these lines read column names.
  • skipfooter - skip 3 last lines.
  • engine - you should specify python; otherwise read_csv attempts to use the default c engine, which does not support skipfooter, so it switches to the python engine, issuing a warning.

Note that line numbers passed as header indicate lines after skiprows (in the original input file their row indices are actually 3 and 4).

The result is:

         Date Amount  Aaaa       Bbbb     
                     Start Stop Start Stop
0  2012-01-01  120.5    10   20    21   28
1  2012-01-02  130.9    12   24    25   30
2  2012-01-03  140.0    13   28    29   36
3  2012-01-04  150.7    15   32    31   44
4  2012-01-05  160.1    19   36    36   70

i.e. a MultiIndex on columns (2 levels), 5 data rows and no source footer.
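
For anyone who wants to try this without creating Input.csv, the following is a self-contained sketch that feeds the same sample text through io.StringIO. The name SAMPLE is only a placeholder for the file content; the read_csv arguments are the ones used above.

```python
import io
import pandas as pd

# The sample file content from this answer, held in memory instead of on disk.
SAMPLE = """\
  The title of the whole data set

=============================================
Date,Amount,Aaaa,Aaaa,Bbbb,Bbbb
 , ,Start,Stop,Start,Stop
=============================================
2012-01-01,120.5,10,20,21,28
2012-01-02,130.9,12,24,25,30
2012-01-03,140.0,13,28,29,36
2012-01-04,150.7,15,32,31,44
2012-01-05,160.1,19,36,36,70
=============================================
Summary row - to be skipped
=============================================
"""

# Skip raw lines 0, 1, 2 and 5, read the two remaining title lines
# as a two-level column header, and drop the last 3 lines.
df = pd.read_csv(io.StringIO(SAMPLE), skiprows=[0, 1, 2, 5],
                 header=[0, 1], skipfooter=3, engine='python')

print(df)
print(df.columns.nlevels)       # number of levels in the column index
print(df[('Aaaa', 'Start')])    # select one column by its two-level name
```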

Edit

Another option is to skip the 6 initial lines entirely. In this case:

  • pass skiprows=6 to skip 6 initial lines,
  • pass header=None to prevent the first remaining line from being read as the column names row; the column names are then set to consecutive numbers starting from 0,
  • or pass names parameter with column names to be used (but this time only "plain" index on columns can be specified).
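
A short sketch of both variants, again reading the sample text from memory. The flat column names in NAMES are invented for illustration (the source does not prescribe any), and skipfooter=3 with engine='python' is kept so that the footer is still dropped.

```python
import io
import pandas as pd

# Same sample content as in the example file of this answer.
SAMPLE = """\
  The title of the whole data set

=============================================
Date,Amount,Aaaa,Aaaa,Bbbb,Bbbb
 , ,Start,Stop,Start,Stop
=============================================
2012-01-01,120.5,10,20,21,28
2012-01-02,130.9,12,24,25,30
2012-01-03,140.0,13,28,29,36
2012-01-04,150.7,15,32,31,44
2012-01-05,160.1,19,36,36,70
=============================================
Summary row - to be skipped
=============================================
"""

# Variant 1: skip the 6 initial lines and let pandas number the columns.
numbered = pd.read_csv(io.StringIO(SAMPLE), skiprows=6, header=None,
                       skipfooter=3, engine='python')

# Variant 2: same, but supply plain (single-level) column names.
# These names are hypothetical; pick whatever suits your data.
NAMES = ['Date', 'Amount', 'Aaaa_Start', 'Aaaa_Stop', 'Bbbb_Start', 'Bbbb_Stop']
named = pd.read_csv(io.StringIO(SAMPLE), skiprows=6, header=None,
                    names=NAMES, skipfooter=3, engine='python')

print(numbered.columns.tolist())
print(named.columns.tolist())
```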
