Filtering Excel Document Data In Jupyter Notebook Using Pandas

Question

I have code that filters the data I want displayed from an Excel document, using Pandas in a Jupyter Notebook. It is for a UK RAF historic aircraft display team's 2009 appearance schedule.

Here is my Python Code :-

import pandas as pd

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

data = pd.read_excel(xls, sheet_name="Sheet1")

pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])

df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]

I am unsure what to type to filter the data so that only rows whose numerical value in the BID column is the same as the BID value in the next row are kept. In addition, among those rows with matching BID values, I only want the cases where one of the aircraft in the A/C column is DAK. Apart from that rule, I also want any row where the A/C column shows DHS. Could someone please tell me what I should add to my Python code to do this? It would be much appreciated.

Also, as an example, with the filtered data I would like :-

Output:

145     SCARBOROUGH     DAK     DISPLAY     2008-05-25 00:00:00     610
150     SCARBOROUGH     SPIT    DISPLAY     2008-05-25 00:00:00     610

Changed to showing the following, i.e. merging the two lines together :-

Output:

SCARBOROUGH     DS  DISPLAY     2008-05-25 00:00:00     610

And

Output:

173     TARRANT RUSHDEN     HS  DISPLAY     NaN     132
174     TARRANT RUSHDEN     DAK     DISPLAY     NaN     132

Changed to showing :-

Output:

TARRANT RUSHDEN     DHS     DISPLAY     NaN     132

I mean that this change should apply to all such occurrences, not just those two venues.

Here is a sample of my output data :-

Venue   A/C     DISPLAY/    Date    BID
25  SHUTTLEWORTH    DAK     DISPLAY     NaN     529
55  KEMBLE  DAK     DISPLAY     NaN     461
69  NORTHWICH   SPIT    DISPLAY     2008-05-10 00:00:00     514
72  POCKLINGTON     SPIT    DISPLAY     2009-05-10 00:00:00     821
75  BERLIN  DAK     DISPLAY     2008-05-12 00:00:00     587
78  MILDENHALL  SPIT    DISPLAY     2009-05-15 00:00:00     920
93  DUXFORD     HS  DISPLAY     NaN     611
103     CRANWELL    HS  DISPLAY     2008-05-20 00:00:00     44
145     SCARBOROUGH     DAK     DISPLAY     2008-05-25 00:00:00     610
150     SCARBOROUGH     SPIT    DISPLAY     2008-05-25 00:00:00     610
151     CORBRIDGE   SPIT    DISPLAY     NaN     353
167     BRIDGEND-CNX    SPIT    DISPLAY     2008-05-31 00:00:00     527
173     TARRANT RUSHDEN     HS  DISPLAY     NaN     132
174     TARRANT RUSHDEN     DAK     DISPLAY     NaN     132
179     NORTHOLT    SPIT    DISPLAY     2009-06-05 00:00:00     870
214     BRIZE NORTON    HS  DISPLAY     NaN     939
218     ROPLEY  HS  DISPLAY     2008-06-13 00:00:00     355
223     THWAITES    HS  DISPLAY     NaN     364
231     ROPLEY  HS  DISPLAY     NaN     355
240     COSFORD     HS  DISPLAY     2008-06-14 00:00:00     667
241     QUORN   HS  DISPLAY     NaN     314
244     COSFORD     DAK     DISPLAY     2008-06-14 00:00:00     NaN
260     REDHILL     SPIT    DISPLAY     NaN     686
269     KEMBLE  DAK     DISPLAY     NaN     316
270     KEMBLE  HS  DISPLAY     NaN     316
280     KEMBLE  SPIT    DISPLAY     2008-06-21 00:00:00     316
285     KEMBLE  DAK     DISPLAY     2008-06-21 00:00:00     316

Here is the Website Link, to the .xls i.e. Excel Document File :-

http://web.archive.org/web/20090804234934/http://www.raf.mod.uk/bbmf/rafcms/mediafiles/F0ED6EA8_1143_EC82_2E4534A1036AA506.xls

You will obviously need to change the following line in my Python code to match whatever you call the .xls file and the path where you save it on your computer :-

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

I have changed the end bit of the Code to :-

selected = df.loc[df['A/C'] == 'DS', 'DH', 'DHS']
groupby_venue_date = selected.groupby(['Venue', 'BID', 'DISPLAY/'])
aircraft = groupby_venue_date['A/C std'].apply(''.join).rename('Aircraft-combined')
print(aircraft.shape)
pd.DataFrame(aircraft)

But I get an "IndexingError: Too many indexers" message when I run the code. What does that mean, and what has caused the error, Bill?

This is the Code I am currently running as of 2nd January 2020 :-

import pandas as pd

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

data = pd.read_excel(xls, sheet_name="Sheet1")

pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])

#df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))] 
df["Date"].fillna("No Date", inplace = True)

df['A/C'].unique().tolist()

rename_map = {
    'DAK': 'D',
    'SPIT': 'S',
    'LANC': 'L',
    'HURRI': 'H',
    'PARA': 'P'
}
df['A/C std'] = df['A/C'].replace(rename_map)
print(df['A/C std'].unique().tolist())

#selected = df.loc[df['A/C'] == 'DS', 'DH', 'DHS']
selected = df.loc[df['DISPLAY/'] == 'DISPLAY']

groupby_venue_date = selected.groupby(['Venue', 'BID', 'Date', 'DISPLAY/']) 
aircraft = groupby_venue_date['A/C std'].apply(''.join).rename('Aircraft-combined')
print(aircraft.shape)
pd.DataFrame(aircraft)

Sounds like you want to remove some duplicates which are identical except in the 'A/C' column. Is that right? But what is the logic for the replacement values 'DS' and 'DHS' which appear in that column after the merge? — Bill
– Bill, Commented Jan 1, 2020 at 20:48
Also, could you provide a sample of the input data? Either part of the csv file or part of df maybe. Then we can run your script to see what is happening. — Bill
– Bill, Commented Jan 1, 2020 at 20:58
You're nearly right, Bill. I actually want to keep the duplicates which are identical except in the 'A/C' column. Answering your other point, DS stands for Dakota and Spitfire, and DHS stands for Dakota, Hurricane and Spitfire. — Edward Winch
– Edward Winch, Commented Jan 1, 2020 at 20:59
Shall I provide the Website Link, to the xls File, so it can be downloaded ? — Edward Winch
– Edward Winch, Commented Jan 1, 2020 at 21:01
Hi Bill, Here is the Website Link, to the .XLS File, i.e. Excel Document File :- web.archive.org/web/20090804234934/http://www.raf.mod.uk/bbmf/… — Edward Winch
– Edward Winch, Commented Jan 1, 2020 at 21:16

Bill · Accepted Answer · 2020-01-02 17:35:24Z

I'm not sure I understand exactly what you want to do but I'll try to help by providing some techniques that might help you figure it out.

For example, getting a list of the unique values for a column:

df['A/C'].unique().tolist()

[nan, 'L', 'S', 'H', 'LHS', 'LANC', 'DAK', 'SPIT', 'HS', 'HURRI', 'PARA', 'LSSD', 'LSS', 'SS', 'LH', 'DH', 'DHS', 'SSSHH']

Part of the problem appears to be dealing with these short-hand entries which are combinations of different aircraft. E.g. you said 'DHS' stands for Dakota, Spitfire, and Hurricane. It might be better to deal with these non-standard values first, before trying to merge the rows. One way is to map the long names to single-letter codes using a dictionary.

For example

rename_map = {
    'DAK': 'D',
    'SPIT': 'S',
    'LANC': 'L',
    'HURRI': 'H',
    'PARA': 'P'
}
df['A/C std'] = df['A/C'].replace(rename_map)
print(df['A/C std'].unique().tolist())

[nan, 'L', 'S', 'H', 'LHS', 'D', 'HS', 'P', 'LSSD', 'LSS', 'SS', 'LH', 'DH', 'DHS', 'SSSHH']

You can then do whatever it is you want. For example, select a sub-set of the data:

selected = df.loc[df['DISPLAY/'] == 'DISPLAY']
assert selected.shape == (202, 6)
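
If you want to select rows matching any of several codes, a minimal sketch (on a small made-up frame, not the real spreadsheet) is to build a boolean mask with Series.isin. Writing the values comma-separated inside df.loc[...], as in df.loc[df['A/C'] == 'DS', 'DH', 'DHS'], makes pandas treat each extra value as an indexer for another axis, which is what raises IndexingError: Too many indexers.

```python
import pandas as pd

# Toy data standing in for the real schedule (values are illustrative only)
toy = pd.DataFrame({
    'Venue': ['A', 'B', 'C', 'D'],
    'A/C': ['DS', 'SPIT', 'DHS', 'DH'],
})

# Keep rows whose code is any one of the wanted values
wanted = ['DS', 'DH', 'DHS']
selected_codes = toy.loc[toy['A/C'].isin(wanted)]
print(selected_codes['Venue'].tolist())
```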

And then group the rows by selected columns and join the aircraft codes within each group using the string-join method:

groupby_venue_date = selected.groupby(['Venue', 'Date'])
aircraft = groupby_venue_date['A/C std'].apply(''.join).rename('Aircraft-combined')
assert aircraft.index.duplicated().sum() == 0
print(aircraft.shape)
print(aircraft.head())

(89,)
Venue     Date      
AUDLEM    2008-07-26      S
AYLSHAM   2008-08-31    LHS
BEAULIEU  2008-05-25      H
BELTRING  2008-07-26      L
BENSON    2008-08-27    LHS
Name: Aircraft-combined, dtype: object

Some of the values have been joined:

print(aircraft.unique().tolist())
['S', 'LHS', 'H', 'L', 'D', 'HS', 'HSD', 'SLH', 'DHS', 'SD', 'SSSHH', 'LH', 'DS', 'DH', 'HSL']
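
One thing to watch with a groupby like the one above: by default pandas drops groups whose key is NaN, so rows with a missing Date would not appear in the result. The sketch below uses a few rows copied from the sample in the question (with the codes already standardised) and groups on Venue and BID instead, keeping the first Date seen. Whether BID is the right key for the full spreadsheet is an assumption, and the names toy, combined and with_dakota are made up for illustration.

```python
import pandas as pd

# A few rows taken from the sample in the question, codes already standardised
toy = pd.DataFrame({
    'Venue': ['SCARBOROUGH', 'SCARBOROUGH', 'TARRANT RUSHDEN',
              'TARRANT RUSHDEN', 'DUXFORD'],
    'A/C std': ['D', 'S', 'HS', 'D', 'HS'],
    'Date': pd.to_datetime(['2008-05-25', '2008-05-25', None, None, None]),
    'BID': [610, 610, 132, 132, 611],
})

# Group on columns that have no missing values, join the codes,
# and carry the first available date along
combined = (
    toy.groupby(['Venue', 'BID'], sort=False)
       .agg(aircraft=('A/C std', ''.join), date=('Date', 'first'))
       .reset_index()
)

# Keep only the venues where a Dakota ('D') takes part
with_dakota = combined[combined['aircraft'].str.contains('D')]
print(with_dakota)
```

The joined codes come out in row order (for example 'HSD' rather than 'DHS'), so a reordering step is still needed if a fixed letter order matters.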

UPDATE

You can do other operations on these codes by making functions and using the apply method.

For example, sorting the string, or removing duplicated characters (note that a set is unordered, so the order of the remaining characters is not guaranteed).

def sorted_string(s):
    return ''.join(sorted(s))

def remove_duplicate_chars(s):
    return ''.join(set(s))

aircraft = aircraft.apply(remove_duplicate_chars)
print(aircraft.unique().tolist())

['S', 'LHS', 'H', 'L', 'D', 'HS', 'DHS', 'DS', 'LH', 'DH']
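
Because set ordering is arbitrary, a variant that produces the same code on every run may be useful. The sketch below assumes a canonical letter order of L, D, H, S, P. This is a hypothetical choice, picked so that results look like 'DHS' and 'LHS', and both CANONICAL_ORDER and canonical_code are names made up for this example.

```python
# Hypothetical canonical order for the one-letter aircraft codes
CANONICAL_ORDER = 'LDHSP'

def canonical_code(s):
    """Drop repeated letters and arrange the rest in CANONICAL_ORDER.

    Letters that are not listed in CANONICAL_ORDER are discarded.
    """
    present = set(s)
    return ''.join(ch for ch in CANONICAL_ORDER if ch in present)

print(canonical_code('HSD'))    # DHS
print(canonical_code('SSSHH'))  # HS
print(canonical_code('SLH'))    # LHS
```

It can be applied in the same way as the functions above, e.g. aircraft = aircraft.apply(canonical_code).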

edited Jan 2, 2020 at 17:35

answered Jan 1, 2020 at 22:12

Bill


19 Comments

Edward Winch Over a year ago

Thank you so much, Bill, for all your help this evening. Just wondering, what do the following lines of code mean: assert selected.shape == (202, 6) and assert aircraft.index.duplicated().sum() == 0?

Edward Winch Over a year ago

And what would I need to type so that all the DHS, DS and DH entries are displayed? Is it possible to have the displayed data changed to the latest version, with the original font, instead of Boolean-type text?

Bill Over a year ago

You can ignore those. They are just checks to make sure the number of rows and columns is correct, and to demonstrate that after the groupby there are no duplicates left. You don't need them.

Edward Winch Over a year ago

I.e. showing, like when you run the original Python Code, in Jupyter Notebook, only with the current changes ?

Bill Over a year ago

I think you're referring to the print(aircraft.head()) statement. aircraft is a pd.Series, not a DataFrame. That is why it displays like that. To see the 'pretty' version in a Jupyter notebook, use pd.DataFrame(aircraft).
