0

I have a code Filtering Data, that I wan't displayed from an Excel Document using Pandas in Jupyter Notebook. It is for a UK RAF Historic Aircraft Display Team, Year 2009 Appearance Schedule.

Here is my Python Code :-

import pandas as pd

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

data = pd.read_excel(xls, sheet_name="Sheet1")

pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])

df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))]

I am unsure what to type, to filter the Data, for when the Numerical Value in the BID Column, is the same in the BID column in the next row. And also in addition, only when one of the Aircraft in the A/C Column, where both Numerical Values in the BID Column below and above are the same, is DAK, and excluding that principle, only if in a row for the A/C Column shows DHS Could someone please tell me, what I should add to my Python Code to enable this, it would be much appreciated if someone could.

Also for example with the filtered data for example, I would like :-

Output:

145     SCARBOROUGH     DAK     DISPLAY     2008-05-25 00:00:00     610
150     SCARBOROUGH     SPIT    DISPLAY     2008-05-25 00:00:00     610

Changed to showing the following, i.e. merging the two lines together :-

Output:

SCARBOROUGH     DS  DISPLAY     2008-05-25 00:00:00     610

And

Output:

173     TARRANT RUSHDEN     HS  DISPLAY     NaN     132
174     TARRANT RUSHDEN     DAK     DISPLAY     NaN     132

Changed to showing :-

Output:

TARRANT RUSHDEN     DHS     DISPLAY     NaN     132

I mean changed to showing, for all those occurrences,

Not just for those two Venues.

Here is an Sample Of My Output Data :-

Venue   A/C     DISPLAY/    Date    BID
25  SHUTTLEWORTH    DAK     DISPLAY     NaN     529
55  KEMBLE  DAK     DISPLAY     NaN     461
69  NORTHWICH   SPIT    DISPLAY     2008-05-10 00:00:00     514
72  POCKLINGTON     SPIT    DISPLAY     2009-05-10 00:00:00     821
75  BERLIN  DAK     DISPLAY     2008-05-12 00:00:00     587
78  MILDENHALL  SPIT    DISPLAY     2009-05-15 00:00:00     920
93  DUXFORD     HS  DISPLAY     NaN     611
103     CRANWELL    HS  DISPLAY     2008-05-20 00:00:00     44
145     SCARBOROUGH     DAK     DISPLAY     2008-05-25 00:00:00     610
150     SCARBOROUGH     SPIT    DISPLAY     2008-05-25 00:00:00     610
151     CORBRIDGE   SPIT    DISPLAY     NaN     353
167     BRIDGEND-CNX    SPIT    DISPLAY     2008-05-31 00:00:00     527
173     TARRANT RUSHDEN     HS  DISPLAY     NaN     132
174     TARRANT RUSHDEN     DAK     DISPLAY     NaN     132
179     NORTHOLT    SPIT    DISPLAY     2009-06-05 00:00:00     870
214     BRIZE NORTON    HS  DISPLAY     NaN     939
218     ROPLEY  HS  DISPLAY     2008-06-13 00:00:00     355
223     THWAITES    HS  DISPLAY     NaN     364
231     ROPLEY  HS  DISPLAY     NaN     355
240     COSFORD     HS  DISPLAY     2008-06-14 00:00:00     667
241     QUORN   HS  DISPLAY     NaN     314
244     COSFORD     DAK     DISPLAY     2008-06-14 00:00:00     NaN
260     REDHILL     SPIT    DISPLAY     NaN     686
269     KEMBLE  DAK     DISPLAY     NaN     316
270     KEMBLE  HS  DISPLAY     NaN     316
280     KEMBLE  SPIT    DISPLAY     2008-06-21 00:00:00     316
285     KEMBLE  DAK     DISPLAY     2008-06-21 00:00:00     316

Here is the Website Link, to the .xls i.e. Excel Document File :-

http://web.archive.org/web/20090804234934/http://www.raf.mod.uk/bbmf/rafcms/mediafiles/F0ED6EA8_1143_EC82_2E4534A1036AA506.xls

You will obviously need to change the following in my Python Code, to whatever you call the .xls File. And the path, of where you save it, on your Computer :-

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

I have changed the end bit of the Code to :-

selected = df.loc[df['A/C'] == 'DS', 'DH', 'DHS']
groupby_venue_date = selected.groupby(['Venue', 'BID', 'DISPLAY/'])
aircraft = groupby_venue_date['A/C std'].apply(''.join).rename('Aircraft-combined')
print(aircraft.shape)
pd.DataFrame(aircraft)

But get a :- IndexingError: Too many indexers message, when I run the Code, what does that mean ? And what has caused the Error Bill ?

This is the Code I am currently running as of 2nd January 2020 :-

import pandas as pd

xls = pd.ExcelFile(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF Display Schedule 2009.xls')

data = pd.read_excel(xls, sheet_name="Sheet1")

pd.options.display.max_rows = 1000

df = pd.DataFrame(data, columns= ['Venue','A/C','DISPLAY/','Date','BID'])

#df[(df['Venue'].str.contains('[a-zA-Z]') & (df['DISPLAY/'].str.contains('DISPLAY') & df['A/C'].str.contains("DHS|DAK|HS|SPIT")) & (df['A/C'] != 'LHS') & (df['A/C'] != 'LANC'))] 
df["Date"].fillna("No Date", inplace = True)

df['A/C'].unique().tolist()

rename_map = {
'DAK': 'D',
'SPIT': 'S',
'LANC': 'L',
'HURRI': 'H',
'PARA': 'P'
}
df['A/C std'] = df['A/C'].replace(rename_map)
print(df['A/C std'].unique().tolist())

#selected = df.loc[df['A/C'] == 'DS', 'DH', 'DHS']
selected = df.loc[df['DISPLAY/'] == 'DISPLAY']

groupby_venue_date = selected.groupby(['Venue', 'BID', 'Date', 'DISPLAY/']) 
aircraft = groupby_venue_date['A/C std'].apply(''.join).rename('Aircraft-combined')
print(aircraft.shape)
pd.DataFrame(aircraft)
10
  • Sounds like you want to remove some duplicates which are identical except in the 'A/C' column. Is that right? But what is the logic for the replacement values 'DS' and 'DHS' which appear in that column after the merge? Commented Jan 1, 2020 at 20:48
  • Also, could you provide a sample of the input data? Either part of the csv file or part of df maybe. Then we can run your script to see what is happening. Commented Jan 1, 2020 at 20:58
  • Your nearly right Bill, I actually want to keep the duplicates which are identical, except in the 'A/C' Column. Answering your other point, DS stands for Dakota and Spitfire. And DHS stands for Dakota Spitfire and Hurricane. Commented Jan 1, 2020 at 20:59
  • Shall I provide the Website Link, to the xls File, so it can be downloaded ? Commented Jan 1, 2020 at 21:01
  • 1
    Hi Bill, Here is the Website Link, to the .XLS File, i.e. Excel Document File :- web.archive.org/web/20090804234934/http://www.raf.mod.uk/bbmf/… Commented Jan 1, 2020 at 21:16

1 Answer 1

0

I'm not sure I understand exactly what you want to do but I'll try to help by providing some techniques that might help you figure it out.

For example, getting a list of the unique values for a column:

df['A/C'].unique().tolist()

[nan, 'L', 'S', 'H', 'LHS', 'LANC', 'DAK', 'SPIT', 'HS', 'HURRI', 'PARA', 'LSSD', 'LSS', 'SS', 'LH', 'DH', 'DHS', 'SSSHH']

Part of the problem appears to be dealing with these short-hand entries which are combinations of different aircraft. E.g. you said 'DHS' stands for Dakota, Spitfire, and Hurricane. It might be better to deal these non-standard values first before trying to merge the rows. One way is to replace all non-standard values using a dictionary.

For example

rename_map = {
    'DAK': 'D',
    'SPIT': 'S',
    'LANC': 'L',
    'HURRI': 'H',
    'PARA': 'P'
}
df['A/C std'] = df['A/C'].replace(rename_map)
print(df['A/C std'].unique().tolist())

[nan, 'L', 'S', 'H', 'LHS', 'D', 'HS', 'P', 'LSSD', 'LSS', 'SS', 'LH', 'DH', 'DHS', 'SSSHH']

You can then do whatever it is you want. For example, select a sub-set of the data:

selected = df.loc[df['DISPLAY/'] == 'DISPLAY']
assert selected.shape == (202, 6)

And then group rows by selected columns and joining the aircraft codes using the string-join method:

groupby_venue_date = selected.groupby(['Venue', 'Date'])
aircraft = groupby_venue_date['A/C std'].apply(''.join).rename('Aircraft-combined')
assert aircraft.index.duplicated().sum() == 0
print(aircraft.shape)
print(aircraft.head())

(89,)
Venue     Date      
AUDLEM    2008-07-26      S
AYLSHAM   2008-08-31    LHS
BEAULIEU  2008-05-25      H
BELTRING  2008-07-26      L
BENSON    2008-08-27    LHS
Name: Aircraft-combined, dtype: object

Some of the values have been joined:

print(aircraft.unique().tolist())
['S', 'LHS', 'H', 'L', 'D', 'HS', 'HSD', 'SLH', 'DHS', 'SD', 'SSSHH', 'LH', 'DS', 'DH', 'HSL']

UPDATE

You can do other operations on these codes by making functions and using the apply method.

For example sorting the string, or removing duplicated characters (which also happens to sort them).

def sorted_string(s):
    return ''.join(sorted(s))

def remove_duplicate_chars(s):
    return ''.join(set(s))

aircraft = aircraft.apply(remove_duplicate_chars)
print(aircraft.unique().tolist())

['S', 'LHS', 'H', 'L', 'D', 'HS', 'DHS', 'DS', 'LH', 'DH']
Sign up to request clarification or add additional context in comments.

19 Comments

Thankyou so much Bill, for all your help this evening. Just wondering what do the following lines of Code mean ? :- assert selected.shape == (202, 6) + assert aircraft.index.duplicated().sum() == 0 ?
And what would I need to type, so that all, the DHS, DS, DH, are displayed ? Is it possible to have the Data displayed changed to the latest, with the original Font, instead of Boolean Type Text ?
You can ignore those. Just checks to make sure the number of rows and columns is correct and to demonstrate that after the groupby there are no duplicates left. You don't need them.
I.e. showing, like when you run the original Python Code, in Jupyter Notebook, only with the current changes ?
I think you're referring to the print(aircraft.head()) statement. aircraft is pd.Series not a DataFrame. That is why it displays like that. To see the 'pretty' version in a Jupyter notebook use pd.DataFrame(aircraft)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.