I have a CSV dataset that I need to filter by some conditions. The problem is that a condition can be true on multiple days, and I want to keep only the last row for which it is true.

My dataset looks like this:

Date           City        Summary       FlightNo.   Terminal     Company
2-18-2019       NY        Airplane Land      23          7         Delta 
2-18-2019     London     Cargo handling      4           5         British
2-18-2019      Dubai     Airplane land       92          7         Emirates
2-19-2019      Dubai     Airplane stay       92          5         Emirates
2-19-2019      Paris     Flight cancel       78          2         British
2-19-2019     London     Airplane Land       4           5         British
2-19-2019       LA       Airplane Land       7           2         United
2-20-2019      Dubai     Airplane land       92          3         Emirates
2-20-2019       LA       Airplane land       29          3         Delta
2-20-2019       NY       Airplane left       23          1         Delta
2-21-2019      Paris     Airplane reschedu   78          2         British
2-21-2019      London    Airplane land       4           3         British
2-21-2019       LA    Airplane from NY land  29          5         Delta
~~~
3-10-2019      London    Airplane land       5           5         KLM
3-10-2019      Paris     Airplane Land       78          7         AirFrance
3-10-2019       LA       Reschedule          29          4         United
3-11-2019       NY       Cargo handled       23          7         Delta
3-11-2019      Dubai     Arrived be4 2 day   34          7         Etihad
~~~
3-21-2019      Dubai      Airplane land      92          5         Emirates
3-21-2019     New Delhi   Reschedule         9           4         AirAsia
3-21-2019      London     Cargo handling     5           2         Lufthansa
3-22-2019     New Delhi   Airplane Land      9           3         AirAsia
3-22-2019       NY        Reschedule         23          2         United
3-22-2019      Dubai      Airplane land      35          1         Emirates

So the code should return the last plane-landing entry among rows where City, Flight No. and Company all match. As you can see, this condition can be true on multiple days. So if all three columns match and Summary contains "Airplane land", return the last matching entry.

Edited: the desired output should look like the dataset below:

Date           City        Summary       FlightNo.   Terminal     Company
2-18-2019       NY       Airplane Land       23          7         Delta 
2-19-2019       LA       Airplane Land       7           2         United
2-20-2019      Dubai     Airplane land       92          3         Emirates
2-21-2019      London    Airplane Land       4           3         British
2-21-2019       LA    Airplane from NY land  29          5         Delta
~~~
3-10-2019      London    Airplane land       5           5         KLM
3-10-2019      Paris     Airplane Land       78          7         AirFrance
~~~
3-21-2019      Dubai      Airplane land      92          5         Emirates
3-22-2019     New Delhi   Airplane Land      9           3         AirAsia
3-22-2019      Dubai      Airplane land      35          1         Emirates

As shown above, a row is deleted only when all three columns (City, FlightNo., and Company) are the same; if any of them differs, both rows should be kept.

The logic:

Condition 1: if df[Summary] contains "Airplane" and "land", keep the row.

Condition 2: from the already filtered dataset, if df[City] == df[City] and df[FlightNo.] == df[FlightNo.] and df[Company] == df[Company] holds within 3 days, keep either the last or the first row. So if the filter returns rows with an airplane landing in the same city, with the same flight number, run by the same company on the 18th and the 20th, only one of those rows should be kept. But if the rows fall on the 1st and the 15th of the same month, keep both.

Please help me find a way to apply all conditions and keep the last true entry.

EDIT:

Keep the first row if the conditions are true in the next 3 days.

Input:

print (df)
     Date      City Code      Summary      Flight No.   Company
0   2-18-2019    021        Airplane land      23       Emirates
1   2-18-2019    013        Airplane land      23       Etihad
2   2-19-2019    021        Airplane land      23       Emirates
3   2-19-2019    013        Airplane Land      23       Etihad
4   2-20-2019    021        Airplane land      23       Emirates
5   2-20-2019    055        Airplane land      23       Emirates
6   2-20-2019    013        Airplane land      23       Etihad
7   2-21-2019    021        Airplane land      23       Emirates
8   2-21-2019    013        Airplane land      78       Emirates
9   2-21-2019    055  Airplane from NY land    23       Emirates
10  2-22-2019    021        Airplane land      78       Emirates
11  2-22-2019    013        Airplane Land      78       Emirates
12  2-22-2019    055        Airplane land      78       Emirates
13  2-23-2019    021        Airplane land      78       Etihad

Output:

print (df)
         Date      City Code      Summary      Flight No.   Company
    0   2-18-2019    021        Airplane land      23       Emirates
    1   2-18-2019    013        Airplane land      23       Etihad
    5   2-20-2019    055        Airplane land      23       Emirates
    7   2-21-2019    021        Airplane land      23       Emirates
    8   2-21-2019    013        Airplane land      78       Emirates
    10  2-22-2019    021        Airplane land      78       Emirates
    12  2-22-2019    055        Airplane land      78       Emirates

1 Answer


I think you need:

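import pandas as pd

# convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])

# sort by the group keys, then by date
df = df.sort_values(['City Code', 'Flight No.', 'Company', 'Date'])

# keep rows whose Summary contains "Airplane " and "land" (the "land" match is case-insensitive)
df = df[df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land', case=False)].copy()

# first date per group
s = df.groupby(['City Code', 'Flight No.', 'Company'])['Date'].transform('first')
# days elapsed since the first date of each group
df['diff'] = df['Date'].sub(s).dt.days.fillna(0)
# label consecutive 3-day windows counted from the first date
df['g'] = df['diff'] // 3
# keep only the first row of each 3-day window per group
df = df[~df.duplicated(['City Code', 'Flight No.', 'Company', 'g'])]
# drop the helper columns and restore the original row order
df = df.drop(columns=['diff', 'g']).sort_index()
print (df)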
         Date City Code        Summary  Flight No.   Company
0  2019-02-18       021  Airplane land          23  Emirates
1  2019-02-18       013  Airplane land          23    Etihad
5  2019-02-20       055  Airplane land          23  Emirates
7  2019-02-21       021  Airplane land          23  Emirates
8  2019-02-21       013  Airplane land          78  Emirates
10 2019-02-22       021  Airplane land          78  Emirates
12 2019-02-22       055  Airplane land          78  Emirates
13 2019-02-23       021  Airplane land          78    Etihad
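
The question originally asked to keep the last matching row rather than the first. Below is a minimal sketch of that variant, assuming the same fixed 3-day windows counted from each group's first date. The sample frame and the helper name `keep_last_per_window` are hypothetical and only serve to illustrate the idea:

```python
import pandas as pd

# Hypothetical sample: one city/flight/company seen on the 9th, 10th, 11th and 25th.
df = pd.DataFrame({
    'Date': ['3-9-2019', '3-10-2019', '3-11-2019', '3-25-2019'],
    'City Code': ['021'] * 4,
    'Summary': ['Airplane land', 'Airplane Land',
                'Airplane from NY land', 'Airplane land'],
    'Flight No.': [23] * 4,
    'Company': ['Emirates'] * 4,
})

keys = ['City Code', 'Flight No.', 'Company']


def keep_last_per_window(df, keys, days=3):
    """Keep the last landing row of each `days`-long window per group (sketch)."""
    out = df.copy()
    out['Date'] = pd.to_datetime(out['Date'], format='%m-%d-%Y')
    # Condition 1: Summary mentions both "airplane" and "land", ignoring case.
    mask = (out['Summary'].str.contains('airplane', case=False)
            & out['Summary'].str.contains('land', case=False))
    out = out[mask].sort_values(keys + ['Date'])
    # Condition 2: number the windows by days elapsed since the group's first date.
    first = out.groupby(keys)['Date'].transform('first')
    window = out['Date'].sub(first).dt.days // days
    # keep='last' marks every row except the last one of each (group, window).
    dup = out.assign(_w=window).duplicated(keys + ['_w'], keep='last')
    return out[~dup].sort_index()


result = keep_last_per_window(df, keys)
print(result)
```

With this sample only the rows for the 11th and the 25th remain. Passing `keep='first'` to `duplicated` instead would give the first row of each window, which is what the answer's code does.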

43 Comments

Thank you for helping, but I tried it and it filtered out half of the dataset, which is not what I want. If City == City and No. == No. and Summary contains Airplane & Land, then keep the last within two days. So if the condition is true on the 9th, 10th, 11th and 25th, it should return the rows from the 11th and the 25th. I only need to filter out repeats within 3 days; if the gap is longer than that, keep the row.
@SMO - can you test now?
No, a city code can appear only once per day.
The difference can be more than one day, so a city code can appear every day, every other day, or even once a year.
This one works. However, the previous one that contained def f(x): return (x.diff().dt.days.fillna(0).cumsum() // 3).duplicated() also worked, and it made more sense to me. I'm really thankful for your time and effort.
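
The earlier revision mentioned in the last comment is not shown here, so the following is only a sketch of how the quoted `f` could be applied per group. The sample frame is a hypothetical subset of the EDIT data, and stitching the per-group masks together with `pd.concat` is an assumption, not necessarily what that revision did. Since the cumulative sum of consecutive day differences equals the number of days since the group's first date, `f` should label the same 3-day windows as the `diff` / `g` columns in the answer:

```python
import pandas as pd

# Hypothetical subset of the sample data from the question's EDIT.
df = pd.DataFrame({
    'Date': pd.to_datetime(
        ['2-18-2019', '2-19-2019', '2-20-2019', '2-21-2019',
         '2-20-2019', '2-21-2019'],
        format='%m-%d-%Y'),
    'City Code': ['021', '021', '021', '021', '055', '055'],
    'Flight No.': [23] * 6,
    'Company': ['Emirates'] * 6,
})
keys = ['City Code', 'Flight No.', 'Company']


def f(x):
    # x holds the sorted dates of one group; True marks a repeat
    # that falls inside an already seen 3-day window.
    return (x.diff().dt.days.fillna(0).cumsum() // 3).duplicated()


df = df.sort_values(keys + ['Date'])
# Apply f to each group's dates and concatenate the boolean masks
# (assumption: roughly how the earlier revision used f).
dup = pd.concat([f(dates) for _, dates in df.groupby(keys)['Date']])
# Align the mask with the frame's index before filtering.
result = df[~dup.reindex(df.index)]
print(result)
```

Here rows 0 and 3 survive for city code 021 (the 18th and the 21st) and row 4 for city code 055 (the 20th).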