I'm trying to implement a condition where, if the count of incorrect values on a given date is greater than 2 (2019-05-17 and 2019-05-20 in the example below), the complete date (all of its time blocks) is removed.

Input

                    t_value C/IC
2019-05-17 00:00:00   0     incorrect
2019-05-17 01:00:00   0     incorrect 
2019-05-17 02:00:00   0     incorrect 
2019-05-17 03:00:00   4     correct
2019-05-17 04:00:00   5     correct 
2019-05-18 01:00:00   0     incorrect   
2019-05-18 02:00:00   6     correct  
2019-05-18 03:00:00   7     correct 
2019-05-19 04:00:00   0     incorrect
2019-05-19 09:00:00   0    incorrect 
2019-05-19 11:00:00   8    correct
2019-05-20 07:00:00   2    correct
2019-05-20 08:00:00   0    incorrect
2019-05-20 09:00:00   0    incorrect
2019-05-20 07:00:00   0    incorrect 

Desired Output

                    t_value C/IC 
2019-05-18 01:00:00   0     incorrect   
2019-05-18 02:00:00   6     correct  
2019-05-18 03:00:00   7     correct 
2019-05-19 04:00:00   0     incorrect
2019-05-19 09:00:00   0    incorrect 
2019-05-19 11:00:00   8    correct

I'm not sure which time-based operation to perform to get the desired result. Thanks.

  • Seems like all you need is the records with a datetime between 2019-05-17 04:00:00 and 2019-05-19 11:00:00. pandas.Timestamp lets you compare dates with the simple >, <, and == operators. Commented May 18, 2020 at 4:38
  • Yes, in this example. But overall, I'm concerned with removing the date where the corresponding count of incorrect values is greater than 2. Commented May 18, 2020 at 4:46

2 Answers

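import numpy as np
import pandas as pd
from io import StringIO

#read in data (`data` is assumed to hold the question's input table as a string)
df = pd.read_csv(StringIO(data),sep=r'\s{2,}', engine='python')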

#give index a name 
df.index.name = 'Date'
#convert to datetime 
#and sort index
#usually safer to sort datetime index in Pandas
df.index = pd.to_datetime(df.index)
df = df.sort_index()

res = (df
       #group by date and c/ic
       .groupby([pd.Grouper(freq='1D',level='Date'),"C/IC"])
       .size()
       #get rows greater than 2 and incorrect
       .loc[lambda x: x>2,"incorrect"]
       #keep only the date index
       .droplevel(-1)
       .index
       #datetime information trapped here
       #and due to grouping, it is different from initial datetime
       #as such, we convert to string 
       #and build another batch of dates
       .astype(str)
       .tolist()
      )

res
['2019-05-17', '2019-05-20']

#build a numpy array of dates
idx = np.array(res, dtype='datetime64')

#exclude dates in idx and get final value
#aim is to get dates, irrespective of time

df.loc[~np.isin(df.index.date,idx)]

                     t_value    C/IC
Date        
2019-05-18 01:00:00     0   incorrect
2019-05-18 02:00:00     6   correct
2019-05-18 03:00:00     7   correct
2019-05-19 04:00:00     0   incorrect
2019-05-19 09:00:00     0   incorrect
2019-05-19 11:00:00     8   correct
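
As a hedged alternative to the steps above, the same filter can be sketched with groupby(...).transform, which broadcasts each day's count of incorrect rows back onto every row, so no intermediate list of dates is needed. The frame is rebuilt by hand from the question's input; deriving the label from t_value == 0 is an assumption that happens to match the posted table, and the names is_bad, bad_per_day and kept are illustrative only.

```python
import pandas as pd

# Rebuild the question's input by hand (timestamps copied as posted,
# including the repeated 2019-05-20 07:00:00 row).
stamps = [
    "2019-05-17 00:00:00", "2019-05-17 01:00:00", "2019-05-17 02:00:00",
    "2019-05-17 03:00:00", "2019-05-17 04:00:00",
    "2019-05-18 01:00:00", "2019-05-18 02:00:00", "2019-05-18 03:00:00",
    "2019-05-19 04:00:00", "2019-05-19 09:00:00", "2019-05-19 11:00:00",
    "2019-05-20 07:00:00", "2019-05-20 08:00:00", "2019-05-20 09:00:00",
    "2019-05-20 07:00:00",
]
t_value = [0, 0, 0, 4, 5, 0, 6, 7, 0, 0, 8, 2, 0, 0, 0]
df = pd.DataFrame(
    {
        "t_value": t_value,
        # Assumption: in the posted table a row is "incorrect" exactly when t_value is 0.
        "C/IC": ["incorrect" if v == 0 else "correct" for v in t_value],
    },
    index=pd.to_datetime(stamps),
)

# 1 for an incorrect row, 0 otherwise.
is_bad = df["C/IC"].eq("incorrect").astype(int)

# normalize() sets every timestamp to midnight, so it works as a per-day key.
# transform("sum") returns one value per original row: that day's incorrect count.
bad_per_day = is_bad.groupby(df.index.normalize()).transform("sum")

# Keep only rows whose day has at most 2 incorrect values.
# to_numpy() makes the mask positional, which sidesteps the duplicated timestamp.
kept = df[(bad_per_day <= 2).to_numpy()]
print(kept)
```

Because the mask is computed per row, every time block of an offending date is dropped together, which is the behaviour the question asks for.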

Misunderstood the question, sorry.

Updated answer: you can find the dates to be removed by the following:

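# assumes df has a DatetimeIndex (a DatetimeIndex exposes .date directly, without .dt)
df['_date'] = df.index.date
incorrect_df = df[df['C/IC'] == 'incorrect']
# count incorrect rows per date; the dates end up in the result's index
incorrect_count = incorrect_df.groupby('_date')['C/IC'].count()
dates_to_remove = set(incorrect_count[incorrect_count > 2].index)
    # using set to make the later step more efficient if the df is long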

Then mask the dataframe accordingly:

mask = [x not in dates_to_remove for x in df['_date']]
res = df[mask]
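
The list comprehension in the mask can also be replaced by a vectorized membership test. The sketch below is one way to do that, assuming df has a DatetimeIndex; the tiny frame is made up for illustration and is not the question's data, and the names days, counts and res are placeholders.

```python
import pandas as pd

# Hypothetical miniature frame: 2019-05-17 has three incorrect rows,
# 2019-05-18 has one.
df = pd.DataFrame(
    {
        "t_value": [0, 0, 0, 4, 0, 6],
        "C/IC": ["incorrect", "incorrect", "incorrect", "correct",
                 "incorrect", "correct"],
    },
    index=pd.to_datetime([
        "2019-05-17 00:00", "2019-05-17 01:00", "2019-05-17 02:00",
        "2019-05-17 03:00", "2019-05-18 01:00", "2019-05-18 02:00",
    ]),
)

# Midnight of each row's day, used as the per-day key.
days = df.index.normalize()

# Days of the incorrect rows only (one entry per incorrect row).
incorrect_days = days[(df["C/IC"] == "incorrect").to_numpy()]

# Number of incorrect rows per day, then the days exceeding the threshold.
counts = incorrect_days.value_counts()
dates_to_remove = counts[counts > 2].index

# Drop every row whose day is in dates_to_remove.
res = df[~days.isin(dates_to_remove)]
print(res)
```

This variant avoids the helper _date column, so nothing has to be dropped from the result afterwards.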

2 Comments

Thanks for responding. I don't think this would remove the date with all the time blocks.
Yeah, sorry, I missed that. You can use df.index.date first to take only the dates and save them to a separate column. The answer is now updated.
