I have two datasets, which I want to first join and then filter based on conditions, for each group (id, code). For each group, if the end_date is earlier than date, I want to simply filter for the latest end_date. Otherwise, I want to filter for the rows where date is between two columns, start_date and end_date.
I have coded this using dplyr and it works - see below.
left_join(df, df_match, by='id') %>%
group_by(id, code) %>%
mutate(is.max = max(end_date) < date) %>%
filter(case_when(
is.max == T ~ end_date == max(end_date),
is.max == F ~ date >= start_date & date <= end_date
))
However, this code is very slow for my 1+ million row datasets. I am curious if it's possible to achieve the same thing using data.table, which is usually much faster?