
Updated to a more realistic example; this time with duplicates in interp_b.

I am trying to populate a field in one dataframe (interp_b) using the values from a second dataframe (bait). I want to look at each row's obs_datetime in interp_b, and determine when that plot-station-year was last baited, prior to the obs_datetime. This will later be used to calculate a time-since-bait for each obs_datetime. Bait times are in the bait dataframe in column bait_datetime. The results should go in a field called latestbait_datetime in the interp_b dataframe.

I was envisioning an iterative process in which interp_b's latestbait_datetime keeps getting recalculated until the last row of the bait dataframe is reached. The for-loop I tried clearly runs through the rows and does the specified calculations, but I can't get the output into the format I want: it produces output for each loop rather than rewriting and updating the interp_b dataframe.

Here is some code to build the two dataframes, interp_b and bait (please excuse the inelegance):

# interp_b dataframe----

   structure(list(plot_station_year = c("Cow_C2_2019", "RidingStable_C3_2018", 
"RidingStable_C3_2018", "Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600, 
1544954400, 1541084400, 1515160800, 1567756800), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct", 
"POSIXt"))), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))


# bait dataframe----

    structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", 
"RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400, 
1559746800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -4L), spec = structure(list(
    cols = list(plot_station_year = structure(list(), class = c("collector_character", 
    "collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))


The desired result would fill latestbait_datetime with the most recent bait_datetime that precedes each row's obs_datetime, leaving NA where a station was not baited before the observation.

Below are two of my attempts. The first resulted in a dataframe that contained only the final run of the loop, and the second resulted in a dataframe containing all of the runs' results (as expected with the bind).

library(tidyverse)

#attempt #1----
for (i in 1:nrow(bait)) {

  print(paste("row =", i))

  interpbait <- interp_b %>%
    mutate(latestbait_datetime = if_else(
      plot_station_year == bait$plot_station_year[i] &
        obs_datetime >= bait$bait_datetime[i] &
        (is.na(latestbait_datetime) |
           latestbait_datetime < bait$bait_datetime[i]),
      bait$bait_datetime[i],
      latestbait_datetime
    ))

}


#attempt #2----
resultb <- data.frame()

for (i in 1:nrow(bait)) {

  print(paste("row =", i))

  interpbait2 <- interp_b %>%
    mutate(latestbait_datetime = if_else(
      plot_station_year == bait$plot_station_year[i] &
        obs_datetime >= bait$bait_datetime[i] &
        (is.na(latestbait_datetime) |
           latestbait_datetime < bait$bait_datetime[i]),
      bait$bait_datetime[i],
      latestbait_datetime
    ))

  resultb <- bind_rows(resultb, interpbait2)

  print(resultb)
}
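A note on why attempt 1 keeps only the final pass: each iteration mutates a fresh copy of the original interp_b, so earlier updates are discarded. Carrying the running result forward between iterations makes the same loop accumulate its updates. A sketch of that minimal change (the frames are rebuilt compactly here with the same values as the dput above; the name interpbait is kept from attempt 1):

```r
library(dplyr)

# rebuild the question's two frames compactly (same values as the dput above)
interp_b <- tibble(
  plot_station_year = c("Cow_C2_2019", "RidingStable_C3_2018",
                        "RidingStable_C3_2018", "Raf_C1_2018", "Metcalfe_C2_2019"),
  obs_datetime = as.POSIXct(c(1559487600, 1544954400, 1541084400,
                              1515160800, 1567756800),
                            origin = "1970-01-01", tz = "UTC"),
  latestbait_datetime = as.POSIXct(rep(NA_real_, 5),
                                   origin = "1970-01-01", tz = "UTC")
)
bait <- tibble(
  plot_station_year = c("Cow_C2_2019", "Cow_C2_2019",
                        "RidingStable_C3_2018", "Raf_C1_2018"),
  bait_datetime = as.POSIXct(c(1557500400, 1559746800,
                               1543676400, 1491318000),
                             origin = "1970-01-01", tz = "UTC")
)

# start from a copy and keep mutating that copy, so updates from earlier
# bait rows survive later iterations
interpbait <- interp_b

for (i in seq_len(nrow(bait))) {
  interpbait <- interpbait %>%
    mutate(latestbait_datetime = if_else(
      plot_station_year == bait$plot_station_year[i] &
        obs_datetime >= bait$bait_datetime[i] &
        (is.na(latestbait_datetime) |
           latestbait_datetime < bait$bait_datetime[i]),
      bait$bait_datetime[i],
      latestbait_datetime
    ))
}
```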

Any help would be greatly appreciated.


2 Answers


I'm not sure how long this will take, but here is a tidyverse solution. For each row in interp_b, we filter the bait dataframe to the matching plot_station_year and keep only bait date-times earlier than that row's obs_datetime. Then we arrange the filtered bait data by descending datetime (most recent dates on top), slice off the first row so that we get only the most recent date, "pull" the date-time out of the dataframe so we have a value rather than a tibble, and write it into the appropriate row of interp_b.

library(tidyverse)
library(progress) # for progress bar

# create progress bar to update, so that you can estimate the amount of time it will take to finish the entire loop
pb <- progress_bar$new(total = nrow(interp_b))

for (i in 1:nrow(interp_b)) {

  last_time_baited <- bait %>% 
    #filter bait dataframe to appropriate plot, station, year based on
    # the row in interp_b
    filter(plot_station_year == interp_b$plot_station_year[i],
           # ensure all datetimes are less than that row in interp_b
           bait_datetime < interp_b$obs_datetime[i]) %>% 
    # arrange by datetime (most recent datetimes first)
    arrange(desc(bait_datetime)) %>% 
    # take the top row - this will be the most recent date-time that
    # the plot-station was baited
    slice(1) %>% 
    # "pull" that value out of the dataframe so you have a value, 
    # not a tibble
    pull(bait_datetime) # 

  # update the row in interp_b with the date-time baited;
  # skip rows with no prior baiting, since assigning a
  # zero-length result into one element would error
  if (length(last_time_baited) == 1) {
    interp_b$latestbait_datetime[i] <- last_time_baited
  }

  pb$tick() # print progress
}

The resulting table matches your expected output (interp_b):

# A tibble: 5 x 3
  plot_station_year    obs_datetime        latestbait_datetime
  <chr>                <dttm>              <dttm>             
1 Cow_C2_2019          2019-06-02 15:00:00 2019-05-10 11:00:00
2 RidingStable_C3_2018 2018-12-16 10:00:00 2018-12-01 10:00:00
3 RidingStable_C3_2018 2018-11-01 15:00:00 NA                 
4 Raf_C1_2018          2018-01-05 14:00:00 2017-04-04 11:00:00
5 Metcalfe_C2_2019     2019-09-06 08:00:00 NA  
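If the per-row loop proves slow on the full table, the same logic can be expressed without an explicit loop. Here is a join-then-slice sketch in plain dplyr; the `.row` helper column and the `interp_b_filled` name are additions of this sketch, not from the question (the frames are rebuilt compactly with the same values as the question's dput):

```r
library(dplyr)

# rebuild the question's two frames compactly (same values as the dput above)
interp_b <- tibble(
  plot_station_year = c("Cow_C2_2019", "RidingStable_C3_2018",
                        "RidingStable_C3_2018", "Raf_C1_2018", "Metcalfe_C2_2019"),
  obs_datetime = as.POSIXct(c(1559487600, 1544954400, 1541084400,
                              1515160800, 1567756800),
                            origin = "1970-01-01", tz = "UTC"),
  latestbait_datetime = as.POSIXct(rep(NA_real_, 5),
                                   origin = "1970-01-01", tz = "UTC")
)
bait <- tibble(
  plot_station_year = c("Cow_C2_2019", "Cow_C2_2019",
                        "RidingStable_C3_2018", "Raf_C1_2018"),
  bait_datetime = as.POSIXct(c(1557500400, 1559746800,
                               1543676400, 1491318000),
                             origin = "1970-01-01", tz = "UTC")
)

# join candidate baitings to each observation, keep those strictly before
# the observation, and take the most recent one per observation
# (dplyr >= 1.1 may warn about a many-to-many join; that is expected here
#  because stations repeat on both sides)
latest_per_row <- interp_b %>%
  mutate(.row = row_number()) %>%   # helper index keeps duplicate stations distinct
  inner_join(bait, by = "plot_station_year") %>%
  filter(bait_datetime < obs_datetime) %>%
  group_by(.row) %>%
  slice_max(bait_datetime, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(.row, latestbait_datetime = bait_datetime)

# write the result back; observations with no prior baiting stay NA
interp_b_filled <- interp_b %>%
  select(-latestbait_datetime) %>%
  mutate(.row = row_number()) %>%
  left_join(latest_per_row, by = ".row") %>%
  select(-.row)
```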

1 Comment

Hi Nova, it appears to be running smoothly on a small sub-sample and I will try a much larger sample tomorrow. Thanks!

You could perform an outer join with data.table, and then select the highest bait_datetime for each plot_station_year.

Edit: I edited my answer to reflect the possibility that there could be multiple obs_datetime for a given unique plot_station_year in interp2. To preserve these, we index them and include the index in the filtering step.

One potential improvement with large files (not tested) could be to merge using roll, instead of performing an outer merge and then filtering.

That version is shown at the end of the reproducible example:

library(data.table)

interp2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", "RidingStable_C3_2018", 
    "Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600, 1559487300,
        1544954400, 1515160800, 1567756800), class = c("POSIXct", "POSIXt"
        ), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_, 
            NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct", "POSIXt"))), class = c("spec_tbl_df", 
                "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))

bait2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019",  "Cow_C2_2019",
    "RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400, 
        1496674800, 1576674800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
        ), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl", 
            "data.frame"), row.names = c(NA, -5L), spec = structure(list(
                cols = list(plot_station_year = structure(list(), class = c("collector_character", 
                    "collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime", 
                        "collector"))), default = structure(list(), class = c("collector_guess", 
                            "collector")), skip = 1), class = "col_spec"))


# add index idx by plot_station_year, remove empty column, set keys
setDT(interp2)[, "latestbait_datetime" := NULL][, idx := 1:.N, by=plot_station_year]
setkeyv(interp2, c("plot_station_year", "idx", "obs_datetime"))

# same for bait2: set as data.table, set keys
setDT(bait2, key=c("plot_station_year", "bait_datetime"))

## option 1: merge files, then filter
# outer join on interp2 and bait2 on first column (and order by bait_datetime)
expected_out <- merge(interp2, bait2, by="plot_station_year", all=TRUE)

# set keys for sorting
setkey(expected_out, plot_station_year, idx, bait_datetime)

# select highest bait_datetime below obs_datetime by plot_station_year and idx
expected_out <- expected_out[is.na(bait_datetime) | bait_datetime < obs_datetime][,
    tail(.SD, 1), by=.(plot_station_year, idx)]

# rename and sort columns
setnames(expected_out, old="bait_datetime", new="latestbait_datetime")
setorder(expected_out, -latestbait_datetime, idx, na.last = TRUE)[]
#>       plot_station_year idx        obs_datetime latestbait_datetime
#> 1:          Cow_C2_2019   1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2:          Cow_C2_2019   2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018   1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4:          Raf_C1_2018   1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5:     Metcalfe_C2_2019   1 2019-09-06 08:00:00                <NA>


## option 2 (might use less memory): rolling join

bait2[, latestbait_datetime := bait_datetime]
out_alt <- bait2[interp2, .(plot_station_year, obs_datetime, idx, latestbait_datetime), 
    on=c("plot_station_year", "bait_datetime==obs_datetime"), roll=Inf]

# order
setorder(out_alt, -latestbait_datetime, idx, na.last = TRUE)[]
#>       plot_station_year        obs_datetime idx latestbait_datetime
#> 1:          Cow_C2_2019 2019-06-02 15:00:00   1 2019-05-10 15:00:00
#> 2:          Cow_C2_2019 2019-06-02 14:55:00   2 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018 2018-12-16 10:00:00   1 2018-12-01 15:00:00
#> 4:          Raf_C1_2018 2018-01-05 14:00:00   1 2017-04-04 15:00:00
#> 5:     Metcalfe_C2_2019 2019-09-06 08:00:00   1                <NA>
setcolorder(out_alt, c(1,3,2,4))[]
#>       plot_station_year idx        obs_datetime latestbait_datetime
#> 1:          Cow_C2_2019   1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2:          Cow_C2_2019   2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018   1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4:          Raf_C1_2018   1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5:     Metcalfe_C2_2019   1 2019-09-06 08:00:00                <NA>

## test that both options give the same result:

identical(expected_out, out_alt)
#> [1] TRUE
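Since the stated end goal is a time-since-bait value for each observation, the last step once latestbait_datetime is filled (by either answer's approach) is a single difftime() call. A sketch with one pair of values from the example; units = "hours" is an arbitrary choice:

```r
# one observation time and its matching last-bait time from the example data
obs  <- as.POSIXct(1559487600, origin = "1970-01-01", tz = "UTC")
last <- as.POSIXct(1557500400, origin = "1970-01-01", tz = "UTC")

# elapsed time between last baiting and the observation
time_since_bait <- difftime(obs, last, units = "hours")

# the vectorized form over the whole filled frame would be:
# interp_b$time_since_bait <-
#   difftime(interp_b$obs_datetime, interp_b$latestbait_datetime, units = "hours")
```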

4 Comments

Thanks user12728748, I will take a look at your solution. In the actual database, interp_b has 3.5 million records/rows, and each plot has multiple obs_datetimes, so I am not looking for the largest value but for the value that is closest to, but prior to, the obs_datetime. This is a database of trail-camera observations, and the second data table (bait) lists the times each station was baited. In the end, what I will have is a time-since-bait value for each observation.
I also updated the names of the dataframes in my example to match my code.
If "each plot has multiple obs_datetimes" means plot_station_year in interp2 is not unique, and you want to keep them all, you need to add an index and include it when selecting the value that is closest to but prior to the obs_datetime. You should provide an example that makes that clear. I could edit the answer to reflect that, then.
Well, if you want to filter on largest value below obs_datetime and preserve the stations with NA values, just replace expected_out <- expected_out[, tail(.SD, 1), by=plot_station_year] with expected_out <- expected_out[is.na(bait_datetime) | bait_datetime < obs_datetime][, tail(.SD, 1), by=plot_station_year].
