building efficient for loop for user defined function: data.table

Question

I'm trying to build an efficient for loop for this function proposed by minem here: (Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table)

My data are:

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)

lsr <- cbind.data.frame(
  c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
  c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
  c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.Date((lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)

adherence <- as.data.table(adherence)

The Function proposed by minem are:

by_minem2 <- function(dt = lsr2) {
  d <- as.numeric(as.Date("2013-02-01"))
  dt[, ENDDATE2 := as.numeric(ENDDATE)]
  x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
  uid <- unique(dt$ID)
  id2 <- setdiff(uid, x$ID)
  id2 <- uid[!(uid %in% x$ID)]
  x2 <- data.table(ID = id2, V1 = 0)
  x <- rbind(x, x2)
  setkey(x, ID)
  x
}

This returns:

> by_minem2(lsr)
   ID V1
1:  1 64
2:  2  0
3:  3 63

For the loop i need to include information about which time I evaluated at so the ideal repeated output looks like this:

cbind(as.Date("2013-02-01"),by_minem2(lsr))

I then want to repeat this for different dates a few hundred times putting everything into the same data.table:

time.months <- as.Date("2013-02-01")+(365.25/12)*(0:192) #dates to evaluate at

I'm trying to do this with a for loop like this:

     for (d in min(time.months):max(time.months))
{
  by_minem <- function(dt = lsr2) {
    d <- as.numeric(d)
    dt[, ENDDATE2 := as.numeric(ENDDATE)]
    x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
    uid <- unique(dt$ID)
    id2 <- setdiff(uid, x$ID)
    id2 <- uid[!(uid %in% x$ID)]
    x2 <- data.table(ID = id2, V1 = 0)
    x <- rbind(x, x2)
    setkey(x, ID)
    xtot <- append(xtot,x) 
    xtot <- cbind(d, xtot) # i need to know time of evaluation
    xtot
  }
}

Community · Accepted Answer · 2020-06-20 09:12:55Z

As indicated in the answer to the related question Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table, this can be solved by updating in a non-equi join which is possible with data.table.

The difference to the linked question is that here we need to create the cross join CJ() of all unique IDs with the vector of dates on our own before joining with lsr.

The OP has provided a series of dates time.months whose defintion

time.months <- as.Date("2013-02-01")+(365.25/12)*(0:192) #dates to evaluate at

leads to "crooked" dates which is only visible if coerced to numeric or POSIXct:

head(lubridate::as_datetime(time.months))

[1] "2013-02-01 00:00:00 UTC" "2013-03-03 10:30:00 UTC" "2013-04-02 21:00:00 UTC"
[4] "2013-05-03 07:30:00 UTC" "2013-06-02 18:00:00 UTC" "2013-07-03 04:30:00 UTC"

The issue is that these "dates" are not aligned with midnight but start somewhere during the day. To avoid these ambiguities, the seq() function can be used

dates <- seq(as.Date("2013-02-01"), length.out = 193, by = "month")

which creates a series of dates starting on the first day of each month.

In addition, data.table's IDate class is used which stores dates as integers (4 bytes) instead of double (8 bytes). This saves memory as well as processing time because the usually faster integer arithmetic can be used.

# coerce Date to IDate
idates <- as.IDate(dates)
setDT(lsr)[, eksd := as.IDate(eksd)][, ENDDATE := as.IDate(ENDDATE)]

# cross join unique IDs with dates 
CJ(ID = lsr$ID, date = idates, unique = TRUE)[
  # intialize result column
  , AH := 0L][
    # non-equi join and ...
    lsr, on = .(ID, date >= eksd, date < ENDDATE), 
    # ... update only matching rows
    AH := as.integer(ENDDATE - x.date)][
      # reshape from long to wide format
      , dcast(.SD, ID ~ date)]

    ID 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 [...]
1:  1         64         36          5          0          0          0          0
2:  2          0          0        110         80         49         19          0
3:  3         63         35          4          0          0          0          0

Caveat

Note that above code assumes that the intervals [eksd, ENDDATE) for each ID do not overlap. This can be verified by

lsr[order(eksd), all(eksd - shift(ENDDATE, fill = 0) > 0), keyby = ID]

   ID   V1
1:  1 TRUE
2:  2 TRUE
3:  3 TRUE

In case there are overlaps, the above code can be modified to aggregate within the non-equi join using by = .EACHI.

Benchmark

In another related question data.table by = xx How do i keep the groups of length 0 when i returns no match, the OP has pointed out that performance is crucial due to the size of his production data.

According to OP's comment, lsr has 20 mio rows and 12 columns, the adherence dataset, that I'm trying not to use has 1,5 mio rows of 2 columns. In another question, the OP mentions that lsr is a few hundred mio. rows.

@minem has responded to this by providing a benchmark in his answer. We can use this benchmark data to compare the different answers.

# create benchmark data
lsr <- data.frame(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
  stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD
n <- 5e4
lsr2 <- lapply(1:n, function(x) lsr)
lsr2 <- rbindlist(lsr2, use.names = T, fill = T, idcol = T)
lsr2[, ID := as.integer(paste0(.id, ID))]

Thus, the benchmark dataset consists of 400 k rows and 150 k unique IDs:

lsr2[, .(.N, uniqueN(ID))]

        N     V2
1: 400000 150000

# pull data preparation out of the benchmark 
lsr2i <- copy(lsr2)[, eksd := as.IDate(eksd)][, ENDDATE := as.IDate(ENDDATE)]
lsr2[, ENDDATE2 := as.numeric(ENDDATE)]

# define date series
dates <- seq(as.Date("2013-02-01"), length.out = 193, by = "month")
idates <- seq(as.IDate("2013-02-01"), length.out = 193, by = "month")

# run benchmark
library(microbenchmark)
bm <- microbenchmark(
  minem = {
    dt <- copy(lsr2)
    xtot <- lapply(dates, function(d) {
      d <- as.numeric(d)
      x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
      uid <- unique(dt$ID)
      id2 <- setdiff(uid, x$ID)
      id2 <- uid[!(uid %in% x$ID)]
      if (length(id2) > 0) {
        x2 <- data.table(ID = id2, V1 = 0)
        x <- rbind(x, x2)
      }
      setkey(x, ID)
      x
    })
    for (x in seq_along(xtot)) {
      setnames(xtot[[x]], c("ID", paste0("V", x)))
    }
    xtot <- Reduce(function(...) merge(..., all = TRUE, by = "ID"), xtot)
    xtot
  },
  uwe = {
    dt <- copy(lsr2i)
    CJ(ID = dt$ID, date = idates, unique = TRUE)[, AH := 0L][
      dt, on = .(ID, date >= eksd, date < ENDDATE), 
      AH := as.integer(ENDDATE - x.date)][, dcast(.SD, ID ~ date)]
  },
  times = 1L
)
print(bm)

The result for one run shows that the non-equi join is more than 4 times faster than the lapply() approach.

Unit: seconds
  expr       min        lq      mean    median        uq       max neval
 minem 27.654703 27.654703 27.654703 27.654703 27.654703 27.654703     1
   uwe  5.958907  5.958907  5.958907  5.958907  5.958907  5.958907     1

minem · Accepted Answer · 2018-01-19 11:38:16Z

1

something like this :

dt <- lsr
dt[, ENDDATE2 := as.numeric(ENDDATE)]
s <- time.months
xtot <- lapply(s, function(d) {
  d <- as.numeric(d)
  x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
  uid <- unique(dt$ID)
  id2 <- setdiff(uid, x$ID)
  id2 <- uid[!(uid %in% x$ID)]
  if (length(id2) > 0) {
    x2 <- data.table(ID = id2, V1 = 0)
    x <- rbind(x, x2)
  }
  setkey(x, ID)
  x
})
for (x in seq_along(xtot)) {
  setnames(xtot[[x]], c("ID", paste0("V", x)))
}

xtot <- Reduce(function(...) merge(..., all = TRUE, by = "ID"), xtot)
xtot

answered Jan 19, 2018 at 11:38

minem

3,6502 gold badges19 silver badges31 bronze badges

3 Comments

Jakn09ab Over a year ago

Let me understand this: The columns in the output besides ID (V1-V11) are the times evaluated at? so those names could be replaced, in the same order, as those in time.months? Why do you reassign? lsr and time.months?

minem Over a year ago

@Jakn09ab Yes, each columns is sum(ENDDATE2 - d) for each date in times.month. You can replace the column names as you wish

Jakn09ab Over a year ago

I got some issues when implementing on my real data: uid <- unique(dt$ID was not an S4 class, but I just changed to bid <- dt[,unique(ID)]. Then it worked using time.months of length 3. I'm running it now with full length, I'll check in a few hours and report any other adjustments necessary for implementing this on the full dataset, in case anyone tries to use this at a later time.

Collectives™ on Stack Overflow

building efficient for loop for user defined function: data.table

2 Answers 2

Caveat

Benchmark

Comments

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Caveat

Benchmark

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related