I'm passing a data.table to a function I've defined. The function should replicate the rows that meet a certain condition and then return the updated data.table. I'm having trouble constructing this inside the function using the passed column names, largely because I don't fully understand how scoping works there or the right way to reference data.table columns by name within the function environment. So far I've just been throwing in random eval() and get() calls hoping something will work, to no avail.
Here's an MWE to get a sense:
library(data.table)
dt <- data.table(region_name = rep(paste0("town_", 1:100), 3), yr = c(rep(2000, 100), rep(2010, 100), rep(2023, 100)),
pop = c(round(runif(98, 1250, 3000)), 50000, 120000, round(runif(97, 1300, 3500)), 75103, 159382, 194013, round(runif(96, 2000, 5000)), 38492, 98418, 154923, 201348))
# total population per year, then one percent of that total
dt[, pop_total := lapply(.SD, sum, na.rm = TRUE), .SDcols = "pop", by = "yr"]
dt[, pop_pctl := pop_total / 100]
and the function is defined like so:
foo <- function(input_dt, pop_col = "", pop_pctl_col = "", loc_col = ""){
  dt <- setDT(copy(input_dt))
  # get ceiling because need integer multiples for copies
  dt[get(pop_col) > get(pop_pctl_col), pop_multiple := ceiling(get(pop_col)/get(pop_pctl_col))]
  dt[rep(!is.na(eval(pop_multiple)), get(pop_multiple)), eval(loc_col) := paste0(get(loc_col),1:pop_multiple)]
  return(dt)
}
When I pass the test dt through the function, I get the following error:
test <- foo(dt, pop_col = "pop", pop_pctl_col = "pop_pctl", loc_col = "region_name")
Error in .checkTypos(e, names_x) : invalid 'times' argument
I've tried different combinations, e.g. dt[rep(!is.na(eval(pop_multiple)), eval(pop_multiple)), eval(loc_col) := paste0(get(loc_col),1:pop_multiple)] (i.e. swapping some of the get() calls for eval() in the i of the data.table), but none of the combinations I've tried work. The real issue is that I'm not sure of the right way to use get() and eval() in these contexts.
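For reference, here is a minimal sketch of referencing a column whose name is held in a string, in isolation from the replication problem. The names toy, val_col and new_col are hypothetical and exist only for this illustration. It shows one common pattern, not necessarily the best one for the function above.

```r
library(data.table)

# Toy table; `toy`, `val_col`, and `new_col` are hypothetical names for illustration
toy <- data.table(id = c("a", "b", "c"), val = c(1, 5, 10))
val_col <- "val"
new_col <- "val_doubled"

# In i, get() looks up the column named by the string
big <- toy[get(val_col) > 2]

# In j, wrapping the left-hand side of := in parentheses makes data.table
# use the string's value ("val_doubled") as the name of the new column
toy[, (new_col) := get(val_col) * 2]
```

Here get() takes a character string and returns the column with that name, while the parenthesised left-hand side of := is what turns a string variable into a target column name.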
From what I understand, the most efficient way to replicate values in a data.table is dt[rep(var1,var2)], and that's the form I'm trying to follow.
I want to replicate the rows that have a defined pop_multiple value, creating a number of copies of that row that is equal to the pop_multiple value, and then afterwards add a column that is simply an index of the copy of that row.
This seems to do what I want:
dt[!is.na(pop_multiple),cbind(.SD,dup=1:pop_multiple), by = "pop_multiple"]
but I'm curious whether it's less efficient, because I was under the impression that the preferred method is the dt[rep(var1,var2)] form.
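For comparison, a minimal sketch of the rep()-based form on a toy table. It assumes that rows without a multiple should be kept once, and the names toy, n_copies and copy_id are hypothetical. Note that rep() raises "invalid 'times' argument" when times contains NA, which may be related to the error above, so the NA values are replaced before the call.

```r
library(data.table)

# Toy table; `toy`, `n_copies`, and `copy_id` are hypothetical names
toy <- data.table(id = c("a", "b", "c"), n_copies = c(NA, 2, 3))

# Assumption: rows with no multiple are kept exactly once.
# rep() does not accept NA in `times`, so substitute 1 for NA first.
times <- fifelse(is.na(toy$n_copies), 1, toy$n_copies)

# Repeat each row index `times` times, then subset the table by those indices
expanded <- toy[rep(seq_len(nrow(toy)), times)]

# Number the copies within each original row
expanded[, copy_id := seq_len(.N), by = id]
```

With this toy input, row "a" appears once, "b" twice and "c" three times, and copy_id counts the copies within each id. Whether this is faster than the by-group cbind() version on real data would need benchmarking.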