I'm passing a data.table to a function I've defined. The function should replicate the rows that meet a certain condition and then return the updated data.table. I'm having trouble writing it using the column names passed as arguments, largely because I don't fully understand scoping here, or the right way to reference columns by name inside a data.table call within the function's environment. So far I've just been throwing eval() and get() at it hoping something will work, to no avail.

Here's a MWE to get a sense:

library(data.table)
dt <- data.table(region_name = rep(paste0("town_", 1:100), 3),
                 yr = c(rep(2000, 100), rep(2010, 100), rep(2023, 100)),
                 pop = c(round(runif(98, 1250, 3000)), 50000, 120000, round(runif(97, 1300, 3500)), 75103, 159382, 194013,
                         round(runif(96, 2000, 5000)), 38492, 98418, 154923, 201348))
# total population per year, then 1% of that total
dt[, pop_total := lapply(.SD, sum, na.rm = TRUE), .SDcols = "pop", by = "yr"]
dt[, pop_pctl := pop_total / 100]

The function is defined like so:

foo <- function(input_dt, pop_col = "", pop_pctl_col = "", loc_col = ""){
  dt <- setDT(copy(input_dt))
  
  # get ceiling because need integer multiples for copies
  dt[get(pop_col) > get(pop_pctl_col), pop_multiple := ceiling(get(pop_col)/get(pop_pctl_col))]
  dt[rep(!is.na(eval(pop_multiple)), get(pop_multiple)), eval(loc_col) := paste0(get(loc_col),1:pop_multiple)]
 
  
  return(dt)
}

When I pass the test dt through the function, I get the following error:

test <- foo(dt, pop_col = "pop", pop_pctl_col = "pop_pctl", loc_col = "region_name")
Error in .checkTypos(e, names_x) : invalid 'times' argument

I've tried different combinations, e.g. dt[rep(!is.na(eval(pop_multiple)), eval(pop_multiple)), eval(loc_col) := paste0(get(loc_col),1:pop_multiple)] (i.e. swapping some of the get() calls for eval() in the i of the data.table), but none of the combinations I've tried work. The real issue is that I'm not sure of the right way to use get() and eval() in these contexts.

From what I understand the most efficient way to replicate values in a dt is to do dt[rep(var1,var2)] and that's the form that I'm trying to follow.

I want to replicate the rows that have a defined pop_multiple value, creating a number of copies of that row that is equal to the pop_multiple value, and then afterwards add a column that is simply an index of the copy of that row.

This seems to do what I want:

dt[!is.na(pop_multiple),cbind(.SD,dup=1:pop_multiple), by = "pop_multiple"]

but I'm curious whether it's less efficient, because I was under the impression that the dt[rep(var1,var2)] form is the preferred method.

2 Answers

I think you may want a function like the following, leveraging env=list(..):

foo <- function(dt, num, den) {
  dt[, .SD[rep(1, ceiling(x/y))], 1:nrow(dt), env=list(x=num, y=den)] |> 
    _[, index:=1:.N, nrow] |> 
    _[, nrow:=NULL]
}

Usage:

foo(dt, "pop", "pop_pctl")[]

Output:

     region_name    yr    pop pop_total pop_pctl index
          <char> <num>  <num>     <num>    <num> <int>
  1:      town_1  2000   2644    369415  3694.15     1
  2:      town_2  2000   2646    369415  3694.15     1
  3:      town_3  2000   1412    369415  3694.15     1
  4:      town_4  2000   1470    369415  3694.15     1
  5:      town_5  2000   2478    369415  3694.15     1
 ---                                                  
461:    town_100  2023 201348    820605  8206.05    21
462:    town_100  2023 201348    820605  8206.05    22
463:    town_100  2023 201348    820605  8206.05    23
464:    town_100  2023 201348    820605  8206.05    24
465:    town_100  2023 201348    820605  8206.05    25
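
For comparison with the dt[rep(var1, var2)] form raised in the question, here is a minimal sketch of the same replication done by row indexing. The helper name replicate_rows and the toy table are hypothetical. Columns are looked up with [[ rather than env, and the sketch assumes neither column contains NA.

```r
library(data.table)

# Hypothetical helper (not from the answer above): repeat each row
# ceiling(num / den) times by indexing with repeated row numbers.
replicate_rows <- function(dt, num, den) {
  # dt[[name]] looks a column up by its name given as a string
  times <- ceiling(dt[[num]] / dt[[den]])   # assumes no NA in either column
  out <- dt[rep(seq_len(nrow(dt)), times)]  # row i appears times[i] times
  out[, index := sequence(times)]           # counts 1..times[i] within each original row
  out[]
}

# Made-up toy data
toy <- data.table(region_name = c("a", "b"), pop = c(5, 1), pop_pctl = c(2, 2))
res <- replicate_rows(toy, "pop", "pop_pctl")
```

On the toy table, row a is repeated three times and row b once. Subsetting with dt[rows] returns a new table, so the input is left unchanged.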

2 Comments

This is exactly what I was looking for, thank you! This is way more efficient and neater than what I would've been able to come up with. Do you have any resources you can recommend about the piping and dt referencing nomenclature here? At least I assume |> and _[] are, respectively, piping and then referencing the passed dt. And this is more efficient since the function call with [] appended avoids copying/using a return statement, correct?
The use of the native pipe |> followed by the underscore (_) placeholder has, I believe, only an aesthetic benefit (but others may correct me). data.table chaining without the pipe is also possible. For example: x[,z:=10][, bar(y,z), by="id"][, std(V1)]
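
To make the chaining mentioned in the comment concrete, here is a small sketch with made-up data. The table x and its columns are assumptions for illustration only.

```r
library(data.table)

# Made-up table for illustration
x <- data.table(id = c("a", "a", "b"), y = 1:3)

# Each [...] operates on the result of the previous one:
# 1) add z by reference, 2) aggregate per id, 3) sort descending by total
res <- x[, z := 10][, .(total = sum(y + z)), by = "id"][order(-total)]
```

Note that the first step uses :=, so x itself gains the z column. The later steps return new tables.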
You can pass additional arguments (...) to foo, convert them to a named list, and supply this list as the env argument of the data.table call, as in this example:

foo <- function(input_dt, ...){
  dots <- list(...) ## collect the additional arguments into the named list "dots"
  dt <- setDT(copy(input_dt))
  
  dt[pop_col > pop_pctl_col, 
     pop_multiple := ceiling(pop_col / pop_pctl_col),
     env = dots ## substitute the names in "dots" into the call
  ]
}
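
A possible way to call such a function is sketched below. The name flag_multiples and the toy table are hypothetical, and the sketch assumes a data.table version that supports env (1.15.0 or later, as far as I know). The key point is that every extra argument must be named, because env maps those names to column names.

```r
library(data.table)  # env= assumed available (data.table >= 1.15.0)

# Hypothetical variant of the function above, with a check that ... is fully named
flag_multiples <- function(input_dt, ...) {
  dots <- list(...)
  stopifnot(length(dots) > 0, !is.null(names(dots)), all(nzchar(names(dots))))
  dt <- copy(as.data.table(input_dt))  # work on a copy, leave the input untouched
  # pop_col and pop_pctl_col are placeholders replaced by the columns named in dots
  dt[pop_col > pop_pctl_col,
     pop_multiple := ceiling(pop_col / pop_pctl_col),
     env = dots]
  dt[]
}

# Made-up toy data
toy <- data.table(pop = c(10, 1), pop_pctl = c(4, 4))
res <- flag_multiples(toy, pop_col = "pop", pop_pctl_col = "pop_pctl")
```

Rows that do not satisfy the condition keep NA in pop_multiple, since := only assigns to the rows selected in i.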

3 Comments

However, you might instead want to build a template of all desired replications (think, e.g., of outer and expand.grid) and then join your data to it?
Hm, when I try this I get the following error: Error in substitute2(`:=`(pop_multiple, ceiling(get(pop_col)/get(pop_pctl_col))), : 'env' argument does not have names. To be clear, I did: env_list <- list(pop_col, pop_pctl_col,loc_col) and applied it to the dt mutation for the line: dt[rep(!is.na(eval(pop_multiple)), get(pop_multiple)), eval(loc_col) := paste0(get(loc_col),1:pop_multiple), env = env_list] since the line before that was working as expected. Looking around for information on that error doesn't seem to turn up results that ap
Realized I made a mistake and didn't make it a named list as you said. I added names(env_list) <- c("pop_col", "pop_pctl_col", "loc_col"), ran it again, and got the same invalid 'times' argument error: Error in .checkTypos(e, names_x) : invalid 'times' argument
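
One possible contributor to the repeated error (not confirmed in the thread): base R's rep() raises this same message when times contains NA, which would be the case for rows where pop_multiple was never assigned. A minimal base R sketch with made-up values:

```r
# Made-up stand-in for a pop_multiple column with unassigned (NA) rows
times <- c(3, NA, 2)

# rep() refuses NA in `times`; capture the message instead of stopping
msg <- tryCatch(
  {
    rep(seq_along(times), times)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Treat NA as "keep the row once" before building the row index
times_fixed <- ifelse(is.na(times), 1, times)
idx <- rep(seq_along(times_fixed), times_fixed)
```

Here idx holds row numbers with repeats, which is the kind of integer index that dt[idx] would need for replication. A logical vector in that position would only filter rows.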
