Replicate rows in data.table using passed column names in user defined function environment

Question

I'm passing a data.table to a function I've defined. I want the function to replicate the rows that meet a certain condition and then return the updated data.table. I'm having trouble building this inside the function from the passed column names, largely because I don't fully understand the scoping involved or the right way to reference the data.table's column names from within the function environment. So far I've just been throwing eval() and get() at it hoping something will work, to no avail.

Here's a MWE to get a sense:

library(data.table)

dt <- data.table(region_name = rep(paste0("town_", 1:100), 3),
                 yr = c(rep(2000, 100), rep(2010, 100), rep(2023, 100)),
                 pop = c(round(runif(98, 1250, 3000)), 50000, 120000, round(runif(97, 1300, 3500)), 75103, 159382, 194013, round(runif(96, 2000, 5000)), 38492, 98418, 154923, 201348))
dt[, pop_total := lapply(.SD, sum, na.rm = TRUE), .SDcols = "pop", by = "yr"]
dt[, pop_pctl := pop_total/100]

The function is defined like so:

foo <- function(input_dt, pop_col = "", pop_pctl_col = "", loc_col = ""){
  dt <- setDT(copy(input_dt))
  
  # get ceiling because need integer multiples for copies
  dt[get(pop_col) > get(pop_pctl_col), pop_multiple := ceiling(get(pop_col)/get(pop_pctl_col))]
  dt[rep(!is.na(eval(pop_multiple)), get(pop_multiple)), eval(loc_col) := paste0(get(loc_col),1:pop_multiple)]
 
  
  return(dt)
}

When I pass the test dt through the function, I get the following error:

test <- foo(dt, pop_col = "pop", pop_pctl_col = "pop_pctl", loc_col = "region_name")

Error in .checkTypos(e, names_x) : invalid 'times' argument

I've tried different combinations, e.g. dt[rep(!is.na(eval(pop_multiple)), eval(pop_multiple)), eval(loc_col) := paste0(get(loc_col),1:pop_multiple)] (i.e. swapping some of the get() calls for eval() in the i of the data.table), but none of the combinations I've tried work. The real issue is that I'm not sure of the right way to use get() and eval() in these contexts.

From what I understand the most efficient way to replicate values in a dt is to do dt[rep(var1,var2)] and that's the form that I'm trying to follow.

I want to replicate the rows that have a defined pop_multiple value, creating a number of copies of that row that is equal to the pop_multiple value, and then afterwards add a column that is simply an index of the copy of that row.

This seems to do what I want:

dt[!is.na(pop_multiple),cbind(.SD,dup=1:pop_multiple), by = "pop_multiple"]

but I'm curious whether it's less efficient, because I was under the impression that the preferred method is the dt[rep(var1,var2)] form.

langtang · Accepted Answer · 2025-09-01 15:53:14Z

3

I think you may want a function like the following, leveraging env=list(..):

foo <- function(dt, num, den) {
  dt[, .SD[rep(1, ceiling(x/y))], 1:nrow(dt), env=list(x=num, y=den)] |> 
    _[, index:=1:.N, nrow] |> 
    _[, nrow:=NULL]
}

Usage:

foo(dt, "pop", "pop_pctl")[]

Output:

     region_name    yr    pop pop_total pop_pctl index
          <char> <num>  <num>     <num>    <num> <int>
  1:      town_1  2000   2644    369415  3694.15     1
  2:      town_2  2000   2646    369415  3694.15     1
  3:      town_3  2000   1412    369415  3694.15     1
  4:      town_4  2000   1470    369415  3694.15     1
  5:      town_5  2000   2478    369415  3694.15     1
 ---                                                  
461:    town_100  2023 201348    820605  8206.05    21
462:    town_100  2023 201348    820605  8206.05    22
463:    town_100  2023 201348    820605  8206.05    23
464:    town_100  2023 201348    820605  8206.05    24
465:    town_100  2023 201348    820605  8206.05    25
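
For comparison, here is a rough sketch of the dt[rep(var1, var2)] form mentioned in the question, written as a function that takes column names. It is not part of the answer above: the helper name replicate_rows and the toy table are made up for illustration, and it assumes data.table >= 1.15.0 (for the env argument). It indexes with repeated row numbers instead of grouping by every row, which may be faster on large tables, though that is not benchmarked here.

```r
library(data.table)

# Hypothetical helper (illustration only): replicate each row
# ceiling(num / den) times and number the copies.
replicate_rows <- function(dt, num, den) {
  # Character values in env are substituted as column names.
  times <- dt[, ceiling(x / y), env = list(x = num, y = den)]
  # Row i appears times[i] times in the index vector.
  idx <- rep(seq_len(nrow(dt)), times)
  out <- dt[idx]  # a lone symbol in i is looked up in the calling scope
  # rowid() counts 1, 2, ... within each run of the same original row.
  set(out, j = "index", value = rowid(idx))
  out[]
}

toy <- data.table(id = c("a", "b", "c"), pop = c(5, 25, 10), pctl = c(10, 10, 10))
res <- replicate_rows(toy, "pop", "pctl")
print(res)
```

As in the answer's function, rows whose ratio is below 1 are kept once, because ceiling() rounds them up to 1.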


2 Comments

BLP92 Sep 2 at 11:26

This is exactly what I was looking for, thank you! This is way more efficient and neater than what I would've been able to come up with. Do you have any resources you can recommend about the piping and dt referencing nomenclature here? At least I assume |> and _[] are, respectively, piping and then referencing the passed dt. And this is more efficient since the function call with [] appended avoids copying/using a return statement, correct?

langtang Sep 4 at 1:25

The use of the native pipe |> followed by the underscore (_) placeholder has, I believe, only aesthetic benefit (but others may correct me). data.table chaining without the pipe is also possible. For example: x[,z:=10][, bar(y,z), by="id"][, std(V1)]
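
To make the comparison in this comment concrete, the sketch below runs the same two steps once with chained brackets and once with the native pipe. The toy table and column names are made up for illustration. Since x |> _[...] is rewritten by the parser to x[...], the two forms should give the same result. Note that using _ as the head of an extraction call needs R >= 4.3.0.

```r
library(data.table)

toy <- data.table(id = c("a", "a", "b"), y = c(1, 2, 3))

# Chained brackets: each [...] operates on the result of the previous one.
chained <- copy(toy)[, z := 10][, .(total = sum(y + z)), by = "id"]

# Native pipe with the `_` placeholder as the head of an extraction call.
piped <- copy(toy) |>
  _[, z := 10] |>
  _[, .(total = sum(y + z)), by = "id"]

same <- isTRUE(all.equal(chained, piped))
print(same)
```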

I_O · Accepted Answer · 2025-09-01 10:20:35Z

2

You can pass additional arguments (...) to foo, convert them to a named list, and provide this list as the env argument of the data.table call, as in this example:

foo <- function(input_dt, ...){
  dots <- list(...) ## addt. arguments to named list "dots"
  dt <- setDT(copy(input_dt))
  
  dt[pop_col > pop_pctl_col, 
     pop_multiple := ceiling(pop_col / pop_pctl_col),
     env = dots ## add "dots" to scope
  ]
}
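
A usage sketch may help here, since the arguments have to be passed by name for env to work. The function name flag_multiples and the toy table are hypothetical, and the example assumes data.table >= 1.15.0. The names given in the call (pop_col, pop_pctl_col) must match the placeholder symbols used inside the brackets.

```r
library(data.table)

# Hypothetical variant of the function above that also returns the table.
flag_multiples <- function(input_dt, ...) {
  dots <- list(...)  # must be named, e.g. pop_col = "pop"
  dt <- setDT(copy(input_dt))
  dt[pop_col > pop_pctl_col,
     pop_multiple := ceiling(pop_col / pop_pctl_col),
     env = dots]  # each placeholder is replaced by the named column
  dt[]
}

toy <- data.table(pop = c(5, 25), pop_pctl = c(10, 10))
res <- flag_multiples(toy, pop_col = "pop", pop_pctl_col = "pop_pctl")
print(res)
```

Rows that do not satisfy the condition are left with NA in pop_multiple, which matters for any later rep() call.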


3 Comments

I_O Sep 1 at 10:23

However, you might rather be looking to create a template of all desired replications (think e.g. outer and expand.grid) to which you then join your data?

BLP92 Sep 1 at 11:21

Hm, when I try this I get the following error:

Error in substitute2(`:=`(pop_multiple, ceiling(get(pop_col)/get(pop_pctl_col))),  :     'env' argument does not have names

To be clear, I did: env_list <- list(pop_col, pop_pctl_col,loc_col) and applied it to the dt mutation for the line:

dt[rep(!is.na(eval(pop_multiple)), get(pop_multiple)), eval(loc_col) := paste0(get(loc_col),1:pop_multiple), env = env_list]

since the line before that was working as expected. Looking around for information on that error doesn't seem to turn up results that ap

BLP92 Sep 1 at 11:58

Realized I made a mistake and didn't make it as a named list as you said. Added names(env_list) <- c("pop_col", "pop_pctl_col", "loc_col") and then ran it and got the same error of invalid 'times' argument: Error in .checkTypos(e, names_x) : invalid 'times' argument
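
One likely contributor to the invalid 'times' argument message, assuming pop_multiple is NA for the rows that did not meet the pop > pop_pctl condition: base rep() refuses NA counts. The sketch below uses a made-up vector as a stand-in for that column. Replacing NA with 1 is only an assumption about the desired behaviour (keep such rows once).

```r
# Stand-in for pop_multiple: NA where the condition was not met.
times <- c(NA, 3, 1)

# rep() raises an error when any count is NA; capture the message.
err <- tryCatch(
  rep(seq_along(times), times),
  error = function(e) conditionMessage(e)
)
print(err)

# Assumed fix: treat NA as "keep one copy" before building the index.
times_ok <- replace(times, is.na(times), 1)
idx <- rep(seq_along(times_ok), times_ok)
print(idx)
```

The resulting idx can then be used as a row index, e.g. dt[idx], in the dt[rep(var1, var2)] style described in the question.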
