Create tables by using data.table and a for loop for multiple columns

Question

I need to speed up code using data.table. I am getting stuck on how to reference variables that are being indexed from a vector.

data:

df <- data.frame(
  id=c(1,1,1,2,2,2,3,3,3),
  year=as.character(c(2014, 2015, 2016, 2015, 2015, 2016, NA, NA, 2016)),
  code=c(1,2,2, 1,2,3, 3,4,5),
  dv1=1:9,
  dv2=2:10
) %>% as.data.table()

dtplyr code:

cols <- c("dv1", "dv2")

test <- function(data, columns, group) {
for(i in seq_along(columns)) {
 sub1 <- df %>% 
   select("id", columns[i], group) %>%
   group_by(.data[[group]]) %>%
   summarise(mean=mean(.data[[columns[i]]], na.rm=T), sd=sd(.data[[columns[i]]], na.rm=T)) %>%
   ungroup() %>%
   as_tibble() 
 print(sub1)
}
}

data.table attempt:

test <- function(data, columns, group) {
  for(i in seq_along(columns)) {
    sub1 <- df %>% 
      .[, .(id, columns[i], group)] %>%
      .[, .(mean(.data[[columns[i]]], na.rm=T), sd=sd(.data[[columns[i]]], na.rm=T)), by=.data[[group]]] %>%
      as_tibble() 
    print(sub1)
  }
}

test(data=df, columns=cols, group="year")

This works on a single variable:

df %>% 
  .[, .(id, dv1, year)] %>%
  .[, .(mean(dv1, na.rm=T), sd=sd(dv1, na.rm=T)), by=year] %>%
  as_tibble()

I think I'm missing something here - why do you need a loop? What is wrong with df %>% group_by(group) %>% summarise(across(all_of(cols), c(mean = mean, sd = sd))) ? — jared_mamrot
– jared_mamrot, Commented Mar 17, 2021 at 22:13
See e.g. Apply multiple functions to multiple columns in data.table; Multiple functions on multiple columns by group, and create informative column names; Calculate multiple aggregations on several variables using lapply(.SD, …). — Henrik
– Henrik, Commented Mar 17, 2021 at 22:24
@jared_mamrot The code is part of a larger function that is easier to write with a for loop I believe. I am trying to convert the code to data.table code for speed and to avoid increased use of memory. — EML
– EML, Commented Mar 17, 2021 at 22:24
@Henrik Any references where a for loop is used specifically? Thanks! — EML
– EML, Commented Mar 18, 2021 at 0:51

Ronak Shah · Accepted Answer · 2021-03-18 05:04:27Z

2

.data is not used in data.table
You don't need select here and that is why you also don't need .[, .(id, columns[i], group)] in data.table version.
You can use get to get column values based on string.

Since this is just an example I have not tried to simplify the loop so that you can add more complicated stuff in there later.

library(data.table)

cols <- c("dv1", "dv2")

test <- function(data, columns, group) {
  for(i in columns) {
    sub1 <-df[, .(mean(get(i), na.rm=T), sd=sd(get(i), na.rm=T)), by=year]
    print(sub1)
  }
}

test(data=df, columns=cols, group="year")

#   year   V1    sd
#1: 2014 1.00    NA
#2: 2015 3.67 1.528
#3: 2016 6.00 3.000
#4: <NA> 7.50 0.707

#   year   V1    sd
#1: 2014 2.00    NA
#2: 2015 4.67 1.528
#3: 2016 7.00 3.000
#4: <NA> 8.50 0.707

answered Mar 18, 2021 at 5:04

Ronak Shah

391k20 gold badges173 silver badges237 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

EML Over a year ago

This is very instructive. I attempted to simplify but I do need the select step. This step (.[, .(id, get(columns[i]), get(group))]) gets me the correct data but the column names change to V1 and V2. Any advice on how to keep the column names intact while selecting?

Ronak Shah Over a year ago

Why exactly do you need the select step? What changes in the output if you use it or not?

EML Over a year ago

There is an additional step left out here where I remove duplicates on these columns. Agreed that here it does not make a difference.

Ronak Shah Over a year ago

In the loop you can create a vector of column names to select. col <- c(i, group, 'id') and then use it as df[, ..col] to select the columns.

Sirius · Accepted Answer · 2021-03-18 01:00:26Z

1

This likely will require a fairly unintuitive as.list/unlist construction:


df <- data.frame(
  id=c(1,1,1,2,2,2,3,3,3),
  year=as.character(c(2014, 2015, 2016, 2015, 2015, 2016, NA, NA, 2016)),
  code=c(1,2,2, 1,2,3, 3,4,5),
  dv1=1:9,
  dv2=2:10
) %>% as.data.table()

summary.func <- function(x) {
    list( mean=mean(x), sd=sd(x) )
}

df[, as.list(unlist(lapply(.SD, summary.func))), by=group, .SDcols=cols ]

It produces:


   year dv1.mean    dv1.sd dv2.mean    dv2.sd
1: 2014 1.000000        NA 2.000000        NA
2: 2015 3.666667 1.5275252 4.666667 1.5275252
3: 2016 6.000000 3.0000000 7.000000 3.0000000
4: <NA> 7.500000 0.7071068 8.500000 0.7071068

answered Mar 18, 2021 at 1:00

Sirius

5,4492 gold badges16 silver badges27 bronze badges

Collectives™ on Stack Overflow

Create tables by using data.table and a for loop for multiple columns

2 Answers 2

4 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related