I need to speed up code using data.table. I am getting stuck on how to reference variables that are being indexed from a vector.

data:

df <- data.frame(
  id=c(1,1,1,2,2,2,3,3,3),
  year=as.character(c(2014, 2015, 2016, 2015, 2015, 2016, NA, NA, 2016)),
  code=c(1,2,2, 1,2,3, 3,4,5),
  dv1=1:9,
  dv2=2:10
) %>% as.data.table()

dtplyr code:

cols <- c("dv1", "dv2")

test <- function(data, columns, group) {
  for (i in seq_along(columns)) {
    sub1 <- data %>%
      select("id", columns[i], group) %>%
      group_by(.data[[group]]) %>%
      summarise(mean = mean(.data[[columns[i]]], na.rm = TRUE), sd = sd(.data[[columns[i]]], na.rm = TRUE)) %>%
      ungroup() %>%
      as_tibble()
    print(sub1)
  }
}

data.table attempt:

test <- function(data, columns, group) {
  for(i in seq_along(columns)) {
    sub1 <- df %>% 
      .[, .(id, columns[i], group)] %>%
      .[, .(mean(.data[[columns[i]]], na.rm=T), sd=sd(.data[[columns[i]]], na.rm=T)), by=.data[[group]]] %>%
      as_tibble() 
    print(sub1)
  }
}

test(data=df, columns=cols, group="year")

This works on a single variable:

df %>% 
  .[, .(id, dv1, year)] %>%
  .[, .(mean(dv1, na.rm=T), sd=sd(dv1, na.rm=T)), by=year] %>%
  as_tibble() 

2 Answers

  • .data is a dplyr/rlang pronoun; data.table does not recognise it.
  • You don't need select here, which is also why you don't need .[, .(id, columns[i], group)] in the data.table version.
  • You can use get() to fetch a column's values from its name given as a string.

Since this is just an example, I have kept the loop rather than simplifying it away, so that you can add more complicated steps inside it later.

library(data.table)

cols <- c("dv1", "dv2")

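test <- function(data, columns, group) {
  for (i in columns) {
    # get(i) looks up the column named by the string i; by = group takes the
    # grouping column name from the character variable `group`
    sub1 <- data[, .(mean(get(i), na.rm = TRUE), sd = sd(get(i), na.rm = TRUE)), by = group]
    print(sub1)
  }
}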

test(data=df, columns=cols, group="year")

#   year   V1    sd
#1: 2014 1.00    NA
#2: 2015 3.67 1.528
#3: 2016 6.00 3.000
#4: <NA> 7.50 0.707

#   year   V1    sd
#1: 2014 2.00    NA
#2: 2015 4.67 1.528
#3: 2016 7.00 3.000
#4: <NA> 8.50 0.707
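
The V1 header in the output above appears because the mean is left unnamed. As a rough sketch of one possible variant (not the answerer's code), the function below names both summaries, passes the column through .SDcols instead of get(), and returns the results in a list rather than printing them. The names summarise_cols, v, and res are hypothetical, and the data is the df from the question.

```r
library(data.table)

df <- data.table(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year = as.character(c(2014, 2015, 2016, 2015, 2015, 2016, NA, NA, 2016)),
  code = c(1, 2, 2, 1, 2, 3, 3, 4, 5),
  dv1  = 1:9,
  dv2  = 2:10
)

# Hypothetical helper: one summary table per requested column.
summarise_cols <- function(data, columns, group) {
  out <- lapply(columns, function(v) {
    # .SDcols = v restricts .SD to the single column named by the string v,
    # so .SD[[1L]] is that column's values within each group.
    data[, .(mean = mean(.SD[[1L]], na.rm = TRUE),
             sd   = sd(.SD[[1L]], na.rm = TRUE)),
         by = group, .SDcols = v]
  })
  setNames(out, columns)
}

res <- summarise_cols(df, c("dv1", "dv2"), "year")
res$dv1
```

Returning a list keeps each table available for later steps; print(res) would show the same numbers as the loop above, with the first summary column labelled mean instead of V1.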

4 Comments

This is very instructive. I attempted to simplify but I do need the select step. This step (.[, .(id, get(columns[i]), get(group))]) gets me the correct data but the column names change to V1 and V2. Any advice on how to keep the column names intact while selecting?
Why exactly do you need the select step? What changes in the output if you use it or not?
There is an additional step left out here where I remove duplicates on these columns. Agreed that here it does not make a difference.
In the loop, you can build a vector of the column names to select, col <- c(i, group, 'id'), and then use df[, ..col] to select those columns with their names intact.
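
The suggestion in the last comment can be sketched roughly as follows, assuming the df from the question. The names keep and sub are illustrative, and the unique() call is only an assumption standing in for the de-duplication step the asker mentions without showing.

```r
library(data.table)

df <- data.table(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year = as.character(c(2014, 2015, 2016, 2015, 2015, 2016, NA, NA, 2016)),
  code = c(1, 2, 2, 1, 2, 3, 3, 4, 5),
  dv1  = 1:9,
  dv2  = 2:10
)

group <- "year"
i     <- "dv1"

# Build the selection as a character vector. The .. prefix tells data.table
# to look up `keep` in the calling scope instead of treating it as a column.
keep <- c("id", i, group)
sub  <- unique(df[, ..keep])  # column names are preserved, unlike with get()

names(sub)
```

Selecting by a character vector avoids the V1/V2 renaming because data.table never has to evaluate an unnamed expression for each column.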

This will likely require a fairly unintuitive as.list/unlist construction:


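library(data.table)
library(magrittr)  # provides %>%

df <- data.frame(
  id=c(1,1,1,2,2,2,3,3,3),
  year=as.character(c(2014, 2015, 2016, 2015, 2015, 2016, NA, NA, 2016)),
  code=c(1,2,2, 1,2,3, 3,4,5),
  dv1=1:9,
  dv2=2:10
) %>% as.data.table()

cols  <- c("dv1", "dv2")  # columns to summarise, as in the question
group <- "year"           # grouping column, as in the question

summary.func <- function(x) {
    list( mean=mean(x), sd=sd(x) )
}

# unlist() flattens the nested per-column lists into one named vector
# (dv1.mean, dv1.sd, ...); as.list() then makes each element its own column
df[, as.list(unlist(lapply(.SD, summary.func))), by=group, .SDcols=cols ]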

It produces:


   year dv1.mean    dv1.sd dv2.mean    dv2.sd
1: 2014 1.000000        NA 2.000000        NA
2: 2015 3.666667 1.5275252 4.666667 1.5275252
3: 2016 6.000000 3.0000000 7.000000 3.0000000
4: <NA> 7.500000 0.7071068 8.500000 0.7071068
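
To see where names such as dv1.mean come from, here is a small base R sketch of what the expression does within a single group. The list sd_like is a hypothetical stand-in for .SD (it holds the 2015 rows of dv1 and dv2), and out_row is an illustrative name.

```r
summary.func <- function(x) list(mean = mean(x), sd = sd(x))

# One group's columns, roughly as .SD would supply them (a plain list here).
sd_like <- list(dv1 = c(2L, 4L, 5L), dv2 = c(3L, 5L, 6L))

nested  <- lapply(sd_like, summary.func)  # list(dv1 = list(mean, sd), dv2 = ...)
flat    <- unlist(nested)                 # one named numeric vector
out_row <- as.list(flat)                  # one list element per output column

names(out_row)
```

unlist() joins the outer and inner names with a dot, and data.table treats each element of the resulting list as one column of the group's output row.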
