I am trying to apply multiple functions to multiple columns of a data.table. Example:

library(data.table)

DT <- data.table(a = 1:5,
                 b = 2:6,
                 c = 3:7)

Let's say I want to get the mean and the median of columns a and b. This works:

stats <- DT[,.(mean_a=mean(a),
               median_a=median(a),
               mean_b=mean(b),
               median_b=median(b))]

But it is way too repetitive. Is there a nice way to achieve a similar result using .SDcols and lapply?

  • Why not put the functions into a custom function and call that? Commented Apr 14, 2015 at 6:48
  • Or, maybe look at the development version of "data.table" where dcast can handle multiple column aggregations at once. Commented Apr 14, 2015 at 6:49
  • This may be easier using dplyr: summarise_each(DT, funs(mean, median), 1:2) Commented Apr 14, 2015 at 6:50
  • This'll be better when colwise() is implemented. Commented Apr 14, 2015 at 9:42

5 Answers

I'd normally do this:

my.summary = function(x) list(mean = mean(x), median = median(x))

DT[, unlist(lapply(.SD, my.summary)), .SDcols = c('a', 'b')]
#a.mean a.median   b.mean b.median 
#     3        3        4        4 

8 Comments

I had a similar idea but thought the OP wanted a data.table output instead of a vector: DT[, as.list(unlist(lapply(.SD, my.summary))), .SDcols = c('a', 'b')]
You could also probably simplify to my.summary = function(x) c(mean = mean(x), median = median(x)) ; DT[, sapply(.SD, my.summary), .SDcols = a:b]
But this seems to be awfully slow if I add a group by category: DT[, as.list(unlist(lapply(.SD, my.summary))), by=category, .SDcols=c('a', 'b') ]. This takes much longer than doing each summary individually and then joining. Is there a faster way to do this? I have about 1.5 million groups within the category column. @akrun
Worth mentioning that the output is quite different (long) when adding a by grouping! What would you do then?
I think the code with by would be: as.list(unlist(lapply(...?
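
The grouped variant discussed in these comments can be sketched as follows. This is a minimal illustration, not a performance recommendation; the `category` column and its values are assumptions made up for the example, and `my.summary` is the helper from the answer above.

```r
library(data.table)

# Hypothetical data: the `category` column is an assumption for illustration.
DT <- data.table(category = c("x", "x", "y", "y", "y"),
                 a = 1:5,
                 b = 2:6)

my.summary <- function(x) list(mean = mean(x), median = median(x))

# as.list() turns the named vector produced by unlist() back into a list,
# so each name becomes a column and each group yields a single row.
res <- DT[, as.list(unlist(lapply(.SD, my.summary))),
          by = category, .SDcols = c("a", "b")]
print(res)
```

Without the as.list() wrapper, each group returns a vector, which is why the grouped output comes out long rather than wide.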

Other answers show how to do it, but no one bothered to explain the basic principle. The basic rule is that elements of lists returned by j expressions form the columns of the resulting data.table. Any j expression that produces a list, each element of which corresponds to a desired column in the result, will work. With this in mind we can use

DT[, c(mean = lapply(.SD, mean),
       median = lapply(.SD, median)),
  .SDcols = c('a', 'b')]
##    mean.a mean.b median.a median.b
## 1:      3      4        3        4

or

DT[, unlist(lapply(.SD,
                   function(x) list(mean = mean(x),
                                    median = median(x))),
            recursive = FALSE),
   .SDcols = c('a', 'b')]
##    a.mean a.median b.mean b.median
## 1:      3        3      4        4

depending on the desired order.
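
As a further small sketch of the same rule, j can mix hand-written list elements with the output of lapply(), as long as c() flattens everything into one list. The column name `n` is an assumption chosen for this example.

```r
library(data.table)

DT <- data.table(a = 1:5, b = 2:6, c = 3:7)

# j returns one flat list: a hand-written element plus the lapply() results.
# Each list element becomes one column of the result.
res <- DT[, c(list(n = .N), lapply(.SD, mean)), .SDcols = c("a", "b")]
print(res)
```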

Importantly, we can use any method we want to produce the desired result, provided only that we arrange the result into a list as described above. For example,

library(matrixStats)
DT[, c(mean = as.list(colMeans(.SD)),
       median = setNames(as.list(colMedians(as.matrix(.SD))), names(.SD))),
   .SDcols = c('a', 'b')]
##    mean.a mean.b median.a median.b
## 1:      3      4        3        4

also works.

2 Comments

I think the first example does not rename columns, but otherwise this was a very useful answer. Thanks!
Best answer that does not deviate from the principle. I've always used .() for getting multiple outputs, but combining this with lapply like .(lapply(.SD), other = function(col) ) did not work well. I realized using c(lapply(.SD,func), other = function(col)) was the right approach.

This is a little bit clumsy but does the job with data.table:

funcs = c('median', 'mean', 'sum')

m = DT[, lapply(.SD, function(u){
        sapply(funcs, function(f) do.call(f,list(u)))
     })][, t(.SD)]
colnames(m) = funcs

#  median mean sum
#a      3    3  15
#b      4    4  20
#c      5    5  25

3 Comments

Adding a new dependency just for one t() call seems like a bit of overhead; why not use chaining? m = DT[...][, t(.SD)]. I think it is also more readable.
How would you include a function that takes an additional argument? Example: quantile(., 0.25).
Use Curry from the functional package and define funcs = c(median, sum, Curry(quantile, probs=0.25)). But you will have to define the colnames yourself at this stage.
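
One way to handle extra arguments without another package is sketched below. It swaps Curry for a plain anonymous function inside a named list, so the names can be reused as labels; this is an alternative to the suggestion above, not the answerer's method, and `q25` and `stat` are hypothetical names chosen for the example.

```r
library(data.table)

DT <- data.table(a = 1:5, b = 2:6, c = 3:7)

# Named list of functions; the anonymous wrapper fixes the extra argument.
# `q25` is a hypothetical label chosen for this sketch.
funcs <- list(median = median,
              q25    = function(x) quantile(x, probs = 0.25, names = FALSE))

# One row per statistic; the `stat` column carries the names of `funcs`.
res <- DT[, c(list(stat = names(funcs)),
              lapply(.SD, function(u) sapply(funcs, function(f) f(u)))),
          .SDcols = c("a", "b")]
print(res)
```

Because the functions are stored as objects rather than as strings, do.call() is not needed here.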

Use dcast:

DT$dday <- 1 # add a constant column to cast on
dt <- dcast(DT, dday ~ dday, fun.aggregate = list(sum, mean), value.var = c('a', 'b'))
# dday a_sum_1 b_sum_1 a_mean_1 b_mean_1
# 1      15      20        3        4

dcast can also be used for one-hot encoding and feature engineering.

This might be a little over-engineered, but if you come from dplyr's summarize_at() you might want a similarly structured result.

First define a function lapply_at() which takes a .SD and a character vector of function names as inputs. Then you can easily compute your desired statistics and get a readable result.

library(data.table)
iris_dt <- as.data.table(iris)

lapply_at <- function(var, funs, ...) {
  results <- sapply(var, function(var) {
    lapply(funs, do.call, list(var, ...))
  })
  names(results) <- vapply(names(var), paste, funs, sep = "_", 
                           FUN.VALUE = character(length(funs)),
                           USE.NAMES = FALSE)
  results
}

iris_dt[, lapply_at(.SD, c("mean", "sd"), na.rm = TRUE), 
        .SDcols = patterns("^Sepal"),
        by = Species]

#>       Species Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean
#> 1:     setosa             5.006       0.3524897            3.428
#> 2: versicolor             5.936       0.5161711            2.770
#> 3:  virginica             6.588       0.6358796            2.974
#>    Sepal.Width_sd
#> 1:      0.3790644
#> 2:      0.3137983
#> 3:      0.3224966

Created on 2019-07-03 by the reprex package (v0.2.0).
