
I am struggling with a particular issue. I have searched Stack Overflow and found examples that are close but not quite what I want. The example that comes closest is here

This post (here) also comes close, but I can't get my multiple-output function to work with list().

What I want to do is create a table of aggregated values (min, max, mean, MyFunc) grouped by a key. I also have some complex functions that return multiple outputs. I could return single outputs, but that would mean running the complex function many times and would take too long.

Using Matt Dowle's example from this post, with some changes …

x <- data.table(a=1:3,b=1:6)[]
   a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6

This is the type of output I want: an aggregate table (here with only mean and sum).

agg.dt <- x[ , list(mean=mean(b), sum=sum(b)), by=a][]
   a mean sum
1: 1  2.5   5
2: 2  3.5   7
3: 3  4.5   9

This example function f returns 3 outputs. My real function is much more complex, and the constituents can't be split out like this.

f <- function(x) {list(length(x), min(x), max(x))}

Matt Dowle's suggestion in the previous post works great, but it doesn't produce an aggregate table. Instead, the aggregates are added to the main table (which is also very useful in other circumstances).

x[, c("length","min", "max"):= f(b), by=a][]
   a b length min max
1: 1 1      2   1   4
2: 2 2      2   2   5
3: 3 3      2   3   6
4: 1 4      2   1   4
5: 2 5      2   2   5
6: 3 6      2   3   6

What I really want to do (if possible), is something along these lines …

agg.dt <- x[ , list(mean=mean(b)
                       , sum=sum(b)
                       , c("length","min", "max") = f(b)
), by=a]

and return an aggregate table looking something like this …

   a mean sum length min max
1: 1  2.5   5      2   1   4
2: 2  3.5   7      2   2   5
3: 3  4.5   9      2   3   6

The only solution I can see is a two-stage process: build the tables separately, then merge/join them together. Is there a way to do this in one step?
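For reference, the two-stage approach mentioned above could look roughly like the sketch below. It assumes a variant of f that returns a named list (the f shown earlier returns unnamed elements, which data.table would label V1, V2, V3), and the intermediate names simple.dt and complex.dt are hypothetical.

```r
library(data.table)

x <- data.table(a = 1:3, b = 1:6)

# Assumption: a named-list variant of f, so the result columns get usable names
f <- function(v) list(length = length(v), min = min(v), max = max(v))

# Stage 1: the simple aggregates, one row per group
simple.dt <- x[, list(mean = mean(b), sum = sum(b)), by = a]

# Stage 2: the multi-output function, also one row per group
complex.dt <- x[, f(b), by = a]

# Join the two per-group tables on the grouping key
agg.dt <- merge(simple.dt, complex.dt, by = "a")
print(agg.dt)
```

This works, but it groups the data twice and needs the extra join, which is the overhead a single-step solution would avoid.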

  • Why do you add [] everywhere? This is unnecessary. Also, have you tried agg.dt <- x[, f(b), by=a] ; setnames(agg.dt, names(agg.dt), c("a","length","min", "max"))? Or you can modify your function and then simply run it: f <- function(x) {list(length = length(x), min = min(x), max = max(x))}; agg.dt <- x[, f(b), by=a] Commented Aug 21, 2014 at 11:38
  • Thanks for your suggestions. The trailing [] were copied from the previous post, and you are correct that in my code examples they are unnecessary. However, they can be useful in an assignment statement such as this one: x[,c("mean","sum"):=list(mean(b),sum(b)),by=a][], because the assignment silently updates the data.table. Adding the trailing square brackets simply forces data.table to print the output to the console. Commented Aug 21, 2014 at 12:11
  • @KAE You should use print for that. It's better code style and might even be more efficient. Commented Aug 22, 2014 at 6:48

1 Answer

library(data.table)
x <- data.table(a=1:3,b=1:6)
#have the function return a named list
f <- function(x) {list(length=length(x), 
                       min=min(x), 
                       max=max(x))}

# c can combine lists
# c(vector, vector, 3-list) is a 5-list
agg.dt <- x[ , c(mean=mean(b),
                 sum=sum(b),
                 f(b)), 
            by=a]

#   a mean sum length min max
#1: 1  2.5   5      2   1   4
#2: 2  3.5   7      2   2   5
#3: 3  4.5   9      2   3   6

Alternatively, drop names from f() to save the time and cost of creating the same names for each group:

f <- function(x) {list(length(x), 
                       min(x), 
                       max(x))}

agg.dt <- x[ , c(mean(b),
                 sum(b),
                 f(b)),
            by=a]

setnames(agg.dt, c("a", "mean","sum","length", "min", "max"))

This drop-names-and-put-them-back-afterwards trick (for speed when you have lots of groups) doesn't reach inside f(). f() could return anything, so that's harder for data.table to optimize automatically.
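As a quick sanity check, the named and unnamed variants can be compared to confirm that they produce the same table. This is only a sketch: f_named and f_bare are hypothetical names for the two versions of f(), and the scalar aggregates are wrapped in list() before combining with c(), a small variation on the calls above that makes the list result explicit.

```r
library(data.table)

x <- data.table(a = 1:3, b = 1:6)

# Hypothetical names for the two versions of f()
f_named <- function(v) list(length = length(v), min = min(v), max = max(v))
f_bare  <- function(v) list(length(v), min(v), max(v))

# Names created inside j for every group
agg.named <- x[, c(list(mean = mean(b), sum = sum(b)), f_named(b)), by = a]

# No names inside j; they are set once on the finished table
agg.bare <- x[, c(list(mean(b), sum(b)), f_bare(b)), by = a]
setnames(agg.bare, c("a", "mean", "sum", "length", "min", "max"))

# Compare the two results
same <- isTRUE(all.equal(agg.named, agg.bare))
print(same)
```

Whether the unnamed version is measurably faster depends on the number of groups, so it is worth timing on your own data before committing to it.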

It's also worth mentioning that base::list() no longer copies named inputs, as of R 3.1. So the common R idiom of a function f() doing some complex steps and then returning a list() of local variables at the end should be faster now.
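The idiom described above might look like the following sketch, where f() computes several local variables in one pass and returns them together in a named list. The local names n, lo and hi and the extra range output are illustrative assumptions, not part of the original function.

```r
library(data.table)

x <- data.table(a = 1:3, b = 1:6)

# Illustrative stand-in for a complex function: do the work once,
# keep the intermediate results in local variables, return them as a list
f <- function(v) {
  n  <- length(v)
  lo <- min(v)
  hi <- max(v)
  list(length = n, min = lo, max = hi, range = hi - lo)
}

# f() runs once per group and contributes four columns to the aggregate table
agg.dt <- x[, c(list(mean = mean(b), sum = sum(b)), f(b)), by = a]
print(agg.dt)
```

Because every output comes from a single call per group, the expensive part of the function is not repeated for each column.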
