I'm trying to apply a function to every row of a data.table, using multiple columns as inputs. For each input row, the output can be one or two rows (as a data.frame, matrix, or similar). My data.table has 800,000 rows.

Here is my closest attempt. What matters here is correctness, efficiency, and an output structure that is easy to work with.

library(data.table)
d0 = as.Date("2014/01/01")
sdays = seq(d0,d0+99,by=1)
gg=data.table(id=1:100,event_date = sdays)  
setkey(gg, id)

test_func = function(id,day){
  delta = day - d0
  if(delta == 0 ){
    rcomb = c(id, 0, 100, 1,0)
  } else if(delta != 100 ){
    r1 = c(id, 0, delta, 0, 0)
    r2 = c(id, delta, 100,   1, 0)
    rcomb = rbind(r1,r2)
  }
  rcomb
}

att = gg[, test_func( get("id"), get("event_date")), by=id]
att

Any ideas on how to use fast data.table tricks here? I've been at it for hours and haven't gotten much closer :/ As for the output, I would prefer a list with one entry per original row, so that I could just call do.call with rbind. Thanks!

So let me give an example of the desired output, but in a horribly inefficient way:

some_list = vector("list", 100)
for(i in 1:100) {
  some_list[[i]] <- test_func(gg$id[i], gg$event_date[i])
}
happy=do.call(rbind,some_list)
head(happy)
   [,1] [,2] [,3] [,4] [,5]
      1    0  100    1    0
r1    2    0    1    0    0
r2    2    1  100    1    0
r1    3    0    2    0    0
r2    3    2  100    1    0
r1    4    0    3    0    0
  • I think you don't need get here: gg[, test_func(id, event_date), id] Commented Feb 17, 2015 at 4:01
  • yes, good point! But it still gives the wrong result :| Commented Feb 17, 2015 at 4:08
  • What is the expected result as per your example dataset? Commented Feb 17, 2015 at 4:09
  • Not exactly. If you take a look at head(att, n=20) you'll notice an irregular pattern that joins the last parts of each vector sequentially. This is also a problem because in the data one cannot be sure whether there will be 1-row or 2-row output. Edit: In response to transpose comment Commented Feb 17, 2015 at 4:29

1 Answer


If you want to create 4 columns for your data.table, something like the following will work:

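test_func = function(day){
    # Day offset from d0 as a plain number rather than a difftime
    delta = as.numeric(day - d0)
    if(delta == 0 ){
        # One output row
        rcomb = list(0, 100, 1, 0)
    } else if(delta != 100 ){
        # Two output rows: each list element is a column, each vector position a row
        rcomb <- list(c(0, delta), c(delta, 100), c(0, 1), c(0, 0))
    }
    rcomb
}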

att = gg[, test_func(event_date), by=id]
att
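
The approach depends on j returning a list: each list element becomes an output column, and the length of the vectors in it sets how many rows that group contributes. Here is a minimal sketch of that behaviour on a toy table. The names dt, rows_for, lo and hi are made up for illustration and do not come from the question.

```r
library(data.table)

dt <- data.table(id = 1:3)

# Hypothetical helper: one output row for id 1, two rows for any other id.
# Every branch returns a list with the same number of elements (columns).
rows_for <- function(i) {
  if (i == 1) {
    list(lo = 0, hi = 100)
  } else {
    list(lo = c(0, i), hi = c(i, 100))
  }
}

# Inside j, id is the length-1 value of the grouping column for each group.
out <- dt[, rows_for(id), by = id]
print(out)
```

Groups that return length-2 vectors contribute two rows and the rest contribute one, so a variable number of rows per input row needs no matrix or rbind inside the function.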

3 Comments

Seems to work pretty well for 2k! Let's see how it scales to 800k
This is still too slow :/ your solution is correct in the test case though. I guess I might just try to parallelize this...
Woah, I almost gave up on it but it took something like a minute, plus or minus. So far this is the best option I have been able to work with so thank you!!
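
On the scaling concern in the comments above: a different technique from the answer's per-group function call is to build the rows with whole-column operations and stack them afterwards. This is only a sketch under assumptions. It assumes the rule really is "one row when delta is 0, two rows otherwise", it skips delta == 100 because the original function leaves that case undefined, and the column names V1 to V4 and the intermediate object names are illustrative. It has not been timed on 800,000 rows.

```r
library(data.table)

d0 <- as.Date("2014-01-01")
gg <- data.table(id = 1:100, event_date = seq(d0, d0 + 99, by = 1))

# Day offset from d0, computed once for the whole column.
gg[, delta := as.numeric(event_date - d0)]

# Rows with delta == 0 produce a single output row.
one_row <- gg[delta == 0, .(id, V1 = 0, V2 = 100, V3 = 1, V4 = 0)]

# All other rows (excluding the undefined delta == 100 case) produce two.
two <- gg[delta != 0 & delta != 100]
first_rows  <- two[, .(id, V1 = 0,     V2 = delta, V3 = 0, V4 = 0)]
second_rows <- two[, .(id, V1 = delta, V2 = 100,   V3 = 1, V4 = 0)]

# Stack the pieces and restore the per-id ordering.
fast <- rbindlist(list(one_row, first_rows, second_rows))
setorder(fast, id, V1)
head(fast)
```

Because the function is never called once per row, the work is a handful of vectorized subsets plus one rbindlist, which is usually where data.table is fastest.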
