I have a big (12 million rows) data.table which looks like this:
library(data.table)
set.seed(123)
dt <- data.table(id=rep(1:3, each=5),y=sample(letters[1:5],15,replace = T))
> dt
id y
1: 1 b
2: 1 d
3: 1 c
4: 1 e
5: 1 e
6: 2 a
7: 2 c
8: 2 e
9: 2 c
10: 2 c
11: 3 e
12: 3 c
13: 3 d
14: 3 c
15: 3 a
I want to create a new data.table containing my variable id (which will be the unique key of this new data.table) and 5 other binary variables each one corresponding to each category of y which take value 1 if the id has that value for y, 0 otherwise.
The output data.table should look like this:
id a b c d e
1: 1 0 1 1 1 1
2: 2 1 0 1 0 1
3: 3 1 0 1 1 1
I tried doing this in a loop but it's quite slow and also I don't know how to pass the binary variable names programmatically, as they depend on the variable I'm trying to "split".
EDIT: as @mtoto pointed out, a similar question has already been asked and answered here, but the solution is using the reshape2 package.
I was wondering if there's another (faster) way to do so by maybe using the := operator in data.table, as I have a massive dataset and I'm working quite a lot with this package.
EDIT2: benchmark of the functions in @Arun's post on my data (~12 million rows, ~3,5 million different ids and 490 different labels for the y variable (resulting in 490 dummy variables)):
system.time(ans1 <- AnsFunction()) # 194s
system.time(ans2 <- dcastFunction()) # 55s
system.time(ans3 <- TableFunction()) # Takes forever and blocked my PC
data[1][e]=1 if(2>0) else 0but it just seems a little weird.data.table, it should be clearer now: theidn.1 has the distinc valuesb,c,d,efory, but nota. This explains why his row on the seconddata.tablehas1everywhere except for theacolumn. @mtoto thanks for your answer, this would solve my provlem, but with such massive data I was wondering if there was another way to do the same thing but insidedata.table, maybe with the:=operator.data.table, you could go withdcast():dcast(dt, id ~ y,fun.aggregate = function(x) (length(x) > 0)+0)uy = unique(dt$y); m = matrix(0L, max(dt$id), length(uy), dimnames = list(NULL, uy)); m[cbind(dt$id, match(dt$y, uy))] = 1L