
I have a big (12 million rows) data.table which looks like this:

library(data.table)
set.seed(123)
dt <- data.table(id = rep(1:3, each = 5), y = sample(letters[1:5], 15, replace = TRUE))
> dt
    id y
 1:  1 b
 2:  1 d
 3:  1 c
 4:  1 e
 5:  1 e
 6:  2 a
 7:  2 c
 8:  2 e
 9:  2 c
10:  2 c
11:  3 e
12:  3 c
13:  3 d
14:  3 c
15:  3 a

I want to create a new data.table containing my variable id (which will be the unique key of this new data.table) and 5 other binary variables each one corresponding to each category of y which take value 1 if the id has that value for y, 0 otherwise.
The output data.table should look like this:

   id a b c d e
1:  1 0 1 1 1 1
2:  2 1 0 1 0 1
3:  3 1 0 1 1 1

I tried doing this in a loop, but it's quite slow, and I also don't know how to pass the binary variable names programmatically, since they depend on the variable I'm trying to "split" (a reconstruction is sketched below).
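For reference, a reconstruction of the kind of per-label loop I mean (hypothetical, not my exact code), operating on the dt defined above:

out <- data.table(id = unique(dt$id))
for (lab in unique(dt$y)) {
    # one binary column per label of y; set() takes the column name as a
    # string, so the names can be built programmatically from the data
    set(out, j = lab, value = as.integer(out$id %in% dt[y == lab, id]))
}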

EDIT: as @mtoto pointed out, a similar question has already been asked and answered here, but the solution uses the reshape2 package.
I was wondering if there's another (faster) way to do this, maybe using the := operator in data.table, since I have a massive dataset and I work a lot with this package.

EDIT2: benchmark of the functions in @Arun's post on my data (~12 million rows, ~3.5 million distinct ids and 490 distinct labels for the y variable, resulting in 490 dummy variables):

system.time(ans1 <- AnsFunction())   # 194s
system.time(ans2 <- dcastFunction()) # 55s
system.time(ans3 <- TableFunction()) # Takes forever and blocked my PC
  • I notice there are similar rows, such as rows four and five; can you explain this data a little better? As I understand it, data[1][e] = 1 if (2 > 0) else 0, but it just seems a little weird.
  • Possible duplicate of How to use cast or another function to create a binary table in R
  • @kpie I edited the second data.table, it should be clearer now: id 1 has the distinct values b, c, d, e for y, but not a. This explains why its row in the second data.table has 1 everywhere except in the a column. @mtoto thanks for your answer, this would solve my problem, but with such massive data I was wondering if there was another way to do the same thing inside data.table, maybe with the := operator.
  • If you want to use data.table, you could go with dcast(): dcast(dt, id ~ y, fun.aggregate = function(x) (length(x) > 0) + 0)
  • You might also consider having your 1/0 in a "matrix", probably sparse, to have a chance of saving some memory: uy = unique(dt$y); m = matrix(0L, max(dt$id), length(uy), dimnames = list(NULL, uy)); m[cbind(dt$id, match(dt$y, uy))] = 1L (a sparse version is sketched below)
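A runnable sketch of the sparse variant that last comment hints at, using the Matrix package (the package choice is an assumption; the comment itself builds a dense matrix):

library(Matrix)
uy <- unique(dt$y)
# deduplicate first: sparseMatrix() sums repeated (i, j) entries
u <- unique(dt[, .(id, y)])
m <- sparseMatrix(i = u$id, j = match(u$y, uy), x = 1L,
                  dims = c(max(u$id), length(uy)),
                  dimnames = list(NULL, uy))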

3 Answers


data.table has its own dcast implementation that uses data.table's internals and should be fast. Give this a try:

dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L)
#    id a b c d e
# 1:  1 0 1 1 1 1
# 2:  2 1 0 1 0 1
# 3:  3 1 0 1 1 1

Just thought of another way to handle this by preallocating and updating by reference (perhaps dcast's logic should be done like this to avoid intermediates).

ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]

All that's left is to fill existing combinations with 1L.

dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
#    id b d c e a
# 1:  1 1 1 1 1 0
# 2:  2 0 0 1 1 1
# 3:  3 0 1 1 1 1

Okay, I've gone ahead and benchmarked on the OP's data dimensions: ~10 million rows and 10 dummy columns.

require(data.table)
set.seed(45L)
y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

system.time(ans1 <- AnsFunction())   # 2.3s
system.time(ans2 <- dcastFunction()) # 2.2s
system.time(ans3 <- TableFunction()) # 6.2s

setcolorder(ans1, names(ans2))
setcolorder(ans3, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)
setorder(ans3, id)

identical(ans1, ans2) # TRUE
identical(ans1, ans3) # TRUE

where,

AnsFunction <- function() {
    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
    ans
    # reorder columns outside
}

dcastFunction <- function() {
    # no need to load reshape2. data.table has its own dcast as well
    # no need for setDT
    df <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill = 0L, value.var = "y")
}

TableFunction <- function() {
    # need to return integer results for identical results
    # fixed 1 -> 1L; as.numeric -> as.integer
    df <- as.data.frame.matrix(table(dt$id, dt$y))
    df[df > 1L] <- 1L
    df <- cbind(id = as.integer(row.names(df)), df)
    setDT(df)
}

Comments

Your approach looks like exactly what I was looking for. I get the idea, but when I run the code of your second approach on dt it doesn't work: I get Empty data.table (0 rows) of 1 col: id
@helter, could you edit your Q to show a benchmark of the run time between the two methods posted above on your original data?
That's not an issue at all, I just couldn't do it before and I thought @Tobias' benchmark was enough. I just added the benchmark in the question.
Awesome, thanks. I plan to work on improving dcast for next release. Definitely helps in knowing how not to go about improving dcast().
I think the slowest part of TableFunction is table(dt$id, dt$y). In fact, working on this dataset I noticed that table() is extremely slow in general, maybe because I have so many ids. For this reason I tend to use data.table's .N in the j argument while grouping with by=id. Maybe changing that bit inside TableFunction would improve performance (?), but I don't see how to obtain the same output as the first line of TableFunction without table() (one possibility is sketched below).
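For what it's worth, a sketch of what that last comment asks for: replacing the table() call with data.table grouping via .N, then widening and binarizing:

counts <- dt[, .N, by = .(id, y)]   # one row per observed (id, y) pair
wide <- dcast(counts, id ~ y, value.var = "N", fill = 0L)
cols <- setdiff(names(wide), "id")
# cap the counts at 1 to get the binary table
wide[, (cols) := lapply(.SD, function(n) as.integer(n > 0L)), .SDcols = cols]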

For small datasets the table() function seems to be more efficient, but on large datasets dcast() seems to be the most efficient and convenient option.

TableFunction <- function(){
    df <- as.data.frame.matrix(table(dt$id, dt$y))
    df[df > 1] <- 1
    df <- cbind(id = as.numeric(row.names(df)), df)
    setDT(df)
}


AnsFunction <- function(){
    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    # use .GRP (the group counter) to index the rows of ans; indexing by the
    # id value itself only works when the ids happen to be 1, 2, ..., N
    dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
    ans
}

dcastFunction <- function(){
    df <- dcast.data.table(dt, id ~ y, fun.aggregate = function(x) 1L, fill = 0L, value.var = "y")
}

library(data.table)
library(microbenchmark)
set.seed(123)
N = 10000
dt <- data.table(id = rep(1:N, each = 5), y = sample(letters[1:5], N*5, replace = TRUE))


microbenchmark(
    "dcast" = dcastFunction(),
    "Table" = TableFunction(),
    "Ans"   = AnsFunction()
    )


 Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval cld
 dcast  42.48367  45.39793  47.56898  46.83755  49.33388  60.72327   100  b 
 Table  28.32704  28.74579  29.14043  29.00010  29.23320  35.16723   100 a  
   Ans 120.80609 123.95895 127.35880 126.85018 130.12491 156.53289   100   c
where test1, test2 and test3 are the tables returned by the three functions above:

> all(test1 == test2)
[1] TRUE
> all(test1 == test3)
[1] TRUE
On larger data, with the same dimensions as in @Arun's benchmark above:

y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

microbenchmark(
    "dcast" = dcastFunction(),
    "Table" = TableFunction(),
    "Ans"   = AnsFunction()
)
Unit: seconds
  expr      min       lq     mean   median       uq      max neval cld
 dcast 1.985969 2.064964 2.189764 2.216138 2.266959 2.643231   100 a  
 Table 5.022388 5.403263 5.605012 5.580228 5.830414 6.318729   100   c
   Ans 2.234636 2.414224 2.586727 2.599156 2.645717 2.982311   100  b 

Comments

I've added a benchmark on larger data to my post. I'm not sure if you're running data.table's dcast or reshape2's, since you use setDT(), which won't be necessary if you use data.table's. And reshape2::dcast is slow.
Instead of table + [<-.data.frame, an alternative is uy = unique(dt$y); m = matrix(0L, max(dt$id), length(uy), dimnames = list(NULL, uy)); m[cbind(dt$id, match(dt$y, uy))] = 1L

If you already know the range of the rows (as in, you know there are no more than 3 rows in your example) and you know the columns, you can start with an array of zeros and use the apply function to update values in that secondary table.

My R is a little rusty, but I think that should work. Additionally, the function you pass to apply could contain conditions to add rows and columns as needed; a rough sketch follows below.
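A rough sketch of that idea (hypothetical; it fills the preallocated matrix by direct indexing rather than apply(), which is simpler and avoids per-row overhead):

ids <- unique(dt$id)
labels <- unique(dt$y)
# preallocate a zero matrix with one row per id and one column per label
m <- matrix(0L, nrow = length(ids), ncol = length(labels),
            dimnames = list(ids, labels))
# flip the observed (id, label) pairs to 1
m[cbind(match(dt$id, ids), match(dt$y, labels))] <- 1L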

If you are looking for something a little more plug and play, I found this little blurb:

There are two sets of methods that are explained below:

gather() and spread() from the tidyr package. This is a newer interface to the reshape2 package.

melt() and dcast() from the reshape2 package.

There are a number of other methods which aren’t covered here, since they are not as easy to use:

The reshape() function, which is confusingly not part of the reshape2 package; it is part of the base install of R.

stack() and unstack()

From: http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
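For completeness, a sketch of the tidyr route the blurb mentions (spread() was the interface at the time; pivot_wider() has since superseded it):

library(tidyr)
# one row per distinct (id, y) pair, flagged 1L, then spread wide with 0 fill
u <- unique(dt[, .(id, y)])[, flag := 1L]
spread(u, key = y, value = flag, fill = 0L)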

If I were better versed in R, I would tell you how those various methods handle collisions when going from long to wide format. I was googling "make a table from flat data in R" to come up with this.

Also check out this; it's the same website as above, with my personal comment wrapper. :p

