
I'm trying to convert a "big" factor into a set of indicator (i.e. dummy, binary, flag) variables in R as such:

FLN <- data.frame(nnet::class.ind(FinelineNumber))

where FinelineNumber is a 5,000-level factor from Kaggle.com's current Walmart contest (the data is public if you'd like to reproduce this error).

I keep getting this concerning-looking warning:

In n * (unclass(cl) - 1L) : NAs produced by integer overflow

Memory available to the system is essentially unlimited. I'm not sure what the problem is.
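For what it's worth, the overflow itself is easy to reproduce directly with the dimensions involved (the row count below is approximate; 5,000 levels as described above):

```r
# Sketch: the index arithmetic inside class.ind, with this data's dimensions
n <- 650000L       # approximate number of rows
nlevels <- 5000L   # number of factor levels
n * (nlevels - 1L) # exceeds .Machine$integer.max, so the result is NA
# Warning message: NAs produced by integer overflow
```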

  • How many rows does your data have... FLN <- data.frame(class.ind(paste(1:5000, "a"))) runs without problem on my old lappie. Commented Dec 18, 2015 at 20:10
  • perhaps stat.ethz.ch/R-manual/R-devel/library/Matrix/html/… is useful Commented Dec 18, 2015 at 20:13
  • I was going to agree with @user20650. It's going to be hard for people on limited-memory systems to reproduce this. On my laptop the results of z <- factor(rep(1:5000,n)); FLN <- data.frame(nnet::class.ind(z)) are either, depending on n, (1) fine; (2) obvious errors about the matrix being too large, or being out of memory; (3) crashing my R session due to too-large memory requests Commented Dec 18, 2015 at 20:15
  • @user20650 it has about 650,000 rows. It's running on a server with 36 cores and 100GB free RAM. I will give the sparse matrix function a try; thanks Commented Dec 18, 2015 at 20:23
  • Easily, because you're indexing the matrix, so it involves multiplying 5000L * 650000L. Commented Dec 18, 2015 at 20:25

1 Answer


The source code of nnet::class.ind is:

function (cl) {
    n <- length(cl)
    cl <- as.factor(cl)
    x <- matrix(0, n, length(levels(cl)))
    x[(1L:n) + n * (unclass(cl) - 1L)] <- 1
    dimnames(x) <- list(names(cl), levels(cl))
    x
}
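On a small factor, the absolute-location indexing in that fourth line works fine; for example:

```r
library(nnet)
f <- factor(c("a", "b", "a", "c"))
# Each row gets a single 1 in the column matching its factor level;
# x[(1:n) + n * (unclass(cl) - 1L)] computes those positions as
# offsets into the underlying column-major vector of the matrix.
class.ind(f)
#      a b c
# [1,] 1 0 0
# [2,] 0 1 0
# [3,] 1 0 0
# [4,] 0 0 1
```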

.Machine$integer.max is 2147483647. If n*(nlevels - 1L) exceeds this value, the index computation overflows and produces the warning you're seeing. Solving for n:

imax <- .Machine$integer.max
nlevels <- 5000
imax/(nlevels-1L)
## [1] 429582.6

You'll encounter this problem if you have 429,583 or more rows (not particularly big for a data-mining context). As commented above, you'll do much better with Matrix::sparse.model.matrix (or Matrix::fac2sparse) if your modeling framework can handle sparse matrices. Alternatively, you'll have to rewrite class.ind to avoid this bottleneck (i.e., indexing by rows and columns rather than by absolute location). [@joran comments above that R indexes large vectors via double-precision values, so you might be able to get away with just hacking that line to

x[(1:n) + n * (unclass(cl) - 1)] <- 1

possibly throwing in an explicit as.numeric() here or there to force the coercion to double ...]
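A sparse sketch with the Matrix package, shown on a small factor for illustration (the same calls apply unchanged to the full 5,000-level factor):

```r
library(Matrix)
f <- factor(c("a", "b", "a", "c"))

# fac2sparse returns a levels-by-observations sparse matrix;
# transpose it to match class.ind's observations-by-levels layout
FLN_sparse <- t(fac2sparse(f))
dim(FLN_sparse)  # 4 rows, 3 columns

# or, equivalently, build a sparse model matrix with no intercept,
# which gives one indicator column per level
FLN_sparse2 <- sparse.model.matrix(~ f - 1)
```

Either way the result stores only the nonzero entries (one per row), so memory use grows with the number of rows rather than rows × levels.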

Even if you were able to complete this step, you'd end up with a 650,000 × 5,000 matrix. class.ind fills a double-precision matrix (matrix(0, ...)), so it looks like that will be about 24Gb:

 print(650*object.size(matrix(0,5000,1000)),units="Gb")

I guess if you've got 100Gb free that could be OK ...


1 Comment

Thanks much; good answer. I thought @user20650 was referring to the fac2sparse function in Matrix so I tried that instead of sparse.model.matrix and it also worked very well.
