The source code of nnet::class.ind is:
function (cl) {
n <- length(cl)
cl <- as.factor(cl)
x <- matrix(0, n, length(levels(cl)))
x[(1L:n) + n * (unclass(cl) - 1L)] <- 1
dimnames(x) <- list(names(cl), levels(cl))
x
}
.Machine$integer.max is 2147483647. If n*(nlevels - 1L) is greater than this value that should produce your error. Solving for n:
imax <- .Machine$integer.max
nlevels <- 5000
imax/(nlevels-1L)
## [1] 429582.6
You'll encounter this problem if you have 429583 or more rows (not particularly big for a data-mining context). As commented above, you'll do much better with Matrix::sparse.model.matrix (or Matrix::fac2sparse), if your modeling framework can handle sparse matrices. Alternatively, you'll have to rewrite class.ind to avoid this bottleneck (i.e. indexing by rows and columns rather than by absolute location) [@joran comments above that R indexes large vectors via double-precision values, so you might be able to get away with just hacking that line to
x[(1:n) + n * (unclass(cl) - 1)] <- 1
possibly throwing in an explicit as.numeric() here or there to force the coercion to double ...]
Even if you were able to complete this step, you'd end up with a 5000*650000 matrix - it looks like that will be 12Gb.
print(650*object.size(matrix(1L,5000,1000)),units="Gb")
I guess if you've got 100Gb free that could be OK ...
FLN <- data.frame(class.ind(paste(1:5000, "a")))runs without problem on my old lappie.z <- factor(rep(1:5000,n)); FLN <- data.frame(nnet::class.ind(z))are either, depending onn, (1) fine; (2) obvious errors about the matrix being too large, or being out of memory; (3) crashing my R session due to too-large memory requests5000L * 650000L.