
I'm trying to convert a "big" factor into a set of indicator (i.e. dummy, binary, flag) variables in R as such:

FLN <- data.frame(nnet::class.ind(FinelineNumber))

where FinelineNumber is a 5,000-level factor from Kaggle.com's current Walmart contest (the data is public if you'd like to reproduce this error).

I keep getting this concerning-looking warning:

In n * (unclass(cl) - 1L) : NAs produced by integer overflow

Memory available to the system is essentially unlimited. I'm not sure what the problem is.
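For what it's worth, the overflow itself is easy to reproduce directly with the dimensions involved (the row count below is approximate; 5,000 levels as described above):

```r
# Sketch: the index arithmetic inside class.ind, with this data's dimensions
n <- 650000L       # approximate number of rows
nlevels <- 5000L   # number of factor levels
n * (nlevels - 1L) # exceeds .Machine$integer.max, so the result is NA
# Warning message: NAs produced by integer overflow
```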

  • How many rows does your data have... FLN <- data.frame(class.ind(paste(1:5000, "a"))) runs without problem on my old lappie. Commented Dec 18, 2015 at 20:10
  • perhaps stat.ethz.ch/R-manual/R-devel/library/Matrix/html/… is useful Commented Dec 18, 2015 at 20:13
  • I was going to agree with @user20650. It's going to be hard for people on limited-memory systems to reproduce this. On my laptop the results of z <- factor(rep(1:5000,n)); FLN <- data.frame(nnet::class.ind(z)) are either, depending on n, (1) fine; (2) obvious errors about the matrix being too large, or being out of memory; (3) crashing my R session due to too-large memory requests Commented Dec 18, 2015 at 20:15
  • @user20650 it has about 650,000 rows. It's running on a server with 36 cores and 100GB free RAM. I will give the sparse matrix function a try; thanks Commented Dec 18, 2015 at 20:23
  • Easily, because you're indexing the matrix, so it involves multiplying 5000L * 650000L. Commented Dec 18, 2015 at 20:25

1 Answer


The source code of nnet::class.ind is:

function (cl) {
    n <- length(cl)
    cl <- as.factor(cl)
    x <- matrix(0, n, length(levels(cl)))
    x[(1L:n) + n * (unclass(cl) - 1L)] <- 1
    dimnames(x) <- list(names(cl), levels(cl))
    x
}
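On a small factor, the absolute-location indexing in that fourth line works fine; for example:

```r
library(nnet)
f <- factor(c("a", "b", "a", "c"))
# Each row gets a single 1 in the column matching its factor level;
# x[(1:n) + n * (unclass(cl) - 1L)] computes those positions as
# offsets into the underlying column-major vector of the matrix.
class.ind(f)
#      a b c
# [1,] 1 0 0
# [2,] 0 1 0
# [3,] 1 0 0
# [4,] 0 0 1
```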

.Machine$integer.max is 2147483647. If n*(nlevels - 1L) exceeds this value, the index computation overflows and produces the warning you're seeing. Solving for n:

imax <- .Machine$integer.max
nlevels <- 5000
imax/(nlevels-1L)
## [1] 429582.6

You'll encounter this problem if you have 429,583 or more rows (not particularly big for a data-mining context). As commented above, you'll do much better with Matrix::sparse.model.matrix (or Matrix::fac2sparse) if your modeling framework can handle sparse matrices. Alternatively, you'll have to rewrite class.ind to avoid this bottleneck (i.e., indexing by rows and columns rather than by absolute location). [@joran comments above that R indexes large vectors via double-precision values, so you might be able to get away with just hacking that line to

x[(1:n) + n * (unclass(cl) - 1)] <- 1

possibly throwing in an explicit as.numeric() here or there to force the coercion to double ...]
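A sparse sketch with the Matrix package, shown on a small factor for illustration (the same calls apply unchanged to the full 5,000-level factor):

```r
library(Matrix)
f <- factor(c("a", "b", "a", "c"))

# fac2sparse returns a levels-by-observations sparse matrix;
# transpose it to match class.ind's observations-by-levels layout
FLN_sparse <- t(fac2sparse(f))
dim(FLN_sparse)  # 4 rows, 3 columns

# or, equivalently, build a sparse model matrix with no intercept,
# which gives one indicator column per level
FLN_sparse2 <- sparse.model.matrix(~ f - 1)
```

Either way the result stores only the nonzero entries (one per row), so memory use grows with the number of rows rather than rows × levels.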

Even if you were able to complete this step, you'd end up with a 650,000 × 5,000 matrix. class.ind fills a double-precision matrix (matrix(0, ...)), so it looks like that will be about 24Gb:

 print(650*object.size(matrix(0,5000,1000)),units="Gb")

I guess if you've got 100Gb free that could be OK ...


1 Comment

Thanks much; good answer. I thought @user20650 was referring to the fac2sparse function in Matrix so I tried that instead of sparse.model.matrix and it also worked very well.
