
I am trying to optimize this nested for loop, which takes the min of two numbers and then adds the result to a data frame. I was able to cut the run time down significantly elsewhere by vectorizing and preallocating, but I'm not sure how to apply that logic to a nested for loop. Is there a quick way to make this run faster? I'm currently sitting on over 5 hours of run time.

"simulation" has 100k values, and "limits" has 5427 values.

output <- data.frame(matrix(nrow = nrow(simulation), ncol = nrow(limits)))
res <- numeric(nrow(simulation))  # preallocate; min() returns a numeric value

for(i in 1:nrow(limits)){
    for(j in 1:nrow(simulation)){
        res[j] <- min(limits[i,1],simulation[j,1])
    }
    output[,i] <- res
}

Edit:

dput(head(simulation))
    structure(list(simulation = c(124786.7479,269057.2118,80432.47896,119513.0161,660840.5843,190983.7893)), .Names = "simulation", row.names = c(NA,6L), class = "data.frame")

dput(head(limits))
    structure(list(limits = c(5000L,10000L,15000L,20000L,25000L,30000L)), .Names = "limits", row.names = c(NA, 6L), class = "data.frame")
  • Take a look at the apply family; I think lapply would work in your situation. It can effectively replace a for loop and tends to run faster (or so I've found, and read of others finding). Also, can we get a dput(head(simulation)) and dput(head(limits)) so we can see the structure of the data? If you're fully vectorized, sapply may get the job done (I'm not great with it, though). Commented Sep 27, 2017 at 22:20
  • You're doing 542 million calculations. What on earth are you going to do with the resulting output matrix? Commented Sep 27, 2017 at 22:39
  • @thelatemail calculating limited variance/std. dev. for a complicated distribution; there's no good formula to just calculate theoretical values, so we are using a simulation Commented Sep 27, 2017 at 22:43

2 Answers


If you have >15GB of RAM (~100K * 5500 * 8 bytes per number * 3: the result plus the two full-size input matrices that outer expands) you can try:

outer(simulation[[1]], limits[[1]], pmin)

Although in reality you'll probably need more than 15GB, because I think pmin will duplicate things even further. If you don't have the RAM, you'll have to break the problem up (e.g. rely on code that does a column at a time or some such).
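The column-at-a-time idea can be sketched like this, using the sample data from the question's dput() output. With vapply, only one pmin column is alive at a time (about 800 KB for 100k rows), instead of the three full-size copies that outer() materializes at once:

```r
# Sample data from the question's dput() output
simulation <- data.frame(simulation = c(124786.7479, 269057.2118, 80432.47896,
                                        119513.0161, 660840.5843, 190983.7893))
limits <- data.frame(limits = c(5000L, 10000L, 15000L, 20000L, 25000L, 30000L))

# One column per limit: column i is pmin(simulation, limits[i]).
# Only a single column-sized temporary exists at any moment.
out <- vapply(limits[[1]],
              function(l) pmin(simulation[[1]], l),
              numeric(nrow(simulation)))
dim(out)  # 6 x 6 here; 100000 x 5427 for the real data
```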




When you have a double loop like this, it is often useful to rewrite it in Rcpp.

Moreover, I will use the bigstatsr package to save you some RAM: it lets you create and access matrices that are stored on your disk.

So, you can do:

simulation <- structure(list(simulation = c(124786.7479,269057.2118,80432.47896,119513.0161,660840.5843,190983.7893)), .Names = "simulation", row.names = c(NA,6L), class = "data.frame")
limits <- structure(list(limits = c(5000L,10000L,15000L, 20000L,25000L,30000L)), .Names = "limits", row.names = c(NA, 6L), class = "data.frame")

library(bigstatsr)
# Create the filebacked matrix on disk (in `/tmp/` by default)
mat <- FBM(nrow(simulation), nrow(limits))
# Fill this matrix in Rcpp
Rcpp::sourceCpp('fill-FBM.cpp')
fillMat(mat, limits[[1]], simulation[[1]])  
# Access the whole matrix in RAM to verify
# or you could access only block of columns
mat[]
mat[, 1:3]

where 'fill-FBM.cpp' is

// [[Rcpp::depends(bigstatsr, BH)]]
#include <bigstatsr/BMAcc.h>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void fillMat(Environment BM,
             const NumericVector& limits,
             const NumericVector& simulation) {

  // Get an accessor to the file-backed matrix passed from R
  XPtr<FBM> xpBM = BM["address"];
  BMAcc<double> macc(xpBM);

  int n = macc.nrow();  // length(simulation)
  int m = macc.ncol();  // length(limits)

  // Fill column by column: entry (j, i) = min(limits[i], simulation[j])
  for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
      macc(j, i) = std::min(limits[i], simulation[j]);
}
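Entry (j, i) of the filled matrix is min(simulation[j], limits[i]), which is exactly what base R's outer(simulation, limits, pmin) from the other answer computes. On a small sample you can build that reference result in plain R (a sketch; no bigstatsr or compilation needed) and compare it against a block of the FBM:

```r
# First three values of each input from the question's data
simulation <- c(124786.7479, 269057.2118, 80432.47896)
limits <- c(5000, 10000, 15000)

# Reference result: ref[j, i] = min(simulation[j], limits[i]),
# the same quantity fillMat() writes into the FBM.
ref <- outer(simulation, limits, pmin)
ref[1, 1]  # 5000: the limit caps the simulated value
# mat[1:3, 1:3] from the FBM above should match `ref`
```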

