Parallelize an R Script

Question

The problem with my R script is that it takes too much time and the main solution that I consider is to parallelize it. I don't know where to start.

My code look like this:

n<- nrow (aa) 
output <- matrix (0, n, n)

akl<- function (dii){
        ddi<- as.matrix (dii)
        m<- rowMeans(ddi)
        M<- mean(ddi)
        r<- sweep (ddi, 1, m)
        b<- sweep (r, 2, m)
        return (b + M)  
        }
for (i in 1:n)
{
A<- akl(dist(aa[i,]))

dVarX <- sqrt(mean (A * A))

for (j in i:n)
{
    B<- akl(dist(aa[j,]))
        V <- sqrt (dVarX * (sqrt(mean(B * B))))

        output[i,j] <- (sqrt(mean(A * B))) / V        
}
}

I would like to parallelize on different cpus. How can I do that? I saw the SNOW package, is it suitable for my purpose? Thank you for suggestions, Gab

the input is a matrix and my code compute something on each row (akl) and then take each rowpair to compute something different (pratically a correlation score). The output is another matrix. — Gabelins
– Gabelins, Commented Jan 14, 2013 at 10:18
By 'working example' ECII means something like: wrap this code in a function that takes a matrix as argument, then create a matrix, then run the function to show us the expected answer. Call the function fooSlowVersion if you want, and then we can try and write fooParallelVersion and compare easily. — Spacedman
– Spacedman, Commented Jan 14, 2013 at 10:33
@Gabelins: I didn't say calculate akl(whatever) once. I said calculate dist(aa) once. And you really should construct an example. — IRTFM
– IRTFM, Commented Jan 14, 2013 at 10:35

Arun · Accepted Answer · 2013-01-14 11:52:10Z

5

There are two ways in which your code could be made to run faster that I could think of:

First: As @Dwin was saying (with a small twist), you could precompute akl (yes, not necesarily dist, but the whole of akl).

# a random square matrix
aa <- matrix(runif(100), ncol=10)
n <- nrow(aa)
output <- matrix (0, n, n)

akl <- function(dii) {
    ddi <- as.matrix(dii)
    m   <- rowMeans(ddi)
    M   <- mean(m) # mean(ddi) == mean(m)
    r   <- sweep(ddi, 1, m)
    b   <- sweep(r, 2, m)
    return(b + M)
}

# precompute akl here
require(plyr)
akl.list <- llply(1:nrow(aa), function(i) {
    akl(dist(aa[i, ]))
})

# Now, apply your function, but index the list instead of computing everytime
for (i in 1:n) {
    A     <- akl.list[[i]]
    dVarX <- sqrt(mean(A * A))

    for (j in i:n) {
        B <- akl.list[[j]]
        V <- sqrt (dVarX * (sqrt(mean(B * B))))
        output[i,j] <- (sqrt(mean(A * B))) / V        
    }
}

This should already get your code to run faster than before (as you compute akl everytime in the inner loop) on larger matrices.

Second: In addition to that, you can get it faster by parallelising as follows:

# now, the parallelisation you require can be achieved as follows
# with the help of `plyr` and `doMC`.

# First step of parallelisation is to compute akl in parallel
require(plyr)
require(doMC)
registerDoMC(10) # 10 Cores/CPUs
    akl.list <- llply(1:nrow(aa), function(i) {
    akl(dist(aa[i, ]))
}, .parallel = TRUE)

# then, you could write your for-loop using plyr again as follows
output <- laply(1:n, function(i) {
    A     <- akl.list[[i]]
    dVarX <- sqrt(mean(A * A))

    t <- laply(i:n, function(j) {
        B <- akl.list[[j]]
        V <- sqrt(dVarX * (sqrt(mean(B*B))))
        sqrt(mean(A * B))/V
    })
    c(rep(0, n-length(t)), t)
}, .parallel = TRUE)

Note that I have added .parallel = TRUE only on the outer loop. This is because, you assign 10 processors to the outer loop. Now, if you add it to both outer and inner loops, then the total number of processers will be 10 * 10 = 100. Please take care of this.

edited Jan 14, 2013 at 11:52

answered Jan 14, 2013 at 11:03

Arun

119k28 gold badges290 silver badges396 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Gabelins Over a year ago

tryed on a large matrix...the first solution is very nice! Thank you so much!

Gabelins Over a year ago

While using the first solution on a 2500 x 700 matrix I get a problem with memory size. Any help for solving?

Gabelins Over a year ago

...one more question: with the first solution, how can I apply a plyr function to avoid the loop in calculating sqrt(mean(A*B))?

Gabelins Over a year ago

I tryed the second solution that you advice...but I get this error Warning message: In setup_parallel() : No parallel backend registered) Is it because I'm not using the doMC package? ..i'm on a Windows machine...

Gabelins Over a year ago

:( seems like that (and the code is a bit too much slow for my work....argh!) Thank you anyway, you were great!

|

Collectives™ on Stack Overflow

Parallelize an R Script

1 Answer 1

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related