
I would like to calculate the summation of elements from a large upper triangular matrix. The regular Julia code is below.

function upsum(M)
    n = size(M, 1)
    s = zero(eltype(M))  # avoid shadowing Base.sum
    for i in 1:n-1
        for j in i+1:n
            s += M[i, j]
        end
    end
    return s
end

R = randn(10000,10000)
upsum(R)

Since the matrix is very large, I would like to know whether there is any way to improve the speed. How can I use parallel computing here?


1 Answer


I would use threads rather than distributed (multi-process) parallelism in this case. Here is example code:

using Base.Threads

function upsum_threads(M)
    n = size(M, 1)
    chunks = nthreads()
    sums = zeros(eltype(M), chunks)
    # Column i contributes i-1 additions, so split the columns at points
    # proportional to sqrt so each chunk gets roughly equal work.
    chunkend = round.(Int, n * sqrt.((1:chunks) ./ chunks))
    @assert chunks == 1 || minimum(diff(chunkend)) > 0  # guard the 1-thread case
    chunkstart = [2; chunkend[1:end-1] .+ 1]
    @threads for job in 1:chunks
        s = zero(eltype(M))
        for i in chunkstart[job]:chunkend[job]
            # walk down column i: contiguous memory (Julia is column-major)
            @simd for j in 1:(i-1)
                @inbounds s += M[j, i]
            end
        end
        sums[job] = s
    end
    return sum(sums)
end

R = randn(10000,10000)
upsum_threads(R)

It should give you a significant speedup (even if you remove @threads it should be much faster, because the inner loop now traverses columns, matching Julia's column-major memory layout).
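The single-threaded part of that speedup comes from memory layout alone: the rewritten inner loop (row index `j` varying inside column `i`) walks contiguous memory, while the original `M[i, j]` order strides across columns. A minimal single-threaded sketch isolating just that change (the function name `upsum_columnwise` is mine):

```julia
# Sketch: same strictly-upper-triangular sum as the answer, but without
# chunking or threads, to show the benefit of column-major traversal alone.
function upsum_columnwise(M)
    n = size(M, 1)
    s = zero(eltype(M))
    for i in 2:n                  # column index
        @simd for j in 1:(i-1)    # row index: contiguous within column i
            @inbounds s += M[j, i]
        end
    end
    return s
end

R = randn(1000, 1000)
upsum_columnwise(R)
```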

You choose the number of threads Julia uses by setting the JULIA_NUM_THREADS environment variable before starting Julia.
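A running session cannot change its thread count, so it is worth verifying what the session actually got after launch (a small sketch; the example launch command in the comment is illustrative):

```julia
# Sketch: check how many threads this session was started with,
# e.g. after launching with `JULIA_NUM_THREADS=4 julia`.
using Base.Threads

println("running with $(nthreads()) thread(s)")
```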


4 Comments

I really appreciate your help! I have another question: what if I also want to calculate the summation of the squares of the elements of the large upper triangular matrix at the same time? I tried `@inbounds s += M[j, i]; @inbounds sq += M[j, i]^2`, and it works. Is there a better way to do that?
It is probably better to have two inner loops than two operations in one inner loop, as I think @simd should work then (but you have to benchmark whether there is any significant difference between the two designs).
Thanks for your reply! Do you mean it's better to have two `@threads for job in 1:chunks` loops?
I meant the inner loop `@simd for j in 1:(i-1)`, as then M[j,i] should still be in the CPU cache (but I have not benchmarked it, so duplicating the outermost loop might not be significantly slower).
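The two-inner-loop design discussed in these comments can be sketched as follows (single-threaded for brevity, and the name `upsum_and_squares` is mine; the chunking from the answer applies unchanged):

```julia
# Sketch: sum and sum of squares over the strictly upper triangle, using
# two @simd passes over the same column so M[j, i] is still in cache on
# the second pass.
function upsum_and_squares(M)
    n = size(M, 1)
    s = zero(eltype(M))
    sq = zero(eltype(M))
    for i in 2:n
        @simd for j in 1:(i-1)
            @inbounds s += M[j, i]
        end
        @simd for j in 1:(i-1)
            @inbounds sq += M[j, i]^2
        end
    end
    return s, sq
end
```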
