
I would like to calculate the summation of elements from a large upper triangular matrix. The regular Julia code is below.

function upsum(M)
    n = size(M, 1)
    s = zero(eltype(M))  # avoid shadowing Base.sum
    for i in 1:n-1
        for j in i+1:n
            s += M[i, j]
        end
    end
    return s
end

R = randn(10000,10000)
upsum(R)

Since the matrix is very large, I would like to know whether there is any way to improve the speed. How can I use parallel computing here?


1 Answer


I would use threads rather than distributed (multi-process) parallelism in this case. Here is example code:

using Base.Threads

function upsum_threads(M)
    n = size(M, 1)
    chunks = nthreads()
    sums = zeros(eltype(M), chunks)
    # Column i contributes i-1 additions, so split the columns at points
    # proportional to sqrt so each chunk gets roughly equal work.
    chunkend = round.(Int, n * sqrt.((1:chunks) ./ chunks))
    @assert chunks == 1 || minimum(diff(chunkend)) > 0  # guard the 1-thread case
    chunkstart = [2; chunkend[1:end-1] .+ 1]
    @threads for job in 1:chunks
        s = zero(eltype(M))
        for i in chunkstart[job]:chunkend[job]
            # walk down column i: contiguous memory (Julia is column-major)
            @simd for j in 1:(i-1)
                @inbounds s += M[j, i]
            end
        end
        sums[job] = s
    end
    return sum(sums)
end

R = randn(10000,10000)
upsum_threads(R)

It should give you a significant speedup (even if you remove @threads it should be much faster, because the inner loop now traverses columns, matching Julia's column-major memory layout).
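The single-threaded part of that speedup comes from memory layout alone: the rewritten inner loop (row index `j` varying inside column `i`) walks contiguous memory, while the original `M[i, j]` order strides across columns. A minimal single-threaded sketch isolating just that change (the function name `upsum_columnwise` is mine):

```julia
# Sketch: same strictly-upper-triangular sum as the answer, but without
# chunking or threads, to show the benefit of column-major traversal alone.
function upsum_columnwise(M)
    n = size(M, 1)
    s = zero(eltype(M))
    for i in 2:n                  # column index
        @simd for j in 1:(i-1)    # row index: contiguous within column i
            @inbounds s += M[j, i]
        end
    end
    return s
end

R = randn(1000, 1000)
upsum_columnwise(R)
```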

You choose the number of threads Julia uses by setting the JULIA_NUM_THREADS environment variable before starting Julia.
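A running session cannot change its thread count, so it is worth verifying what the session actually got after launch (a small sketch; the example launch command in the comment is illustrative):

```julia
# Sketch: check how many threads this session was started with,
# e.g. after launching with `JULIA_NUM_THREADS=4 julia`.
using Base.Threads

println("running with $(nthreads()) thread(s)")
```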


4 Comments

I really appreciate your help! I have another question: what if I also want to calculate the summation of the squares of the elements of the large upper triangular matrix at the same time? I tried `@inbounds s += M[j, i]; @inbounds sq += M[j, i]^2`, and it works. Is there a better way to do that?
It is probably better to have two inner loops than two operations in one inner loop, as I think @simd should work then (but you have to benchmark whether there is any significant difference between the two designs).
Thanks for your reply! Do you mean it's better to have two `@threads for job in 1:chunks` loops?
I meant the inner loop `@simd for j in 1:(i-1)`, as then M[j,i] should still be in the CPU cache (but I have not benchmarked it, so duplicating the outermost loop might not be significantly slower).
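The two-inner-loop design discussed in these comments can be sketched as follows (single-threaded for brevity, and the name `upsum_and_squares` is mine; the chunking from the answer applies unchanged):

```julia
# Sketch: sum and sum of squares over the strictly upper triangle, using
# two @simd passes over the same column so M[j, i] is still in cache on
# the second pass.
function upsum_and_squares(M)
    n = size(M, 1)
    s = zero(eltype(M))
    sq = zero(eltype(M))
    for i in 2:n
        @simd for j in 1:(i-1)
            @inbounds s += M[j, i]
        end
        @simd for j in 1:(i-1)
            @inbounds sq += M[j, i]^2
        end
    end
    return s, sq
end
```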
