Calculate Array in a Parallel Way in Julia

Question

I've used Julia for some time, but still know little about it, especially for the paralell computing. I want to obtain a new array with the existing data, as the array is very large, I want to do it in a parallel way, and the code is written as follows:

ψ1 = Array{Complex}(undef, D)
ψ = rand(Complex{Float64},D)
Threads.@threads for k = 1:D   
    ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])  
end

I run it with julia -t 4, But it turn out to be very slow, compared to the non parallel code as follows,

ψ1 = [@views sum(ψ[GetBasis(k - 1, N)]) for k=1:D]

I have no idea about why this happens, and GetBasis() is just a function to generate a Array, Array{Int64,1}(N).

I would like to ask how I could improve the first code, or, is there some way I can modify the second code to also run it in a parallel way? As the array can be very large, and I want to find a way to speed it up...

Thanks a lot, and look forward to your replies!

The complete code can be found as follows

function GetBasis(n, N)
    xxN = collect(0:N-1)
    BI = BitFlip.(n,xxN).+1
end


function BitFlip(n, i)
    n = n ⊻ (1 << i)
end

N=24
D=2^N

ψ1 = Array{Complex}(undef, D)
ψ = rand(Complex{Float64},D)
Threads.@threads for k = 1:D   
    ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])  
end

ψ2 = [@views sum(ψ[GetBasis(k - 1, N)]) for k=1:D]

This is not a reproducible code and cannot be run. Please provide a MWE. — Przemyslaw Szufel
– Przemyslaw Szufel, Commented Nov 28, 2020 at 20:07
Sorry that I made a mistake. I've corrected it, and please find the complete code as well, thanks! — wangjz87
– wangjz87, Commented Nov 28, 2020 at 21:21

lungben · Accepted Answer · 2020-11-28 21:27:53Z

2

ψ1 = Array{Complex}(undef, D)

This is an abstract type, which is very slow. You should define the type of the complex components, e.g. Complex{Float64}.

Alternatively, use similar.

answered Nov 28, 2020 at 21:27

lungben

1,1787 silver badges7 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

wangjz87 Over a year ago

Thank you very much! Now the performance of the code are much better! I'm not sure if there are some other way to write this code, which can be more efficient, or can I write a code similar to the second one(none parallel code), but in a parallel way?

Przemyslaw Szufel · Accepted Answer · 2020-11-29 00:07:24Z

1

When all is done properly the multithreaded code is faster indeed.

function GetBasis(n, N)
    BI = BitFlip.(n,0:N-1).+1
end

function BitFlip(n, i)
    n = n ⊻ (1 << i)
end

const N=24
const D=2^N

const ψ = rand(Complex{Float64},D)
const ψ1 = Vector{Complex{Float64}}(undef, D)

using BenchmarkTools

Threads.nthreads() # should return 4 or more
                   # set JULIA_NUM_THREADS environment variable

Now testing:

julia> GC.gc()

julia> @btime ψ2 = [@views sum(ψ[GetBasis(k - 1, N)]) for k=1:D];
  5.591 s (16777218 allocations: 4.50 GiB)

julia> GC.gc()

julia> @btime Threads.@threads for k = 1:D
           @inbounds ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])
       end
  2.293 s (16777237 allocations: 4.25 GiB)

Note the amount of memory used by this code - you need to run garbage collector before running the benchamrk and the test will be less meaningful when you have less than 16GB RAM in your machine.

answered Nov 29, 2020 at 0:07

Przemyslaw Szufel

42.5k3 gold badges42 silver badges73 bronze badges

6 Comments

wangjz87 Over a year ago

Thank you very much! The performance is much better now! May I ask why should I defined ψ and ψ1 as a const? And I've tried study N=26, and in this case, the parallel code is slower than the first code again, I'm not sure why it is the case, here I have 16GB RAM in my machine.

Przemyslaw Szufel Over a year ago

For N=26 you are running into garbage collection. You need 64GB of RAM to test the performance for N=26.

Przemyslaw Szufel Over a year ago

Regarding const - if you are running benchmarks of code that is outside of a function you need to make it type stable. When you put that code inside a function you can remove consts. See docs.julialang.org/en/v1/manual/performance-tips for details.

wangjz87 Over a year ago

Then, does it mean that, in case the RAM of my machine is not enough, the parallel code may be slower than the non-parallel one? If so, is there some way to improve it? And I also want to run it by GPU, but I could not find a function similar to @threads in CUDA of Julia...

Przemyslaw Szufel Over a year ago

For GPU look at juliagpu.github.io/CUDA.jl/stable/usage/array Again you will have memory problem. "the parallel code may be slower than the non-parallel one" - it depends on how garbage collector is going to behave. You would need to bechmark but in that case those will need to be run many hundred times due to high variance caused by the garbage collector.

|

Collectives™ on Stack Overflow

Calculate Array in a Parallel Way in Julia

2 Answers 2

1 Comment

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related