2

I've used Julia for some time, but still know little about it, especially for the paralell computing. I want to obtain a new array with the existing data, as the array is very large, I want to do it in a parallel way, and the code is written as follows:

ψ1 = Array{Complex}(undef, D)
ψ = rand(Complex{Float64},D)
Threads.@threads for k = 1:D   
    ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])  
end

I run it with julia -t 4, But it turn out to be very slow, compared to the non parallel code as follows,

ψ1 = [@views sum(ψ[GetBasis(k - 1, N)]) for k=1:D]

I have no idea about why this happens, and GetBasis() is just a function to generate a Array, Array{Int64,1}(N).

I would like to ask how I could improve the first code, or, is there some way I can modify the second code to also run it in a parallel way? As the array can be very large, and I want to find a way to speed it up...

Thanks a lot, and look forward to your replies!

The complete code can be found as follows

function GetBasis(n, N)
    xxN = collect(0:N-1)
    BI = BitFlip.(n,xxN).+1
end


function BitFlip(n, i)
    n = n ⊻ (1 << i)
end

N=24
D=2^N

ψ1 = Array{Complex}(undef, D)
ψ = rand(Complex{Float64},D)
Threads.@threads for k = 1:D   
    ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])  
end

ψ2 = [@views sum(ψ[GetBasis(k - 1, N)]) for k=1:D]

2
  • This is not a reproducible code and cannot be run. Please provide a MWE. Commented Nov 28, 2020 at 20:07
  • Sorry that I made a mistake. I've corrected it, and please find the complete code as well, thanks! Commented Nov 28, 2020 at 21:21

2 Answers 2

2
ψ1 = Array{Complex}(undef, D)

This is an abstract type, which is very slow. You should define the type of the complex components, e.g. Complex{Float64}.

Alternatively, use similar.

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you very much! Now the performance of the code are much better! I'm not sure if there are some other way to write this code, which can be more efficient, or can I write a code similar to the second one(none parallel code), but in a parallel way?
1

When all is done properly the multithreaded code is faster indeed.

function GetBasis(n, N)
    BI = BitFlip.(n,0:N-1).+1
end

function BitFlip(n, i)
    n = n ⊻ (1 << i)
end

const N=24
const D=2^N

const ψ = rand(Complex{Float64},D)
const ψ1 = Vector{Complex{Float64}}(undef, D)

using BenchmarkTools

Threads.nthreads() # should return 4 or more
                   # set JULIA_NUM_THREADS environment variable

Now testing:

julia> GC.gc()

julia> @btime ψ2 = [@views sum(ψ[GetBasis(k - 1, N)]) for k=1:D];
  5.591 s (16777218 allocations: 4.50 GiB)

julia> GC.gc()

julia> @btime Threads.@threads for k = 1:D
           @inbounds ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])
       end
  2.293 s (16777237 allocations: 4.25 GiB)

Note the amount of memory used by this code - you need to run garbage collector before running the benchamrk and the test will be less meaningful when you have less than 16GB RAM in your machine.

6 Comments

Thank you very much! The performance is much better now! May I ask why should I defined ψ and ψ1 as a const? And I've tried study N=26, and in this case, the parallel code is slower than the first code again, I'm not sure why it is the case, here I have 16GB RAM in my machine.
For N=26 you are running into garbage collection. You need 64GB of RAM to test the performance for N=26.
Regarding const - if you are running benchmarks of code that is outside of a function you need to make it type stable. When you put that code inside a function you can remove consts. See docs.julialang.org/en/v1/manual/performance-tips for details.
Then, does it mean that, in case the RAM of my machine is not enough, the parallel code may be slower than the non-parallel one? If so, is there some way to improve it? And I also want to run it by GPU, but I could not find a function similar to @threads in CUDA of Julia...
For GPU look at juliagpu.github.io/CUDA.jl/stable/usage/array Again you will have memory problem. "the parallel code may be slower than the non-parallel one" - it depends on how garbage collector is going to behave. You would need to bechmark but in that case those will need to be run many hundred times due to high variance caused by the garbage collector.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.