Recent OpenMP standards introduced task multi-dependencies. This feature was originally proposed in OmpSs if I'm not mistaken. I found an example of applying multi-dependencies to the assembly step of a finite element method.
For educational purposes I'm trying to replicate the assembly example in 1-d using dense storage for simplicity:
integer, parameter :: N = 10
real(dp), parameter :: L = 1.0_dp
real(dp), dimension(N+1) :: nodes
real(dp), dimension(2,2) :: Ke, Me, A_local
real(dp), dimension(N+1,N+1) :: A
real(dp) :: h
h = L / real(N, dp)
Ke = D / h * reshape([1.0_dp, -1.0_dp, -1.0_dp, 1.0_dp], [2,2])
Me = sigma * h / 6.0_dp * reshape([2.0_dp, 1.0_dp, 1.0_dp, 2.0_dp], [2,2])
A = 0.0_dp
! Loop over elements
do i = 1, N
A_local = Ke + Me
A(i:i+1, i:i+1) = A(i:i+1, i:i+1) + A_local
end do
The conventional approach to parallelise this might use atomic updates or a critical section to avoid data races, e.g.
! Loop over elements
!$omp parallel do default(private) shared(A,N,Ke,Me)
do i = 1, N
A_local = Ke + Me
!$omp critical
A(i:i+1, i:i+1) = A(i:i+1, i:i+1) + A_local
!$omp end critical
end do
I would now like to try achieve the same thing using multi-dependencies instead of critical. I started with the following task loop:
!$omp parallel default(shared) private(A_local)
!$omp single
do i = 1, N
!$omp task depend(mutexinoutset: A(i:i+1,i:i+1)) firstprivate(i)
A_local = Ke + Me
A(i:i+1, i:i+1) = A(i:i+1, i:i+1) + A_local
!$omp end task
end do
!$omp end single
!$omp end parallel
This does not work. As explained here, the storage locations in the depend clause are not allowed to be disjoint.
My next idea was to use an iterator,
!$omp task depend(iterator(it=i:i+1), mutexinoutset: A(it,it)) firstprivate(i)
which captures dependencies on the diagonal matrix elements. This happens to be correct in case of 1-d finite elements, but I don't think would be valid in 2-d or 3-d, where the connectivity pattern is more complex.
My third idea was to use a trivial index list to enforce the dependencies:
integer, dimension(N+1) :: adj
do i = 1, N
adj(i) = i
end do
combined with the clause:
!$omp task depend(iterator(it=i:i+1), mutexinoutset: adj(it)) firstprivate(i)
Again, this happens to work because the neighbour dependencies are simple (each finite element) depends on the neighbouring two nodes. In practice this solution appears to work as expected
The last idea I had was to use a full-blown neighbour/adjacency list, i.e.
integer, dimension(2,N) :: nel
do i = 1, n
nel(:,i) = [i,i+1]
end do
in combination with the clause,
!$omp task depend(mutexinoutset: nel(:,i))
again this happens to work because the dependency graph is very simple.
In all cases the body of the task remains the same.
Which of the above-mentioned approaches is the recommended one? Am I using the multidependency feature "correctly"?
!$omp task depend(iterator(it1=i:i+1,it2=i:i+1), mutexinoutset: A(it1,it2)) firstprivate(i). I don't think that you will gain an performance advantage over using atomic updates of the array elements. OpenMP tasks introduce a heavy overhead that needs significant work (>10**5 instructions) per task to amortize.