The snippet is a generic version of what I am trying to accomplish.
program main
integer, save:: j,Nj,k,Nk,AllocateStatus
!double precision:: array(100000)
double precision, dimension(:), allocatable:: array
INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
+ OMP_GET_THREAD_NUM
Nj = 100000
Nk = 100
allocate(array(Nk),STAT = AllocateStatus)
IF (AllocateStatus /= 0) STOP "*** Not enough memory ***"
array = 0.0
!$OMP PARALLEL PRIVATE(NTHREADS, TID)
!$OMP DO
do j=1, Nj
!print *, "id", OMP_GET_THREAD_NUM()
!DO COMPUTATIONALLY INTENSIVE PART
do k=1, Nk
array(k)=array(k)+1
enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
print *, array
stop
end
In the non-openmp version every element of array will be 100000. With openmp as used in the snippet, I get array elements of about 99000. I'm not clear on what change I need to make to get the openmp version to yield the same output as the serial version.
EDIT:
The operations done in the outer loop have no dependence on each other, but the output of these operations needs be cumulatively tracked in a variable like array. So a temporary array for each thread which can then be combined once the outer loop is complete should work, but I don't know how to do the reduction. Does the code below make sense?
program main
integer, save:: j,Nj,k,Nk,AllocateStatus
!double precision:: array(100000)
double precision, dimension(:), allocatable:: a_tmp,a_total
INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
+ OMP_GET_THREAD_NUM
Nj = 100000
Nk = 100
allocate(a_tmp(Nk),STAT = AllocateStatus)
IF (AllocateStatus /= 0) STOP "*** Not enough memory ***"
allocate(a_total(Nk),STAT = AllocateStatus)
IF (AllocateStatus /= 0) STOP "*** Not enough memory ***"
a_tmp = 0.0
a_total = 0.0
!$OMP PARALLEL PRIVATE(NTHREADS, TID,a_tmp)
!$OMP DO
do j=1, Nj
!print *, "id", OMP_GET_THREAD_NUM()
do k=1, Nk
a_tmp(k)=a_tmp(k)+1
enddo
enddo
!$OMP END DO
a_total=a_total+a_tmp
!$OMP END PARALLEL
print *, a_total
stop
end