Can you share what the loop was doing?
Here is the source code. The figures are summary data.
include 'integer_kind_module.f90'
include 'precision_module.f90'
include 'timing_module.f90'
program ch3305
use timing_module
use precision_module
use omp_lib
use iso_fortran_env
implicit none
! Original/Book 10,000,000
!
! Latest version has a variable memory footprint
!
!
! This version checks the memory allocation
! for each loop
!
integer :: n
integer :: allocation_status
integer, parameter :: start_size = 128*1024*1024
integer, parameter :: loop_count = 10
integer, parameter :: n_types = 4
integer :: i
integer :: j
integer :: k
integer :: l
integer :: nthreads
real (dp), allocatable, dimension (:) :: x
real (dp), allocatable, dimension (:) :: y
real (dp), allocatable, dimension (:) :: z
real, dimension (n_types, loop_count) :: timing_details = 0.0
real, dimension (n_types) :: t_sum = 0.0
real, dimension (n_types) :: t_average = 0.0
real :: reset = 0.0
character (15), dimension (n_types) :: heading_1 = &
[ ' Whole array  ', &
' Do loop      ', &
' Do concurrent', &
' openmp       ' ]
print *, ' '
print *,compiler_version()
print *,compiler_options()
print *, ' '
nthreads = omp_get_max_threads()
open (unit=20, file='ch3305.dat')
print 100, nthreads
100 format (' Nthreads = ', i3)
!
! Dynamic allocation
!
! The loop count l depends on the amount
! of memory the system has.
!
! I use native Windows and Linux installs
! and for these systems the memory is the
! actual physical memory.
!
! I also use
!
! Linux under wsl
! Linux under hyper-v
!
! Both of these have less than the physical memory available.
!
! Adjust l accordingly
!
! I use the following values with the systems I have
!
! 192 GB system - l=8
! 32 GB system - l=6
! 16 GB system - l=3
l=8
do k=1,l
print *,' '
call start_timing()
print *,' '
n = k * start_size
print *,''
print *,' Problem size = ' , k*128 , ' * 1024 * 1024'
print *,''
allocate (x(n),stat=allocation_status)
if (allocation_status > 0) then
print *,' Allocation error'
print *,' Program terminates'
stop 10
end if
allocate (y(n),stat=allocation_status)
if (allocation_status > 0) then
print *,' Allocation error'
print *,' Program terminates'
stop 20
end if
allocate (z(n),stat=allocation_status)
if (allocation_status > 0) then
print *,' Allocation error'
print *,' Program terminates'
stop 30
end if
!
! Initialisation
!
call random_number(x)
call random_number(y)
z = 0.0_dp
print 110, time_difference()
110 format (' Initialise time = ', f12.6)
write (20, 120) x(1), y(1), z(1)
120 format (3(2x,f6.3))
print *, ' '
do j = 1, loop_count
print 130, j
130 format (' Iteration = ', i3)
!
! Whole array syntax
!
z = x + y
timing_details(1, j) = time_difference()
write (20, 120) x(1), y(1), z(1)
z = 0.0_dp
reset = time_difference()
!
! Simple traditional do loop
!
do i = 1, n
z(i) = x(i) + y(i)
end do
timing_details(2, j) = time_difference()
z = 0.0_dp
reset = time_difference()
!
! do concurrent loop
!
do concurrent (i=1:n)
z(i) = x(i) + y(i)
end do
timing_details(3, j) = time_difference()
write (20, 120) x(1), y(1), z(1)
z = 0.0_dp
reset = time_difference()
!
! OpenMP parallel loop
!
!$omp parallel do
do i = 1, n
z(i) = x(i) + y(i)
end do
!$omp end parallel do
timing_details(4, j) = time_difference()
write (20, 120) x(1), y(1), z(1)
z = 0.0_dp
reset = time_difference()
end do
print 140
140 format (15x, ' Sum Average')
do i = 1, n_types
t_sum(i) = sum(timing_details(i,1:loop_count))
t_average(i) = t_sum(i)/loop_count
print 150, heading_1(i), t_sum(i), t_average(i)
150 format (a, 2(3x,f12.6))
end do
deallocate (x)
deallocate (y)
deallocate (z)
print *, ' '
call end_timing()
print *,' '
end do
close (20)
end program
Some asked for the include files. Here they are:
module integer_kind_module
implicit none
integer, parameter :: i8 = selected_int_kind(2)
integer, parameter :: i16 = selected_int_kind(4)
integer, parameter :: i32 = selected_int_kind(9)
integer, parameter :: i64 = selected_int_kind(15)
end module
module precision_module
implicit none
! Precision support by compiler
! May 2025
! Number of bits 16 32 64 80 128
!
! NAG Y Y Y Y
! Intel Y Y Y
! gfortran Y Y Y Y
! Cray Y Y Y
! nvidia Y Y
! AMD Y Y Y
! Silverfrost Y Y Y
! nee Salford
! Sun/Oracle Y Y Y
! single, double, quad naming used by lapack.
! sp - single precision
! dp - double precision
! qp - quad precision
!
! We use
! hp - half precision
! The Fortran 2008 standard introduced
! REAL32
! REAL64
! REAL128
! The 2023 standard introduced
! REAL16
! Our default kind module
integer, parameter :: hp = selected_real_kind( 3, 4)
integer, parameter :: sp = selected_real_kind( 6, 37)
integer, parameter :: dp = selected_real_kind(15, 307)
integer, parameter :: qp = selected_real_kind(30, 291)
! We use
! integer, parameter :: r80 = selected_real_kind(18, 291)
! for 80 bit support in the gfortran and
! Salford/Silverfrost compilers
end module
module timing_module
use integer_kind_module
use precision_module
implicit none
integer, dimension (8), private :: dt
real (dp) :: r_count
real (dp) :: r_count_rate
real (dp) :: start_time = 0.0_dp
real (dp) :: end_time = 0.0_dp
real (dp) :: last_time = 0.0_dp
real (dp) :: total_time = 0.0_dp
real (dp) :: difference = 0.0_dp
integer (i64) :: count,count_rate,count_max
integer (i64) , parameter :: nag_count_rate = 10000000
integer (i64) , parameter :: gfortran_count_rate = 1000000000
integer (i64) , parameter :: intel_count_rate = 1000000
contains
subroutine start_timing()
implicit none
call date_and_time(values=dt)
print 100, dt(1:3), dt(5:8)
100 format (1x, i4, '/', i2, '/', i2, 1x, i2, ':', i2, ':', i2, 1x, i3)
call system_clock(count,count_rate,count_max)
r_count = count
r_count_rate = count_rate
start_time = r_count/r_count_rate
last_time = start_time
end subroutine start_timing
subroutine end_timing()
implicit none
call date_and_time(values=dt)
print 100, dt(1:3), dt(5:8)
100 format (1x, i4, '/', i2, '/', i2, 1x, i2, ':', i2, ':', i2, 1x, i3)
call system_clock(count)
r_count = count
end_time = r_count/r_count_rate
total_time = end_time - start_time
print 200,total_time
200 format(' Total time = ',12x,f18.6)
end subroutine end_timing
subroutine print_time_difference()
implicit none
call system_clock(count)
r_count = count
end_time = r_count/r_count_rate
difference = end_time - last_time
last_time = end_time
print 100, difference
100 format(' Time difference = ',f18.6)
end subroutine print_time_difference
function time_difference()
implicit none
real (dp) :: time_difference
call system_clock(count)
r_count = count
end_time = r_count/r_count_rate
time_difference = end_time - last_time
last_time = end_time
end function time_difference
end module
I apologise for the lost formatting. I obviously got the use of the </> button wrong.
@cmaapic Thanks for the code.
I tried to use it on my Ryzen 5900X, which has dual-channel memory, and the multi-thread testing showed that the do loop being used is:
do i = 1, n
z(i) = x(i) + y(i)
end do
This loop is not suitable for multi-thread testing, as there is insufficient computation per cycle, so all my test is showing is that performance is limited by memory bandwidth.
(When n is much larger than the L3 cache, memory bandwidth is the problem, while when n is small the multi-threading overhead, whether OpenMP or another approach, outweighs any calculation advantage.)
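As a rough worked number (the bandwidth figure is purely an assumption, say ~45 GB/s for dual-channel memory, with n = 128 x 1024 x 1024): z = x + y moves at least 3 x 8 bytes per element, about 3.2 GB per sweep, so the bandwidth floor is roughly 3.2 / 45, about 0.07 s per iteration, regardless of how many threads run.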
This is the result I would expect using multi-thread DO CONCURRENT on my Ryzen 5900X.
The problem I see is that there needs to be more calculation per iteration or less average memory access per cycle.
I do not have quad-channel or 8-channel memory processors available to see if they are more suitable for DO CONCURRENT loops.
Assuming there is no error in my testing approach, I would conclude that similar do concurrent loops (with little computation per cycle) are not suitable for multi-thread off-loading. Which makes me wonder: what types of DO CONCURRENT loops are suitable? Or what types of hardware? It looks to me that the sweet spot for DO CONCURRENT off-loading is very limited.
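As a hypothetical illustration of that crossover (this kernel is my own invention, not part of the posted test, and reuses the x, y, z and n of the program above): keeping the same three-array memory traffic but chaining a few multiply-adds per element raises the flop-to-byte ratio, so the threads have something to win.
! Sketch: same memory traffic as z = x + y, but several chained
! multiply-adds per element, so the loop is no longer purely
! memory-bandwidth bound.
do concurrent (i=1:n)
  z(i) = x(i) + y(i)*(x(i) + y(i)*(x(i) + y(i)*x(i)))
end do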
@cmaapic, what processor type have you used for your tests ?
The following is the modified test code I used, where I have added more detailed !$OMP annotation in an attempt to identify a performance improvement.
module integer_kind_module
implicit none
integer, parameter :: i8 = selected_int_kind(2)
integer, parameter :: i16 = selected_int_kind(4)
integer, parameter :: i32 = selected_int_kind(9)
integer, parameter :: i64 = selected_int_kind(15)
end module integer_kind_module
module precision_module
implicit none
! Precision support by compiler
! May 2025
! Number of bits 16 32 64 80 128
!
! NAG Y Y Y Y
! Intel Y Y Y
! gfortran Y Y Y Y
! Cray Y Y Y
! nvidia Y Y
! AMD Y Y Y
! Silverfrost Y Y Y
! nee Salford
! Sun/Oracle Y Y Y
! single, double, quad naming used by lapack.
! sp - single precision
! dp - double precision
! qp - quad precision
!
! We use
! hp - half precision
! The Fortran 2008 standard introduced
! REAL32
! REAL64
! REAL128
! The 2023 standard introduced
! REAL16
! Our default kind module
integer, parameter :: hp = selected_real_kind( 3, 4)
integer, parameter :: sp = selected_real_kind( 6, 37)
integer, parameter :: dp = selected_real_kind(15, 307)
integer, parameter :: qp = selected_real_kind(30, 291)
! We use
! integer, parameter :: r80 = selected_real_kind(18, 291)
! for 80 bit support in the gfortran and
! Salford/Silverfrost compilers
end module precision_module
module timing_module
use integer_kind_module
use precision_module
implicit none
real (dp) :: r_count
real (dp) :: r_count_rate
real (dp) :: start_time = 0.0_dp
real (dp) :: end_time = 0.0_dp
real (dp) :: last_time = 0.0_dp
real (dp) :: total_time = 0.0_dp
real (dp) :: difference = 0.0_dp
integer (i64) :: count,count_rate,count_max
integer (i64) , parameter :: nag_count_rate = 10000000
integer (i64) , parameter :: gfortran_count_rate = 1000000000
integer (i64) , parameter :: intel_count_rate = 1000000
contains
subroutine start_timing()
implicit none
call print_date_and_time
call system_clock(count,count_rate,count_max)
r_count = count
r_count_rate = count_rate
start_time = r_count/r_count_rate
last_time = start_time
end subroutine start_timing
subroutine end_timing()
implicit none
call print_date_and_time
call system_clock(count)
r_count = count
end_time = r_count/r_count_rate
total_time = end_time - start_time
print 200,total_time
200 format(' Total time = ',12x,f18.6)
end subroutine end_timing
function time_difference ()
implicit none
real (dp) :: time_difference
call system_clock (count)
r_count = count
end_time = r_count/r_count_rate
time_difference = end_time - last_time
last_time = end_time
end function time_difference
subroutine print_time_difference ()
implicit none
difference = time_difference ()
print 100, difference
100 format(' Time difference = ',f18.6)
end subroutine print_time_difference
subroutine print_date_and_time
integer, dimension (8) :: dt ! , private only in module
call date_and_time(values=dt)
print 100, dt(1:3), dt(5:8)
100 format (1x, i4, '/', i2, '/', i2, 1x, i2, ':', i2, ':', i2, 1x, i3)
end subroutine print_date_and_time
end module timing_module
! include 'integer_kind_module.f90'
! include 'precision_module.f90'
! include 'timing_module.f90'
program ch3305
use timing_module
use precision_module
use omp_lib
use iso_fortran_env
implicit none
! Original/Book 10,000,000
!
! Latest version has a variable memory footprint
!
!
! Dynamic allocation
!
! The loop count l depends on the amount
! of memory the system has.
!
! I use native Windows and Linux installs
! and for these systems the memory is the
! actual physical memory.
!
! I also use
!
! Linux under wsl
! Linux under hyper-v
!
! Both of these have less than the physical memory available.
!
! Adjust size_count accordingly
!
! I use the following values with the systems I have
!
! 192 GB system - size_count=8
! 32 GB system - size_count=6
! 16 GB system - size_count=3
integer, parameter :: size_count = 7 ! 64 GBytes
!
! This version checks the memory allocation
! for each loop
!
integer :: n
integer :: allocation_status
integer, parameter :: start_size = 128*1024*1024 ! 128 mega variables
! integer, parameter :: start_size = 16*1024*1024 ! 16 million for quicker tests
integer, parameter :: loop_count = 10
integer, parameter :: n_types = 4
integer :: i
integer :: j
integer :: k
integer :: nthreads, threads_to_use
real (dp), allocatable, dimension (:) :: x
real (dp), allocatable, dimension (:) :: y
real (dp), allocatable, dimension (:) :: z
real (dp) :: zz
real (dp), dimension (loop_count, n_types) :: timing_details = 0.0
real (dp), dimension (size_count, n_types) :: t_sum = 0.0
real (dp), dimension (size_count, n_types) :: t_average = 0.0
real (dp) :: reset = 0.0
real :: gbytes
character (15), dimension (n_types) :: heading_1 = &
[ ' Whole array ', &
' Do loop ', &
' Do concurrent', &
' openmp ' ]
print * , ' '
print *,compiler_version()
print *,compiler_options()
print *, ' '
nthreads = omp_get_max_threads()
threads_to_use = min ( 6, nthreads ) ! limit threads to better see change
print 100, nthreads, threads_to_use
100 format (' Max threads = ', i3 / &
' threads_to_use = ', i3 )
open (unit=20, file='ch3305.dat')
do k=1,size_count
print *,' '
call start_timing()
print *,' '
n = k * start_size
gbytes = n * 3. * 8. / 1024. / 1024. / 1024.
write (*,*) gbytes, ' gbytes to be allocated',k
print *,' '
print *,' Problem size = ' , k*128 , ' * 1024 * 1024'
print *,' '
allocate (x(n),stat=allocation_status)
if (allocation_status > 0) then
print *,' Allocation error'
print *,' Program terminates'
stop 10
end if
allocate (y(n),stat=allocation_status)
if (allocation_status > 0) then
print *,' Allocation error'
print *,' Program terminates'
stop 20
end if
allocate (z(n),stat=allocation_status)
if (allocation_status > 0) then
print *,' Allocation error'
print *,' Program terminates'
stop 30
end if
!
! Initialisation
!
call random_number(x)
call random_number(y)
z = 0.0_dp
timing_details = 0
print 110, time_difference ()
110 format (' Initialise time = ', f12.6)
write (20, 120) x(1), y(1), z(1)
120 format (3(2x,f6.3))
print *, ' '
do j = 1, loop_count
print 130, j
130 format (' Iteration = ', i3)
!
! Whole array syntax
!
z = x + y
timing_details(j, 1) = time_difference ()
write (20, 120) x(1), y(1), z(1)
z = 0.0_dp
reset = time_difference ()
!
! Simple traditional do loop
!
do i = 1, n
z(i) = x(i) + y(i)
end do
timing_details(j, 2) = time_difference ()
z = 0.0_dp
reset = time_difference ()
!
! do concurrent loop
!
do concurrent (i=1:n)
z(i) = x(i) + y(i)
end do
timing_details(j, 3) = time_difference ()
write (20, 120) x(1), y(1), z(1)
z = 0.0_dp
reset = time_difference ()
!
! OpenMP parallel loop
!
zz = 0
!$OMP PARALLEL DO DEFAULT (NONE) &
!$OMP& SHARED (n,x,y,z) PRIVATE (i) &
! !$OMP& REDUCTION(+ : zz) &
!$OMP& SCHEDULE (STATIC) &
!$OMP& num_threads (threads_to_use)
do i = 1, n
z(i) = x(i) + y(i)
! zz = zz + x(i) + y(i)
end do
!$OMP END PARALLEL DO
timing_details(j, 4) = time_difference ()
write (20, 120) x(1), y(1), z(1) ! zz
z = 0.0_dp
reset = time_difference ()
end do ! j loop_count
print 140
140 format (15x, ' Sum Average')
do i = 1, n_types
t_sum(k,i) = sum( timing_details(1:loop_count,i) )
t_average(k,i) = t_sum(k,i)/loop_count
print 150, heading_1(i), t_sum(k,i), t_average(k,i)
150 format (a, 2(3x,f12.6))
end do
deallocate (x)
deallocate (y)
deallocate (z)
print *, ' '
call end_timing()
print *,' '
end do ! k size_count
do i = 1, n_types
print 160, heading_1(i), (t_average(k,i), k=1,size_count)
160 format (a, 20f12.6 )
end do
close (20)
end program ch3305
This is why I prefer using:
do concurrent(j=1:ny,k=1:nk,i=1:nx)
c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
end do
I'd rather be doing cubic work.
(A lot of this work is to demonstrate the capabilities for 3-D type loops. A vector-like loop will only be good if the dimension of the vector is long enough, of course.)
@jorgeg
I still see the same trivial calculation per iteration.
I am attempting to use the following MATMUL as a test, but is it a valid do concurrent loop?
z = 0.0_dp
do concurrent (ii=1:n,jj=1:n,kk=1:n)
z(ii,kk) = z(ii,kk) + x(ii,jj) * y(jj,kk)
end do
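As written, iterations with different jj read and update the same z(ii,kk), which do concurrent does not permit (short of the Fortran 2023 reduce clause). A minimal race-free rearrangement, reusing the same names, keeps the reduction index as an ordinary inner loop:
! Sketch: jj is now a plain sequential loop, so each (ii,kk)
! element is accumulated by exactly one concurrent iteration.
do concurrent (ii=1:n, kk=1:n)
  z(ii,kk) = 0.0_dp
  do jj = 1, n
    z(ii,kk) = z(ii,kk) + x(ii,jj) * y(jj,kk)
  end do
end do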
The way OpenMP can help here is to use an if clause as a cutoff:
!$omp parallel do if(n > 100000)
do i = 1, n
z(i) = x(i) + y(i)
end do
A second alternative would be to use a "thread throttling" function get_optimal_nthreads (derived empirically), like:
!$omp parallel do num_threads(get_optimal_nthreads(n))
do i = 1, n
z(i) = x(i) + y(i)
end do
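A minimal sketch of what such a function might look like; the thresholds below are placeholders I made up and would have to be calibrated per machine and per kernel:
! Hypothetical thread-throttling function; cutoff values are
! placeholders to be derived empirically on the target system.
integer function get_optimal_nthreads(n) result(nt)
  use omp_lib, only : omp_get_max_threads
  implicit none
  integer, intent(in) :: n
  if (n < 10000) then
    nt = 1                               ! too small: threading overhead dominates
  else if (n < 1000000) then
    nt = min(4, omp_get_max_threads())   ! modest parallelism
  else
    nt = omp_get_max_threads()           ! large n: bandwidth decides anyway
  end if
end function get_optimal_nthreads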
Some of the performance libraries, like OpenBLAS, do such things; for instance: OpenBLAS/kernel/arm64/dot.c at f6df9bebbb4259aa61ab5634c0f1269fb152cc0e · OpenMathLib/OpenBLAS · GitHub.
It gets complicated to do this for multiple CPU architectures.
Would you know if there is a nice way to print out information about what the compiler is doing with omp target / do concurrent loops, similar to NVIDIA's -Minfo=? I can get the optimization log, but it is 15k lines long and very, very detailed (this is using -Rpass-analysis or -Rpass). If there is not, where can I make some noise to ask for something similar? It is very useful to see how things are being parallelized without having to look at the assembly, haha.
This is an interesting example. If using !$OMP, my preferred approach is:
!$OMP parallel do shared (alpha,a,b,c,ny,nk) private(j,k) schedule(static)
do k=1,nk
do j=1,ny
c(:,j,k) = alpha * a(:,j,k) + b(:,j,k)
end do
end do
!$OMP end parallel do
I am assuming:
- nk is larger than num_threads, so that collapse is not effective or required (I think collapse is over-used)
- the inner loop i=1:nx is replaced by array syntax, to highlight that AVX instructions would be used
- my explicit do order enhances sequential memory addressing, which I wonder whether do concurrent would always arrange.
The problem with this example is again whether the calculation can be enough to overcome the OpenMP startup cost, while not demanding so much memory bandwidth across the threads that any gains are stalled, especially with dual-channel memory.
The other problem I see is that for AVX to be effective, the inner loop vectors have to be in L1 cache to get high Gflop performance. This can best be achieved if there is repeated use of these vectors.
It is very rare that all of these targets can be achieved.
Most do concurrent examples (as with most of my calculations) appear to target large memory vectors with trivial calculations, which makes it difficult to identify multi-thread efficiency.
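As a sketch of the "repeated use" idea (my own illustration, with made-up blk and npass values): process the arrays in L1-sized chunks and make the repeated passes over a chunk while it is still cached, rather than sweeping the full arrays on every pass. This only applies when the algorithm naturally revisits the data:
! Sketch: blk is a placeholder (3 x 1024 x 8 bytes = 24 KiB,
! under a 32 KiB L1d); npass is assumed to come from the algorithm.
integer, parameter :: blk = 1024
do jb = 1, n, blk
  je = min(jb+blk-1, n)
  do pass = 1, npass          ! reuse the block while it is hot
    z(jb:je) = z(jb:je) + x(jb:je)*y(jb:je)
  end do
end do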
My main development machine is a Dell Precision workstation. It has an Intel I9 10980XE processor (18 cores/36 with hyper threading).
@cmaapic thanks for the update. This processor has 4-channel memory, which might explain why your tests show some improvement with OpenMP for the test calculation, unlike my 2-channel memory processor.
Interesting example, but I am not completely sure that array syntax is effective at vectorization. I’ve found several cases where inner loops are faster than array syntax. Have you measured the performance of your approach (2 nested loops plus array syntax) vs 3 nested loops?
@aledinola If you had the following, with appropriate compiler options and all arrays contiguous, I would be confident that AVX would be used (i.e. all vectors are contiguous):
c(1:nx,j,k) = alpha * a(1:nx,j,k) + b(1:nx,j,k)
However, if there were non-contiguous vectors, as in something like the following layout, then loading the L1 cache would incur delays:
c(1:nx,j,k) = alpha * a(j,1:nx,k) + b(1:nx,j,k)
It would be interesting to understand these cases, as probably the inner loop would also have been unrolled and replaced with AVX instructions.
I had prepared some time ago a comparison of loops vs array syntax in the context of a dynamic programming problem, see FortranVec/src/main.f90 at main · aledinola/FortranVec · GitHub
The loop-based code is in bellman_op, the code with array syntax is in bellman_op_vec and bellman_op_vec2. There was a discussion on this forum, see Performance of vectorized code in ifort and ifx
It turns out that this code
v_max = large_negative
ap_ind = 0
! Choose a' optimally by stepping through all possible values
do ap_c=1,n_a
aprime_val = ap_grid(ap_c)
cons = R*a_val + z_val - aprime_val
if (cons>0.0d0) then
v_temp = f_util(cons) + beta*EV(ap_c,z_c)
!v_temp = f_util(cons) + beta*sum(v(ap_c,:)*z_tran(z_c,:))
if (v_temp>v_max) then
v_max = v_temp
ap_ind = ap_c
end if
endif
enddo !end a'
is faster than this
cons = R*a_val + z_val - ap_grid ! (n_ap,1)
! NOTE: where and merge are slower than forall
! NOTE: forall and do concurrent are equivalent with ifort
! but do concurrent is very slow with ifx!
!where (cons>0.0d0)
! util = f_util(cons) ! (n_ap,1)
!elsewhere
! util = large_negative
!end where
!util = merge(f_util(cons),large_negative,cons>0.0d0)
! v_temp = large_negative
! do concurrent (ap_c=1:n_a, cons(ap_c)>0.0d0)
! v_temp(ap_c) = f_util(cons(ap_c))+beta*EV(ap_c,z_c)
! enddo
v_temp = large_negative
forall (ap_c=1:n_a, cons(ap_c)>0.0d0)
v_temp(ap_c) = f_util(cons(ap_c))+beta*EV(ap_c,z_c)
end forall
ap_ind = maxloc(v_temp,dim=1)
Please see my repo for more information (in the second block of code you can replace the forall with do concurrent or merge or where if you don’t like forall being obsolete)
Back then (in 2024) we realized that there were also interesting performance differences between ifort and ifx (thanks to @ivanpribec for measuring running times appropriately) with ifort being significantly faster. It would be interesting to run again this test to see if ifx has caught up.
@aledinola has shown us that FORALL is sometimes faster than DO CONCURRENT. Could its obsolescent status be removed?
Thanks for your comment, @Harper. I would say that the example I posted above shows that, in that setting, array syntax is slower than loops. I'm not sure about the comparison of forall vs do concurrent. There might be differences between ifort and ifx, plus the last time I ran it was in 2024. It could be that more recent versions of the ifx compiler (or other compilers) have improved the implementation of do concurrent.
In general this can happen for several reasons. The one being discussed here is whether one form compiles to AVX instructions and the other doesn't. Another common reason is that array syntax requires the evaluation to proceed as if the rhs of the expression were completely evaluated before assignment to the lhs. Sometimes the compiler thinks that requires a temporary array allocation, which is a slow operation. DO CONCURRENT, on the other hand, tells the compiler that no such memory allocation is required, and it is the programmer's fault if that is not really the case.
In my opinion, this was a mistake in the design of the array syntax when it was introduced in f90. There should have been a way for the programmer to use the convenient array syntax AND ALSO to tell the compiler that no memory allocation is required to evaluate the expression. At the time array syntax was added, there were various compiler directives already available for the programmer to tell the compiler exactly that. What was needed then was for the Fortran language to make that practice standard, ideally both with and without array syntax. The frustrating thing is that even after this design mistake was recognized, it was never corrected, and here it is again, now in the context of AVX instructions, causing problems and confusion.
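A minimal sketch of the temporary-array point (my own example, not from the posts above): with an overlapping shift, the assignment semantics say "evaluate the whole rhs first", and a compiler that cannot prove a safe traversal order may allocate a hidden temporary; the loop form over distinct arrays implies no temporary, and do concurrent makes the no-interference promise explicit.
! Overlap: the rhs must behave as if fully evaluated before any
! element of x changes, so a compiler may create a temporary.
x(2:n) = x(1:n-1) + x(2:n)
! Distinct arrays: no temporary is implied, and do concurrent
! asserts the iterations are independent (true here: y and x
! do not overlap).
do concurrent (i=2:n)
  y(i) = x(i-1) + x(i)
end do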
The two code examples are different, although the faster loop has more computation?
The use of an IF inside a DO loop can reduce the AVX efficiency.
What is the probability that cons <= 0? If low, perhaps f_util could better manage this, i.e. adopt strategies to eliminate the IF and improve AVX efficiency.
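For instance, one hedged sketch of such a strategy, assuming f_util is elemental (which I have not checked in the repo) and using a placeholder guard value eps: evaluate unconditionally on a clamped argument and select with merge, so the loop body is branch-free.
! Branch-free sketch: clamp cons so f_util always receives a
! valid argument, then choose result or penalty with merge.
! Assumes f_util is elemental; eps is a placeholder guard;
! EV(:,z_c) assumes the first dimension of EV is n_a.
real(8), parameter :: eps = 1.0d-12
cons   = R*a_val + z_val - ap_grid
v_temp = merge(f_util(max(cons, eps)) + beta*EV(:,z_c), &
               large_negative, cons > 0.0d0)
ap_ind = maxloc(v_temp, dim=1)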
Thanks @aledinola. I was relying on the second Note in your program’s comments.