DO CONCURRENT: compiler flags to enable parallelization

Can you share what the loop was doing?


Here is the source code. The figures are summary data.

include 'integer_kind_module.f90'
include 'precision_module.f90'

include 'timing_module.f90'

program ch3305

  use timing_module
  use precision_module
  use omp_lib

  use iso_fortran_env

  implicit none

! Original/Book 10,000,000
!
! Latest version has a variable memory footprint
!

!
! This version checks the memory allocation
! for each loop
!

  integer :: n
  integer :: allocation_status

  integer, parameter :: start_size = 128*1024*1024
  integer, parameter :: loop_count = 10
  integer, parameter :: n_types = 4
  integer :: i
  integer :: j
  integer :: k
  integer :: l
  integer :: nthreads

  real (dp), allocatable, dimension (:) :: x
  real (dp), allocatable, dimension (:) :: y
  real (dp), allocatable, dimension (:) :: z

  real, dimension (n_types, loop_count) :: timing_details = 0.0
  real, dimension (n_types) :: t_sum = 0.0
  real, dimension (n_types) :: t_average = 0.0
  real :: reset = 0.0

  character (15), dimension (n_types) :: heading_1 = &
  [ ' Whole array  ', &
    ' Do loop      ', &
    ' Do concurrent', &
    ' openmp       ' ]

  print *, ' '
  print *, compiler_version()
  print *, compiler_options()
  print *, ' '

  nthreads = omp_get_max_threads()
  open (unit=20, file='ch3305.dat')
  print 100, nthreads
100 format (' Nthreads = ', i3)

!
! Dynamic allocation
!
! The loop count l depends on the amount
! of memory the system has.
!
! I use native Windows and Linux installs
! and for these systems the memory is the
! actual physical memory.
!
! I also use
!
! Linux under wsl
! Linux under hyper-v
!
! Both of these have less than the physical memory available.
!
! Adjust l accordingly
!
! I use the following values with the systems I have
!
! 192 GB system - l=8
! 32 GB system  - l=6
! 16 GB system  - l=3

  l = 8

  do k = 1, l

    print *, ''
    call start_timing()
    print *, ''

    n = k * start_size

    print *, ''
    print *, ' Problem size = ', k*128, ' * 1024 * 1024'
    print *, ''

    allocate (x(n), stat=allocation_status)

    if (allocation_status > 0) then
      print *, ' Allocation error'
      print *, ' Program terminates'
      stop 10
    end if

    allocate (y(n), stat=allocation_status)

    if (allocation_status > 0) then
      print *, ' Allocation error'
      print *, ' Program terminates'
      stop 20
    end if

    allocate (z(n), stat=allocation_status)

    if (allocation_status > 0) then
      print *, ' Allocation error'
      print *, ' Program terminates'
      stop 30
    end if

!
! Initialisation
!

    call random_number(x)
    call random_number(y)
    z = 0.0_dp

    print 110, time_difference()
110 format (' Initialise time = ', f12.6)
    write (20, 120) x(1), y(1), z(1)
120 format (3(2x,f6.3))
    print *, ' '

    do j = 1, loop_count

      print 130, j
130   format (' Iteration = ', i3)

!
! Whole array syntax
!

      z = x + y
      timing_details(1, j) = time_difference()
      write (20, 120) x(1), y(1), z(1)
      z = 0.0_dp
      reset = time_difference()

!
! Simple traditional do loop
!

      do i = 1, n
        z(i) = x(i) + y(i)
      end do
      timing_details(2, j) = time_difference()
      z = 0.0_dp
      reset = time_difference()

!
! do concurrent loop
!

      do concurrent (i=1:n)
        z(i) = x(i) + y(i)
      end do
      timing_details(3, j) = time_difference()
      write (20, 120) x(1), y(1), z(1)
      z = 0.0_dp
      reset = time_difference()

!
! OpenMP parallel loop
!

!$omp parallel do
      do i = 1, n
        z(i) = x(i) + y(i)
      end do
!$omp end parallel do

      timing_details(4, j) = time_difference()
      write (20, 120) x(1), y(1), z(1)
      z = 0.0_dp
      reset = time_difference()

    end do

    print 140
140 format (15x, ' Sum Average')

    do i = 1, n_types
      t_sum(i) = sum(timing_details(i,1:loop_count))
      t_average(i) = t_sum(i)/loop_count
      print 150, heading_1(i), t_sum(i), t_average(i)
150   format (a, 2(3x,f12.6))
    end do

    deallocate (x)
    deallocate (y)
    deallocate (z)

    print *, ' '
    call end_timing()
    print *, ' '

  end do

  close (20)

end program


Some asked for the include files. Here they are:

module integer_kind_module
implicit none
integer, parameter :: i8 = selected_int_kind(2)
integer, parameter :: i16 = selected_int_kind(4)
integer, parameter :: i32 = selected_int_kind(9)
integer, parameter :: i64 = selected_int_kind(15)
end module

module precision_module

implicit none

! Precision support by compiler

! May 2025

! Number of bits 16 32 64 80 128
!
! NAG         Y Y Y Y
! Intel       Y Y Y
! gfortran    Y Y Y Y
! Cray        Y Y Y
! nvidia      Y Y
! AMD         Y Y Y
! Silverfrost Y Y Y
! nee Salford
! Sun/Oracle  Y Y Y

! single, double, quad naming used by lapack.

! sp - single precision
! dp - double precision
! qp - quad precision
!
! We use

! hp - half precison

! The Fortran 2008 standard introduced

! REAL32
! REAL64
! REAL128

! The 2023 standard introduced

! REAL16

! Our default kind module

integer, parameter :: hp = selected_real_kind( 3, 4)
integer, parameter :: sp = selected_real_kind( 6, 37)
integer, parameter :: dp = selected_real_kind(15, 307)
integer, parameter :: qp = selected_real_kind(30, 291)

! We use

! integer, parameter :: r80 = selected_real_kind(18, 291)

! for 80 bit support in the gfortran and
! Salford/Silverfrost compilers

end module

module timing_module

use integer_kind_module
use precision_module

implicit none

integer, dimension (8), private :: dt

real (dp) :: r_count
real (dp) :: r_count_rate

real (dp) :: start_time = 0.0_dp
real (dp) :: end_time = 0.0_dp
real (dp) :: last_time = 0.0_dp
real (dp) :: total_time = 0.0_dp

real (dp) :: difference = 0.0_dp

integer (i64) :: count,count_rate,count_max
integer (i64) , parameter :: nag_count_rate = 10000000
integer (i64) , parameter :: gfortran_count_rate = 1000000000
integer (i64) , parameter :: intel_count_rate = 1000000

contains

subroutine start_timing()

implicit none

call date_and_time(values=dt)
print 100, dt(1:3), dt(5:8)
100 format (1x, i4, '/', i2, '/', i2, 1x, i2, ':', i2, ':', i2, 1x, i3)

call system_clock(count,count_rate,count_max)
r_count      = count
r_count_rate = count_rate
start_time   = r_count/r_count_rate
last_time    = start_time

end subroutine start_timing

subroutine end_timing()

implicit none

call date_and_time(values=dt)
print 100, dt(1:3), dt(5:8)
100 format (1x, i4, '/', i2, '/', i2, 1x, i2, ':', i2, ':', i2, 1x, i3)

call system_clock(count)
r_count      = count
end_time     = r_count/r_count_rate

total_time   = end_time - start_time
print 200,total_time
200 format(' Total time = ',12x,f18.6)

end subroutine end_timing

subroutine print_time_difference()

implicit none

call system_clock(count)
r_count      = count
end_time     = r_count/r_count_rate

difference = end_time - last_time
last_time       = end_time

print 100, difference
100 format(' Time difference =  ',f18.6)

end subroutine print_time_difference

function time_difference()

implicit none

real (dp) :: time_difference

call system_clock(count)
r_count      = count
end_time     = r_count/r_count_rate

time_difference = end_time - last_time
last_time       = end_time

end function time_difference

end module

I apologise for the lost formatting. I obviously got the use of the </> wrong.

@cmaapic Thanks for the code.

I tried to use it on my Ryzen 5900X, which has dual-channel memory, and the multi-thread testing focused on the do loop being used:

      do i = 1, n
        z(i) = x(i) + y(i)
      end do

This loop is not suitable for multi-thread testing, as there is insufficient computation per cycle, so all my test is showing is that performance is limited by memory bandwidth.
(When n is much larger than the L3 cache, memory bandwidth is the problem, while when n is small the multi-threading overhead, OpenMP or any other approach, outweighs any calculation advantage.)
This is the result I would expect using multi-thread DO CONCURRENT on my Ryzen 5900X.
The problem I see is that there needs to be more calculation per iteration or less average memory access per cycle.
I do not have quad-channel or 8-channel memory processors available to see if they are more suitable for DO CONCURRENT loops.
Assuming there is no error in my testing approach, I would conclude that similar do concurrent loops (with little computation per cycle) are not suitable for multi-thread off-loading. Which makes me wonder: what types of DO CONCURRENT loops are suitable? Or what types of hardware? It looks to me that the sweet spot for DO CONCURRENT off-loading is very limited.
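To put a rough number on the bandwidth argument: each iteration of z(i) = x(i) + y(i) moves three real(dp) values (24 bytes) for a single floating-point add, so at a notional 50 GB/s for dual-channel memory (an assumed figure, for illustration only) the ceiling is about 2 Gflop/s no matter how many threads run. A minimal sketch of a kernel with more arithmetic per element, which should have more to gain from threading:

      ! Hypothetical higher-intensity kernel: the same 24 bytes of
      ! traffic per element, but far more flops, so the threads do
      ! useful work while waiting on memory.
      do concurrent (i = 1:n)
        z(i) = sqrt(x(i)*x(i) + y(i)*y(i)) + sin(x(i))*cos(y(i))
      end do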

@cmaapic, what processor type have you used for your tests?

The following is the modified test code I used, where I have added fuller !$OMP annotation in an attempt to identify a performance improvement.

 module integer_kind_module
  implicit none
  integer, parameter :: i8  = selected_int_kind(2)
  integer, parameter :: i16 = selected_int_kind(4)
  integer, parameter :: i32 = selected_int_kind(9)
  integer, parameter :: i64 = selected_int_kind(15)
 end module integer_kind_module

 module precision_module

  implicit none

! Precision support by compiler

! May 2025

! Number of bits 16 32 64 80 128
!
! NAG         Y Y Y Y
! Intel       Y Y Y
! gfortran    Y Y Y Y
! Cray        Y Y Y
! nvidia      Y Y
! AMD         Y Y Y
! Silverfrost Y Y Y
! nee Salford
! Sun/Oracle  Y Y Y

! single, double, quad naming used by lapack.

! sp - single precision
! dp - double precision
! qp - quad precision
!
! We use

! hp - half precison

! The Fortran 2008 standard introduced

! REAL32
! REAL64
! REAL128

! The 2023 standard introduced

! REAL16

! Our default kind module

  integer, parameter :: hp = selected_real_kind( 3, 4)
  integer, parameter :: sp = selected_real_kind( 6, 37)
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer, parameter :: qp = selected_real_kind(30, 291)

! We use

! integer, parameter :: r80 = selected_real_kind(18, 291)

! for 80 bit support in the gfortran and
! Salford/Silverfrost compilers

 end module precision_module

 module timing_module

  use integer_kind_module
  use precision_module

  implicit none

  real (dp) :: r_count
  real (dp) :: r_count_rate
  
  real (dp) :: start_time = 0.0_dp
  real (dp) :: end_time   = 0.0_dp
  real (dp) :: last_time  = 0.0_dp
  real (dp) :: total_time = 0.0_dp
  
  real (dp) :: difference = 0.0_dp
  
  integer (i64) :: count,count_rate,count_max
  integer (i64) , parameter :: nag_count_rate = 10000000
  integer (i64) , parameter :: gfortran_count_rate = 1000000000
  integer (i64) , parameter :: intel_count_rate = 1000000
  
  contains

   subroutine start_timing()

    implicit none

    call print_date_and_time

    call system_clock(count,count_rate,count_max)
    r_count      = count
    r_count_rate = count_rate
    start_time   = r_count/r_count_rate
    last_time    = start_time

   end subroutine start_timing

   subroutine end_timing()

    implicit none

    call print_date_and_time

    call system_clock(count)
    r_count      = count
    end_time     = r_count/r_count_rate
    
    total_time   = end_time - start_time
    print 200,total_time
200 format(' Total time = ',12x,f18.6)
   end subroutine end_timing

   function time_difference ()
   
    implicit none
   
    real (dp) :: time_difference
   
    call system_clock (count)
    r_count      = count
    end_time     = r_count/r_count_rate
    
    time_difference = end_time - last_time
    last_time       = end_time
   end function time_difference

   subroutine print_time_difference ()
    implicit none

    difference  = time_difference ()
    print 100, difference
100 format(' Time difference =  ',f18.6)
   end subroutine print_time_difference

   subroutine print_date_and_time
    integer, dimension (8) :: dt    ! , private only in module
   
    call date_and_time(values=dt)
    print 100, dt(1:3), dt(5:8)
100 format (1x, i4, '/', i2, '/', i2, 1x, i2, ':', i2, ':', i2, 1x, i3)
   end subroutine print_date_and_time

 end module timing_module

!  include 'integer_kind_module.f90'
!  include 'precision_module.f90'
!  include 'timing_module.f90'

 program ch3305

  use timing_module
  use precision_module
  use omp_lib

  use iso_fortran_env

  implicit none

! Original/Book 10,000,000
!
! Latest version has a variable memory footprint
!
!
! Dynamic allocation
!
! The loop count l depends on the amount
! of memory the system has.
!
! I use native Windows and Linux installs
! and for these systems the memory is the
! actual physical memory.
!
! I also use
!
! Linux under wsl
! Linux under hyper-v
!
! Both of these have less than the physical memory available.
!
! Adjust size_count accordingly
!
! I use the following values with the systems I have
!
! 192 GB system - size_count=8
! 32 GB system - size_count=6
! 16 GB system - size_count=3

  integer, parameter :: size_count = 7 ! 64 GBytes

!
! This version checks the memory allocation
! for each loop
!

  integer :: n
  integer :: allocation_status

  integer, parameter :: start_size = 128*1024*1024    ! 128 mega variables
!  integer, parameter :: start_size = 16*1024*1024    ! 16 million for quicker tests
  integer, parameter :: loop_count = 10
  integer, parameter :: n_types = 4
  integer :: i
  integer :: j
  integer :: k
  integer :: nthreads, threads_to_use
  
  real (dp), allocatable, dimension (:) :: x
  real (dp), allocatable, dimension (:) :: y
  real (dp), allocatable, dimension (:) :: z
  real (dp) :: zz
  
  real (dp), dimension (loop_count, n_types) :: timing_details = 0.0
  real (dp), dimension (size_count, n_types) :: t_sum = 0.0
  real (dp), dimension (size_count, n_types) :: t_average = 0.0
  real (dp) :: reset = 0.0
  real :: gbytes
  
  character (15), dimension (n_types) :: heading_1 = &
  [ ' Whole array  ', &
    ' Do loop      ', &
    ' Do concurrent', &
    ' openmp       ' ]

  print * , ' '
  print *,compiler_version()
  print *,compiler_options()
  print *, ' '

  nthreads = omp_get_max_threads()
  threads_to_use = min ( 6, nthreads )  !  limit threads to better see change
  print 100, nthreads, threads_to_use
100 format (' Max threads    = ', i3 /   &
            ' threads_to_use = ', i3 )

  open (unit=20, file='ch3305.dat')

  do k=1,size_count

    print *,' '
    call start_timing()
    print *,' '
    
    n = k * start_size 
    gbytes = n * 3. * 8. / 1024. / 1024. / 1024.
    write (*,*) gbytes, ' gbytes to be allocated',k
    
    print *,' '
    print *,' Problem size = ' , k*128 , ' * 1024 * 1024'
    print *,' '
    allocate (x(n),stat=allocation_status)
    
    if (allocation_status > 0) then
      print *,' Allocation error'
      print *,' Program terminates'
      stop 10
    end if
    
    allocate (y(n),stat=allocation_status)
    
    if (allocation_status > 0) then
      print *,' Allocation error'
      print *,' Program terminates'
      stop 20
    end if
    
    allocate (z(n),stat=allocation_status)
    
    if (allocation_status > 0) then
      print *,' Allocation error'
      print *,' Program terminates'
      stop 30
    end if

!
! Initialisation
!

    call random_number(x)
    call random_number(y)
    z = 0.0_dp
    
    timing_details = 0
    print 110, time_difference ()
110 format (' Initialise time = ', f12.6)
    write (20, 120) x(1), y(1), z(1)
120 format (3(2x,f6.3))
    print *, ' '
    
    do j = 1, loop_count
    
      print 130, j
  130 format (' Iteration = ', i3)
!
! Whole array syntax
!

      z = x + y
      timing_details(j, 1) = time_difference ()
      write (20, 120) x(1), y(1), z(1)
      z = 0.0_dp
      reset = time_difference ()
!
! Simple traditional do loop
!

      do i = 1, n
        z(i) = x(i) + y(i)
      end do
      timing_details(j, 2) = time_difference ()
      z = 0.0_dp
      reset = time_difference ()
!
! do concurrent loop
!

      do concurrent (i=1:n)
        z(i) = x(i) + y(i)
      end do
      timing_details(j, 3) = time_difference ()
      write (20, 120) x(1), y(1), z(1)
      z = 0.0_dp
      reset = time_difference ()
!
! OpenMP parallel loop
!
    zz = 0
    !$OMP   PARALLEL DO DEFAULT (NONE)    &
    !$OMP&  SHARED (n,x,y,z) PRIVATE (i)  &
!    !$OMP&  REDUCTION(+ : zz)             &
    !$OMP&  SCHEDULE (STATIC)             &
    !$OMP&  num_threads (threads_to_use) 
      do i = 1, n
        z(i) = x(i) + y(i)
!        zz = zz + x(i) + y(i)
      end do
    !$OMP END PARALLEL DO 
      
      timing_details(j, 4) = time_difference ()
      write (20, 120) x(1), y(1), z(1)   ! zz 
      z = 0.0_dp
      reset = time_difference ()

    end do   ! j loop_count

    print 140
140 format (15x, ' Sum Average')

    do i = 1, n_types
      t_sum(k,i)     = sum( timing_details(1:loop_count,i) )
      t_average(k,i) = t_sum(k,i)/loop_count
      print 150, heading_1(i), t_sum(k,i), t_average(k,i)
  150 format (a, 2(3x,f12.6))
    end do

    deallocate (x)
    deallocate (y)
    deallocate (z)

    print *, ' '
    call end_timing()
    print *,' '

  end do    ! k size_count

  do i = 1, n_types
    print 160, heading_1(i), (t_average(k,i), k=1,size_count)
160 format (a, 20f12.6 )
  end do

  close (20)

 end program ch3305

This is why I like more using:

do concurrent(j=1:ny,k=1:nk,i=1:nx)
 c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
end do 

I'd rather be doing cubic work :slight_smile: (a lot of this work is to demonstrate the capabilities for 3D-type loops; a vector-like loop will only be good if the dimension of the vector is long enough, of course).


@jorgeg
I still see the same trivial calculation per iteration.

I am attempting to use the following MATMUL as a test, but is it a valid do concurrent loop?

      z = 0.0_dp
      do concurrent (ii=1:n,jj=1:n,kk=1:n)
        z(ii,kk) = z(ii,kk) + x(ii,jj) * y(jj,kk)
      end do
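For what it is worth, iterations of the three-index form that share (ii,kk) but differ in jj all update the same z(ii,kk), so they are not independent, which do concurrent requires. A hedged race-free sketch keeps the summation index as an ordinary inner loop:

      ! Sketch: only the independent (ii,kk) pairs go in the
      ! concurrent header; the jj reduction stays sequential.
      z = 0.0_dp
      do concurrent (ii=1:n, kk=1:n)
        do jj = 1, n
          z(ii,kk) = z(ii,kk) + x(ii,jj) * y(jj,kk)
        end do
      end do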

The way OpenMP can help here is to use an if clause as a cutoff:

      !$omp parallel do if(n > 100000)
      do i = 1, n
        z(i) = x(i) + y(i)
      end do

The second alternative would be to use a “thread throttling” function get_optimal_nthreads (derived empirically) like:

      !$omp parallel do num_threads(get_optimal_nthreads(n))
      do i = 1, n
        z(i) = x(i) + y(i)
      end do
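A minimal sketch of what such a function might look like; the size cutoffs below are placeholders to be tuned empirically per machine, not measured values:

      ! Hypothetical helper: choose a thread count from the problem size.
      function get_optimal_nthreads(n) result(nt)
        use omp_lib, only: omp_get_max_threads
        integer, intent(in) :: n
        integer :: nt
        if (n < 100000) then
          nt = 1                              ! fork/join overhead dominates
        else if (n < 10000000) then
          nt = min(4, omp_get_max_threads())  ! partial parallelism
        else
          nt = omp_get_max_threads()          ! likely bandwidth-bound anyway
        end if
      end function get_optimal_nthreads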

Some of the performance libraries, like OpenBLAS, do such things; for instance: OpenBLAS/kernel/arm64/dot.c at f6df9bebbb4259aa61ab5634c0f1269fb152cc0e · OpenMathLib/OpenBLAS · GitHub.

It gets complicated to do this for multiple CPU architectures.


Would you know if there's a nice way to print out what the compiler is doing with the omp target / do concurrent loops, similar to NVIDIA's -Minfo=? I can get the optimization log, but it is 15k lines long and very, very detailed (this is using -Rpass-analysis or -Rpass). If there isn't, where can I make some noise to get something similar? It is very useful to see how things are being parallelized without having to look at the assembly, haha.
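For what it is worth, a few narrower report options that may help (worth verifying against your compiler version's documentation, as availability varies by release):

      flang/LLVM : -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
                   (restricting -Rpass to one pass name trims the log)
      gfortran   : -fopt-info-vec-optimized and -fopt-info-vec-missed
      ifx        : -qopt-report=3
      nvfortran  : -Minfo=all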

This is an interesting example. If using !$OMP, my preferred approach is:

!$OMP parallel do shared (alpha,a,b,c,ny,nk) private(j,k) schedule(static)
  do k=1,nk
   do j=1,ny
    c(:,j,k) =  alpha * a(:,j,k) + b(:,j,k)
   end do
  end do
!$OMP end parallel do

I am assuming :

  1. nk is larger than num_threads, so that collapse is not effective or required (I think collapse is over-used)
  2. the inner loop i=1:nx is replaced by array syntax, to highlight that AVX instructions would be used
  3. my explicit do order enhances sequential memory addressing, which I wonder whether do concurrent would always arrange (see the sketch after this list)
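A hedged do concurrent counterpart to the loop nest above; the standard leaves the iteration order unspecified, so whether a compiler picks the memory-friendly order is exactly the implementation-dependent point in question:

      ! Sketch: same kernel with a concurrent header instead of the
      ! explicit nest; the compiler may traverse (j,k) in any order.
      do concurrent (k = 1:nk, j = 1:ny)
        c(:,j,k) = alpha * a(:,j,k) + b(:,j,k)
      end do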

The problem with this example is again whether the calculation can be sufficient to overcome the OpenMP startup cost without demanding so much memory bandwidth across the threads that any gains stall, especially with dual-channel memory.

The other problem I see is that for AVX to be effective, the inner-loop vectors have to be in L1 cache to get high Gflop performance. This is best achieved if there is repeated use of these vectors.
It is very rare that all of these targets can be achieved.
Most do concurrent examples (as with most of my calculations) appear to target large memory vectors with trivial calculations, which makes it difficult to identify multi-thread efficiency.


My main development machine is a Dell Precision workstation. It has an Intel i9-10980XE processor (18 cores / 36 threads with hyper-threading).

@cmaapic thanks for the update. That processor has 4-channel memory, which might explain why your tests show some improvement with OpenMP for the test calculation, unlike my 2-channel memory processor.

Interesting example, but I am not completely sure that array syntax is effective at vectorization. I’ve found several cases where inner loops are faster than array syntax. Have you measured the performance of your approach (2 nested loops plus array syntax) vs 3 nested loops?


@aledinola If you had the following, with appropriate compiler options and all arrays contiguous, I would be confident that AVX would be used (i.e. all vectors are contiguous):
c(1:nx,j,k) = alpha * a(1:nx,j,k) + b(1:nx,j,k)
However, if there were non-contiguous vectors, with something like the following layout, then loading the L1 cache would incur delays:
c(1:nx,j,k) = alpha * a(j,1:nx,k) + b(1:nx,j,k)

It would be interesting to understand these cases, as probably the inner loop would also have been unrolled and replaced with AVX instructions.


I had prepared some time ago a comparison of loops vs array syntax in the context of a dynamic programming problem, see FortranVec/src/main.f90 at main · aledinola/FortranVec · GitHub

The loop-based code is in bellman_op, the code with array syntax is in bellman_op_vec and bellman_op_vec2. There was a discussion on this forum, see Performance of vectorized code in ifort and ifx

It turns out that this code

v_max = large_negative
ap_ind = 0
! Choose a' optimally by stepping through all possible values
do ap_c = 1, n_a
  aprime_val = ap_grid(ap_c)
  cons = R*a_val + z_val - aprime_val
  if (cons > 0.0d0) then
    v_temp = f_util(cons) + beta*EV(ap_c,z_c)
    !v_temp = f_util(cons) + beta*sum(v(ap_c,:)*z_tran(z_c,:))
    if (v_temp > v_max) then
      v_max = v_temp
      ap_ind = ap_c
    end if
  end if
end do ! end a'

is faster than this

cons = R*a_val + z_val - ap_grid ! (n_ap,1)
! NOTE: where and merge are slower than forall
! NOTE: forall and do concurrent are equivalent with ifort
! but do concurrent is very slow with ifx!
!where (cons>0.0d0)
!    util = f_util(cons)  ! (n_ap,1)
!elsewhere
!    util = large_negative
!end where
!util = merge(f_util(cons),large_negative,cons>0.0d0)
! v_temp = large_negative
! do concurrent (ap_c=1:n_a, cons(ap_c)>0.0d0)
!    v_temp(ap_c) = f_util(cons(ap_c))+beta*EV(ap_c,z_c)
! enddo
v_temp = large_negative
forall (ap_c=1:n_a, cons(ap_c)>0.0d0)
  v_temp(ap_c) = f_util(cons(ap_c)) + beta*EV(ap_c,z_c)
end forall
ap_ind = maxloc(v_temp, dim=1)

Please see my repo for more information (in the second block of code you can replace the forall with do concurrent, merge, or where if you don't like forall being obsolescent).

Back then (in 2024) we realized that there were also interesting performance differences between ifort and ifx (thanks to @ivanpribec for measuring running times appropriately), with ifort being significantly faster. It would be interesting to run this test again to see if ifx has caught up.

@aledinola has shown us that FORALL is sometimes faster than DO CONCURRENT. Could its obsolescent status be removed?


Thanks for your comment, @Harper. I would say that the example I posted above shows that, in that setting, array syntax is slower than loops. I'm not sure about the forall vs do concurrent comparison. There might be differences between ifort and ifx, and the last time I ran it was in 2024. It could be that more recent versions of ifx (or other compilers) have improved the implementation of do concurrent.

In general this can happen for several reasons. The one being discussed here is whether one form compiles to AVX instructions and the other doesn't. Another common reason is that array syntax requires evaluation as if the rhs of the expression were completely evaluated before assignment to the lhs. Sometimes the compiler decides that requires a temporary array allocation, which is a slow operation. DO CONCURRENT, on the other hand, tells the compiler that no such memory allocation is required, and it is the programmer's fault if that is not really the case. In my opinion, this was a mistake in the design of the array syntax when it was introduced in f90. There should have been a way for the programmer to use the convenient array syntax AND ALSO to tell the compiler that no memory allocation is required to evaluate the expression. At the time array syntax was added, there were various compiler directives already available for the programmer to tell the compiler exactly that. What was needed then was for the Fortran language to make that practice standard, ideally both with and without array syntax. The frustrating thing is that even after this design mistake was recognized, it was never corrected, and here it is again, now in the context of AVX instructions, causing problems and confusion.
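A concrete illustration of the temporary-array point (a hedged sketch, not taken from the posts above): with an overlapping lhs and rhs, the as-if-fully-evaluated rule can force a cautious compiler to materialise the rhs in a temporary, whereas an explicit loop written in a safe order needs none:

      ! Array syntax: the whole rhs is notionally evaluated first, so
      ! a compiler may allocate a temporary because x appears shifted
      ! on both sides.
      x(2:n) = x(1:n-1) + y(2:n)

      ! Equivalent explicit loop, run downwards so each x(i-1) is still
      ! unmodified when read: same result, no temporary required.
      do i = n, 2, -1
        x(i) = x(i-1) + y(i)
      end do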


The two code examples are different, although the faster loop has more computation?

The use of IF inside a DO loop can reduce AVX efficiency.
What is the probability that cons <= 0? If low, perhaps f_util could better manage this, i.e. adopt strategies to eliminate the IF and improve AVX efficiency.
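One hedged sketch of that strategy, pushing the guard into an elemental function so the caller's loop body is branch-free (the log form of the utility and the cutoff constant are assumptions for illustration; the repo's actual f_util may differ):

      ! Hypothetical branch-free variant of f_util. Clamping keeps the
      ! log argument positive even at infeasible points, and merge maps
      ! those points to large_negative so maxloc never selects them.
      elemental function f_util_safe(c) result(u)
        real(dp), intent(in) :: c   ! assumes the dp kind is in scope
        real(dp) :: u
        real(dp), parameter :: large_negative = -1.0e10_dp
        u = merge(log(max(c, tiny(1.0_dp))), large_negative, c > 0.0_dp)
      end function f_util_safe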


Thanks @aledinola. I was relying on the second Note in your program’s comments.

1 Like