Performance, C vs. Fortran

Actually, I ran it again a few times, and now Fortran seems faster. Now it seems to converge in 320k iterations (for 250x250) compared to C’s 304k, but is still faster by a bit less than a second. Also, I used epsilon = 8.85E-12, with the smaller epsilon it converges in 100k less iterations and is then obviously much faster.

Also, the C code becomes a few seconds slower if you allocate on the heap. Not sure how Fortran handles that behind the scenes (big arrays without allocate), but for a realistic numerical program you’d probably want to malloc in C.

1 Like

I’d be willing to give it a shot, if there are some guidelines on how to write these benchmarks. I also have a Cython version of this code, as well as pure python, and it would of course trivial to write a NumPy array slicing implementation.

Also, I got the looping Fortran version to run in 15 seconds now for 250x250, by following @implicitall 's lead and removing the useless temp array as well as precalculating rho array.This is now some 4 seconds faster than C with malloc.

2 Likes

We don’t have those yet, but I think it’s better to simply get started and go from there. Why don’t you submit a PR against the benchmarks repository, just create some directory and put the benchmarks there. Thanks!

Allright, I’ll get it done today, and specific format can be discussed later. Cheers!

2 Likes

Just a word of caution. Using -Ofast can have unexpected compute results in certain cases (loss of accuracy). I never use it. I generally use -O2 and sometimes -O3. Speed is not everything.

4 Likes

A nicely illustrated post by Matt Ferraro, discussed on Hacker News, covers methods to solve the Poisson equation, providing code snippets in Julia.

4 Likes

I just looked at the Poisson code today, and I am sorry to point out that the code solves the Poisson equation on a rectangle with an internal heat source on one interior rectangular patch, and an internal heat sink on another interior rectangular patch. So far, no major complaint. Unfortunately, there are no boundary conditions applied, which makes the problem ill posed. There is no way to judge whether the iterations, if/when they converge, produced a temperature distribution with any meaning or significance.

1 Like

@mecej4 are you talking about this code: benchmarks/poisson2d/naive.f90 at dea825a0ff4ae4e3d8fd1199c13170db05f0d005 · fortran-lang/benchmarks · GitHub ?

If so, it imposes zero Dirichlet condition at the boundary on the line 32:

phi(:,:) = 0.0_dp

Which is kept there in all iterations by skipping the boundary:

    do i=2, M-1
        do j=2, M-1

So I think the code is solving a well-posed problem.

1 Like

Ah, I see now,thanks. On the boundaries, the value is never changed because the loops run from 2 to M-1.

1 Like

Regarding the original question, the big thing I noticed was already mentioned, namely that you are looping through rows of arrays rather than columns.

The C swap() function looks odd. Why is swp[101][101] an array? Shouldn’t that be just a local scalar value? The same holds for the fortran version, why is temp(M,M) used to swap the two arrays, why not just swap them with a scalar temporary variable? Or, here is another idea. You could have three allocatable arrays, temp(:,:), phi(:,:), and phiprime(:,:). The swap operation can then be done as

call move_alloc(from=phi,      to=temp)
call move_alloc(from=phiprime, to=phi)
call move_alloc(from=temp,     to=phiprime)

This requires no actual data movement at all, and only two of the arrays are actually allocated at any moment in time. Is that what you are calling an unholy trick? This is fairly clear and is a normal thing in fortran. You should NOT do this instead with pointer arrays. Pointer arrays would prevent the compiler from optimizing the loops.

The expression a**2.0_dp should be written as a**2 or as a*a. This is just a minor part of the calculation, but in general floating point exponents should be avoided when possible.

I see several odd things about the rho() function. It is evaluated repeatedly within two nested loops and within the do while loop. However its values are all determined just from the value of the parameter constants a and epsilon0. In fact, the whole expression a2/4.0_dp/epsilon0*rho(i*a,j*a) looks like it could be a sparse array indexed by i and j and evaluated once outside of the do while loop. If I’m looking at the problem correctly, only 4% of the array has rho()==1.0, only 4% of the array has rho()==-1.0, and the rest of the array is zero. If that is right, then perhaps the calculation can be rearranged to evaluate just the part within the square where rho()==1.0, and just the part where rho()==-1.0, and ignore the part where rho()==0.0. This could be done, for example, by identifying the do loop ranges for the two squares (outside the do while loop), and using those ranges to update the phiprime(:,:) array.

real(dp), parameter :: rhovalue = a * a / 4.0_dp / epsilon0
...
phiprime(ilow1:ihigh1,jlow1:jhigh1) = phiprime(ilow1:ihigh1,jlow1:jhigh1) + rhovalue
phiprime(ilow2:ihigh2,jlow2:jhigh2) = phiprime(ilow2:ihigh2,jlow2:jhigh2) - rhovalue