Regarding the original question, the big thing I noticed was already mentioned, namely that you are looping through rows of arrays rather than columns.
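To make the loop-order point concrete: Fortran stores arrays in column-major order, so the first index should vary fastest, which means it belongs in the innermost loop. The following is only a sketch. The array bounds, the value of m, and the four-point averaging update are assumptions for illustration and are not taken from the original code.

```fortran
program loop_order
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: m = 100        ! hypothetical grid size
   real(dp) :: phi(0:m,0:m), phiprime(0:m,0:m)
   integer :: i, j

   call random_number(phi)
   phiprime = phi

   do j = 1, m-1                        ! outer loop over the second index
      do i = 1, m-1                     ! inner loop over the first (contiguous) index
         phiprime(i,j) = 0.25_dp * (phi(i+1,j) + phi(i-1,j) &
                                  + phi(i,j+1) + phi(i,j-1))
      end do
   end do

   print *, 'sample value: ', phiprime(m/2, m/2)
end program loop_order
```

With the loops in the opposite order, consecutive inner iterations would be m+1 elements apart in memory, which tends to use the cache poorly.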
The C swap() function looks odd. Why is swp[101][101] an array? Shouldn't that be just a local scalar value? The same holds for the Fortran version: why is temp(M,M) used to swap the two arrays, rather than swapping them element by element with a scalar temporary variable? Or, here is another idea. You could have three allocatable arrays, temp(:,:), phi(:,:), and phiprime(:,:). The swap operation can then be done as
call move_alloc(from=phi, to=temp)
call move_alloc(from=phiprime, to=phi)
call move_alloc(from=temp, to=phiprime)
This requires no actual data movement at all, and only two of the arrays are allocated at any moment. Is that what you are calling an unholy trick? It is fairly clear and is a normal thing in Fortran. You should NOT do this with pointer arrays instead. Pointer arrays can alias other variables, which can prevent the compiler from optimizing the loops.
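Here is a small self-contained sketch of that swap, in case it helps to see it run. The grid size and the fill values are made up for the demonstration; only the three move_alloc calls come from the suggestion above.

```fortran
program swap_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: m = 4          ! hypothetical grid size, demo only
   real(dp), allocatable :: phi(:,:), phiprime(:,:), temp(:,:)

   allocate(phi(0:m,0:m), phiprime(0:m,0:m))
   phi      = 1.0_dp                    ! arbitrary marker values
   phiprime = 2.0_dp

   ! Exchange the two arrays by handing over their allocations.
   call move_alloc(from=phi, to=temp)        ! phi is now unallocated
   call move_alloc(from=phiprime, to=phi)    ! phiprime is now unallocated
   call move_alloc(from=temp, to=phiprime)   ! temp is now unallocated

   ! The contents have traded places and the bounds are preserved.
   if (phi(0,0) /= 2.0_dp .or. phiprime(0,0) /= 1.0_dp) error stop 'swap failed'
   if (allocated(temp)) error stop 'temp should be unallocated'

   print *, 'phi(0,0)      = ', phi(0,0)
   print *, 'phiprime(0,0) = ', phiprime(0,0)
   print *, 'lower bounds of phi: ', lbound(phi)
end program swap_demo
```

Note that temp never needs its own allocate statement. It only ever holds an allocation handed over from phi.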
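The expression a**2.0_dp should be written as a**2 or as a*a. This is just a minor part of the calculation, but in general floating-point exponents should be avoided when possible: an integer power can be expanded into multiplications, whereas a real exponent may be evaluated through a general and slower power routine.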
I see several odd things about the rho() function. It is evaluated repeatedly within two nested loops and within the do while loop. However, its values are determined entirely by the grid indices and the parameter constants a and epsilon0, so they never change between iterations. In fact, the whole expression a**2/4.0_dp/epsilon0*rho(i*a,j*a) looks like it could be a sparse array indexed by i and j and evaluated once outside of the do while loop.
If I'm looking at the problem correctly, only 4% of the array has rho()==1.0, only 4% of the array has rho()==-1.0, and the rest of the array is zero. If that is right, then perhaps the calculation can be rearranged to evaluate just the part within the square where rho()==1.0, and just the part where rho()==-1.0, and ignore the part where rho()==0.0. This could be done, for example, by identifying the do loop ranges for the two squares (outside the do while loop), and using those ranges to update the phiprime(:,:) array.
real(dp), parameter :: rhovalue = a * a / 4.0_dp / epsilon0
...
phiprime(ilow1:ihigh1,jlow1:jhigh1) = phiprime(ilow1:ihigh1,jlow1:jhigh1) + rhovalue
phiprime(ilow2:ihigh2,jlow2:jhigh2) = phiprime(ilow2:ihigh2,jlow2:jhigh2) - rhovalue
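To check that the two array-section updates above really do match a point-by-point evaluation, something like the following sketch could be used. Everything specific in it is an assumption: the grid size m, the spacing a, the index ranges of the two squares, and the stand-in function rho_index (which takes grid indices rather than coordinates, unlike the original rho()) are hypothetical and would have to be replaced with the values from the actual problem.

```fortran
program rho_ranges
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: m = 100                        ! hypothetical grid size
   real(dp), parameter :: a = 0.01_dp                   ! hypothetical grid spacing
   real(dp), parameter :: epsilon0 = 8.8541878128e-12_dp
   real(dp), parameter :: rhovalue = a * a / 4.0_dp / epsilon0
   ! Hypothetical index ranges of the two charged squares (20 x 20 points each).
   integer, parameter :: ilow1 = 60, ihigh1 = 79, jlow1 = 20, jhigh1 = 39
   integer, parameter :: ilow2 = 20, ihigh2 = 39, jlow2 = 60, jhigh2 = 79
   real(dp) :: direct(0:m,0:m), ranged(0:m,0:m)
   integer :: i, j

   direct = 0.0_dp
   ranged = 0.0_dp

   ! Reference version: evaluate the charge density at every interior point.
   do j = 1, m-1
      do i = 1, m-1
         direct(i,j) = direct(i,j) + rhovalue * rho_index(i, j)
      end do
   end do

   ! Range version: touch only the two squares where the density is nonzero.
   ranged(ilow1:ihigh1,jlow1:jhigh1) = ranged(ilow1:ihigh1,jlow1:jhigh1) + rhovalue
   ranged(ilow2:ihigh2,jlow2:jhigh2) = ranged(ilow2:ihigh2,jlow2:jhigh2) - rhovalue

   if (any(direct /= ranged)) error stop 'range update differs from direct evaluation'
   print *, 'points with rho = +1: ', count(ranged > 0.0_dp)
   print *, 'points with rho = -1: ', count(ranged < 0.0_dp)
   print *, 'total grid points:    ', size(ranged)

contains

   ! Hypothetical stand-in for rho(): +1 in one square, -1 in the other, 0 elsewhere.
   pure function rho_index(i, j) result(r)
      integer, intent(in) :: i, j
      real(dp) :: r
      r = 0.0_dp
      if (i >= ilow1 .and. i <= ihigh1 .and. j >= jlow1 .and. j <= jhigh1) r = 1.0_dp
      if (i >= ilow2 .and. i <= ihigh2 .and. j >= jlow2 .and. j <= jhigh2) r = -1.0_dp
   end function rho_index

end program rho_ranges
```

With these made-up ranges each square covers roughly 4% of the grid, so the range version does a small fraction of the work of the full double loop, and the index ranges only need to be worked out once, before the do while loop starts.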