I am new to parallelization, and I hope I'm not wasting anyone's time. I already asked a few friends who have used OpenMP, but they could not help me, so I figured my case could be interesting for someone else too, at least for educational purposes, and I tried to document it as well as I could. Below are two examples: one taken directly from Tim Mattson's OpenMP tutorials on YouTube, the other somewhat simplified but still a fairly standard approach, I think. In both cases the computation time scales with the number of threads for a small number of iterations, but for a very large number of iterations the computation time seems to converge to the same value regardless of the thread count. This is of course wrong: I would expect the computation time to be similar for a few iterations, and clearly better with more threads for a large number of iterations.
Here are the two examples, both compiled with
g++ -fopenmp main.cpp -o out
(thread model: posix, gcc version 4.8.4, Ubuntu 4.8.4-2ubuntu1~14.04) on Ubuntu 14.04, and with the following header:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#include <iostream>
using namespace std;
#define NUMBER_OF_THREADS 2
static long num_steps = 1000000000;
Now, the computer I'm working on has 8 cores (an Intel i7), so I would have expected any number of threads between 2 and 4 to bring a big advantage in terms of computation time.
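As a side note (not part of my original code), a quick sanity check of how many processors OpenMP actually sees, and how many threads it would use by default, is something like:

#include <omp.h>
#include <stdio.h>

int main() {
    // how many processors OpenMP sees, and the default thread-team size
    printf("procs = %d, max threads = %d\n",
           omp_get_num_procs(), omp_get_max_threads());
    return 0;
}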
Example 1:
int main() {
    omp_set_num_threads(NUMBER_OF_THREADS);
    double step = 1.0/(double) num_steps, pi = 0.0;

    auto begin = chrono::high_resolution_clock::now();

    #pragma omp parallel
    {
        int i, ID, nthrds;
        double x, sum = 0;
        ID = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        // each thread takes every nthrds-th iteration (cyclic distribution)
        for (i = ID; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            sum = sum + 4.0/(1.0 + x*x);
        }
        // combine the per-thread partial sums one thread at a time
        #pragma omp critical
        pi += step*sum;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << chrono::duration_cast<chrono::nanoseconds>(end - begin).count()/1e6 << "ms\n";
    return 0;
}
Example 2:
int main() {
    omp_set_num_threads(NUMBER_OF_THREADS);
    double pi = 0, sum = 0;
    const double step = 1.0/(double) num_steps;

    auto begin = chrono::high_resolution_clock::now();

    // #pragma omp parallel
    {
        // reduction(+:sum) gives each thread a private copy of sum
        // and combines them when the loop finishes
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < num_steps; i++) {
            double x = (i + 0.5)*step;
            sum += 4.0/(1.0 + x*x);
        }
    }
    pi += step*sum;
    auto end = chrono::high_resolution_clock::now();
    cout << chrono::duration_cast<chrono::nanoseconds>(end - begin).count()/1e6 << "ms\n";
    return 0;
}
Initially I thought that example 2 was slowed down by the reduction on the variable, which disturbs the parallelization, but in example 1 there is almost nothing shared between threads. Let me know if I'm doing something really dumb, or if I can specify more aspects of the problem. Thanks to all.
Comments:
- Use omp_get_wtime(), not clock(): the former returns the elapsed wall time, while the latter returns the CPU time of the current thread and its children.
- Try #pragma omp parallel for simd reduction(+:sum), I think. That should vectorize the loop as well, without the fast-math option. Fast math enables many other optimizations and applies to the entire translation unit, which may or may not be desirable, whereas omp simd just allows associative math, and only for the parallel block it applies to.
- The reduction loop will only be vectorized if omp simd is used or fast math is enabled. On the other hand, OpenMP already assumes associative math, otherwise it would not be able to parallelize the loop.
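Putting those two suggestions together, a minimal sketch of the second example timed with omp_get_wtime() and using the simd reduction might look like this (assuming a compiler with OpenMP 4.0 support; the simd clause may not be available in GCC 4.8):

#include <omp.h>
#include <stdio.h>

#define NUMBER_OF_THREADS 2
static long num_steps = 1000000000;

int main() {
    omp_set_num_threads(NUMBER_OF_THREADS);
    const double step = 1.0/(double) num_steps;
    double sum = 0.0;

    double begin = omp_get_wtime();   // elapsed wall time, not CPU time

    // simd lets the compiler vectorize the reduction without -ffast-math
    #pragma omp parallel for simd reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5)*step;
        sum += 4.0/(1.0 + x*x);
    }

    double pi = step*sum;
    double end = omp_get_wtime();

    printf("pi = %.10f, time = %.3f ms\n", pi, (end - begin)*1e3);
    return 0;
}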