I am new to parallelization, and I hope I'm not wasting anyone's time. I already asked a few friends who have used OpenMP, but they could not help me, so I figured my case could be interesting for someone else too, at least for educational purposes, and I tried to document it as well as I could. Below are two examples: one taken directly from Tim Mattson's OpenMP tutorials on YouTube, the other somewhat simplified but still a fairly standard approach, I think. In both cases the computation time scales with the number of threads for a small number of iterations, but for a very large number of iterations the computation time seems to converge to the same value regardless of the thread count. This is of course wrong: I would expect the computation time to be similar for a few iterations, and clearly better with more threads for a large number of iterations.
Here are the two examples, both compiled with
g++ -fopenmp main.cpp -o out
(thread model: posix, gcc version 4.8.4, Ubuntu 4.8.4-2ubuntu1~14.04) on Ubuntu 14.04, and with the following header:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#include <iostream>
using namespace std;
#define NUMBER_OF_THREADS 2
static long num_steps = 1000000000;
Now, the computer I'm working on has 8 cores (an Intel i7), so I would have expected any number of threads between 2 and 4 to bring a big advantage in terms of computation time.
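As a side note (not part of my original code), a quick sanity check of how many processors OpenMP actually sees, and how many threads it would use by default, is something like:

#include <omp.h>
#include <stdio.h>

int main() {
    // how many processors OpenMP sees, and the default thread-team size
    printf("procs = %d, max threads = %d\n",
           omp_get_num_procs(), omp_get_max_threads());
    return 0;
}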
Example 1:
int main() {
    omp_set_num_threads(NUMBER_OF_THREADS);
    double step = 1.0/(double) num_steps, pi = 0.0;

    auto begin = chrono::high_resolution_clock::now();

    #pragma omp parallel
    {
        int i, ID, nthrds;
        double x, sum = 0;
        ID = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        // each thread takes every nthrds-th iteration (cyclic distribution)
        for (i = ID; i < num_steps; i = i + nthrds) {
            x = (i + 0.5)*step;
            sum = sum + 4.0/(1.0 + x*x);
        }
        // combine the per-thread partial sums one thread at a time
        #pragma omp critical
        pi += step*sum;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << chrono::duration_cast<chrono::nanoseconds>(end - begin).count()/1e6 << "ms\n";
    return 0;
}
Example 2:
int main() {
    omp_set_num_threads(NUMBER_OF_THREADS);
    double pi = 0, sum = 0;
    const double step = 1.0/(double) num_steps;

    auto begin = chrono::high_resolution_clock::now();

    // #pragma omp parallel
    {
        // reduction(+:sum) gives each thread a private copy of sum
        // and combines them when the loop finishes
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < num_steps; i++) {
            double x = (i + 0.5)*step;
            sum += 4.0/(1.0 + x*x);
        }
    }
    pi += step*sum;
    auto end = chrono::high_resolution_clock::now();
    cout << chrono::duration_cast<chrono::nanoseconds>(end - begin).count()/1e6 << "ms\n";
    return 0;
}
Initially I thought that example 2 was slowed down by the reduction on the variable, which disturbs the parallelization, but in example 1 there is almost nothing shared between threads. Let me know if I'm doing something really dumb, or if I can specify more aspects of the problem. Thanks to all.
Comments:
- Use omp_get_wtime(), not clock(): the former returns the elapsed wall time, while the latter returns the CPU time of the current thread and its children.
- Try #pragma omp parallel for simd reduction(+:sum), I think. That should vectorize the loop as well, without the fast-math option. Fast math enables many other optimizations and applies to the entire translation unit, which may or may not be desirable, whereas omp simd just allows associative math, and only for the parallel block it applies to.
- The reduction loop will only be vectorized if omp simd is used or fast math is enabled. On the other hand, OpenMP already assumes associative math, otherwise it would not be able to parallelize the loop.
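Putting those two suggestions together, a minimal sketch of the second example timed with omp_get_wtime() and using the simd reduction might look like this (assuming a compiler with OpenMP 4.0 support; the simd clause may not be available in GCC 4.8):

#include <omp.h>
#include <stdio.h>

#define NUMBER_OF_THREADS 2
static long num_steps = 1000000000;

int main() {
    omp_set_num_threads(NUMBER_OF_THREADS);
    const double step = 1.0/(double) num_steps;
    double sum = 0.0;

    double begin = omp_get_wtime();   // elapsed wall time, not CPU time

    // simd lets the compiler vectorize the reduction without -ffast-math
    #pragma omp parallel for simd reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5)*step;
        sum += 4.0/(1.0 + x*x);
    }

    double pi = step*sum;
    double end = omp_get_wtime();

    printf("pi = %.10f, time = %.3f ms\n", pi, (end - begin)*1e3);
    return 0;
}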