
I am pretty new to OpenMP. I am trying to parallelize a nested loop using tasking, but it doesn't give me the correct counter output. The sequential output is "Total pixel = 100000000". Can anyone help me with that?

Note: I have done this using #pragma omp parallel for reduction (+:pixels_inside) private(i,j) and it works fine. Now I want to use tasking.

What I have tried so far:

#include<iostream>
#include<omp.h>
using namespace std;

int main(){
    int total_steps = 10000;

    int i,j;
    int pixels_inside=0;
    omp_set_num_threads(4);
    //#pragma omp parallel for reduction (+:pixels_inside) private(i,j)
    #pragma omp parallel
    #pragma omp single private(i)
    for(i = 0; i < total_steps; i++){
        #pragma omp task private(j)
        for(j = 0; j < total_steps; j++){
            pixels_inside++;
        }
    }

    cout<<"Total pixel = "<<pixels_inside<<endl;
    return 0;
}

2 Answers


First of all you need to declare to OpenMP which variables you are using and what protection they have. Generally speaking your code has default(shared), as you didn't specify otherwise. This makes all variables accessible, at the same memory location, to all threads. You should use something like this:

#pragma omp parallel default(none) shared(total_steps, pixels_inside) private(i, j)
[...]
#pragma omp task private(j) default(none) shared(total_steps, pixels_inside)

Now only the variables you list are available to the threads, with the sharing you intend.

Secondly, the main problem is that you don't have critical section protection. What this means is that when the threads are running they may want to use the shared variable at the same time, and a race condition happens. For example, you have threads A and B with a variable x accessible to both (a.k.a. a shared variable). Now let's say A adds 2 and B adds 3 to the variable. The threads don't run at the same speed, so this may happen: A reads x=0, B reads x=0, A computes 0+2, B computes 0+3, B writes x=3 back to memory, A writes x=2 back to memory. In the end x = 2. The same thing happens with pixels_inside: each thread reads the variable, adds 1 and writes it back to where it got it from. To overcome this you use measures that ensure critical section protection:

#pragma omp critical
{
    //Code with shared memory
    pixels_inside++;
}

You didn't need critical section protection with reduction, because variables listed in the reduction clause have this protection built in.

Now your code should look like this:

#include <iostream>
#include <omp.h>
using namespace std;

int main() {
    int total_steps = 10000;

    int i,j;
    int pixels_inside=0;
    omp_set_num_threads(4);
    //#pragma omp parallel for reduction (+:pixels_inside) private(i,j)
    #pragma omp parallel default(none) shared(total_steps, pixels_inside) private(i, j)
    #pragma omp single private(i)
    for(i = 0; i < total_steps; i++){
        #pragma omp task private(j) default(none) shared(total_steps, pixels_inside)
        for(j = 0; j < total_steps; j++){
            #pragma omp critical
            {
                pixels_inside++;
            }
        }
    }

    cout<<"Total pixel = "<<pixels_inside<<endl;
    return 0;
}

That said, I would suggest using reduction, as it performs better and the runtime has methods to optimize that kind of calculation.
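For reference, this is roughly what the reduction variant already hinted at in the question looks like in full (a minimal sketch, not taken verbatim from the question):

#include <iostream>
#include <omp.h>
using namespace std;

int main() {
    int total_steps = 10000;
    int pixels_inside = 0;

    omp_set_num_threads(4);

    // Each thread accumulates into its own private copy of pixels_inside;
    // the private copies are combined once at the end of the loop.
    #pragma omp parallel for reduction(+:pixels_inside) default(none) shared(total_steps)
    for (int i = 0; i < total_steps; i++) {
        for (int j = 0; j < total_steps; j++) {
            pixels_inside++;
        }
    }

    cout << "Total pixel = " << pixels_inside << endl;
    return 0;
}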


3 Comments

Thanks for your answer. This really makes sense to me. But it seems that the performance is not really good. What is the best way to solve this kind of problem? Please give me some suggestions.
I would use this: a two-level reduction is less of a bottleneck than a single critical section. #pragma omp parallel for reduction(+:pixels_inside) default(none) shared(total_steps) for(i = 0; i < total_steps; i++) { int private_sum = 0; #pragma omp parallel for reduction(+:private_sum) default(none) shared(total_steps) for(j = 0; j < total_steps; j++){ private_sum++; } pixels_inside += private_sum; }
A critical section for just an increment is awful. Using atomic updates is significantly better (sketched below). Still, both are executed sequentially on most architectures and are slower than sequential code because of cache line bouncing between cores. Indeed, a reduction is much better. Note that the code is barely readable as a comment and I think editing the answer would be much better (for future readers).
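For completeness, the atomic update mentioned in the last comment would replace the critical section in the answer above roughly like this (a minimal sketch; the increments are still serialized, just with less overhead than a critical section):

#pragma omp task private(j) default(none) shared(total_steps, pixels_inside)
for (j = 0; j < total_steps; j++) {
    // Atomic read-modify-write on the shared counter: lighter than a
    // critical section, but every increment still touches the same cache line.
    #pragma omp atomic update
    pixels_inside++;
}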

As @tartarus already explained, you have a race condition in your code and it is much better to avoid it by using reduction. If you want to do the same as #pragma omp parallel for reduction (+:pixels_inside) private(i,j) does, but using tasks, you have to use the following:

    #pragma omp parallel 
    #pragma omp single    
    #pragma omp taskloop reduction (+:pixels_inside) private(i,j)
    for(i = 0; i < total_steps; i++){
        for(j = 0; j < total_steps; j++){
            pixels_inside++;
        }
    }

In this version fewer tasks are created and a reduction is used instead of a critical section, so the performance will be much better (similar to what you can obtain by using #pragma omp parallel for).

UPDATE (comment on performance): I guess this is just a simplified example, not your real code to parallelize. If the performance gain is not good enough, it most probably means that the parallel overhead is bigger than the work to be done. In this case try to parallelize a bigger part of your code. Note that parallel overheads are typically bigger with tasks (compared to #pragma omp parallel for).

3 Comments

Note that some OpenMP runtimes are not very clever and generate one task per loop iteration, which can hurt performance a lot. Fortunately, the granularity of the taskloop can be controlled using additional clauses: grainsize and num_tasks (see the sketch after these comments).
Thanks for the clarification. Which OpenMP runtime generates only one task per iteration? Using recent GCC and Clang I did not notice such an issue.
Indeed! I think it was ICC (or possibly GCC) a few years ago. Such behavior is generally implemented in the runtime and not the compiler (at least for GCC, Clang and ICC). ICC uses libomp like Clang, so they have probably improved that since. I see that they have made some changes quite "recently" to the scheduling of parallel loops (including taskloops). Glad to see they improved the performance of taskloops :)
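For reference, the grainsize / num_tasks clauses mentioned above attach directly to the taskloop directive. A minimal sketch (the chunk size of 100 is arbitrary and only for illustration):

// Group roughly 100 outer iterations into each task instead of relying on
// the runtime's default granularity.
#pragma omp parallel
#pragma omp single
#pragma omp taskloop reduction(+:pixels_inside) grainsize(100)
for (int i = 0; i < total_steps; i++) {
    for (int j = 0; j < total_steps; j++) {
        pixels_inside++;
    }
}

// Or cap the total number of tasks instead (grainsize and num_tasks are
// mutually exclusive):
// #pragma omp taskloop reduction(+:pixels_inside) num_tasks(4 * omp_get_max_threads())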
