
I have written some programs with the OpenMP reduction clause in Fortran and in C, but the data types were simple (int, float, fixed-size arrays, ...) and the reduction identifiers were the implicitly declared ones (+, *, - ...). Now I would like to go a step further and understand in depth how OpenMP reduction works. All I know is that each thread gets a local copy and, at the end of the parallel part, all the local copies are combined.
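
For example, this is the kind of simple scalar reduction I mean (a minimal sketch; compile with -fopenmp):

#include <stdio.h>

int main(void) {
  int sum = 0;
  // Each thread gets a private copy of sum, initialized to 0 (the
  // identity of +); the copies are combined when the loop finishes.
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < 1024; i++)
    sum += i;
  printf("%d\n", sum); // prints 523776
  return 0;
}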

Consider the following single-threaded program (a toy example):

// Toy example : Non Parallel Version
#include <stdlib.h>
#include <stdio.h>

void my_add(int *res, int arr_sz) {
  for (int i=0; i<arr_sz; i++)
    res[i] += i;
}

int main(void) {

  int arr_sz=10; // Input parameter in real program
  int *my_array;
  my_array = malloc (sizeof(int)*arr_sz);
  for (int i=0; i<arr_sz; i++) my_array[i]=0; // init array
        
  for (int iloop=0; iloop<1024; iloop++) 
    my_add(my_array, arr_sz);

  for (int i=0; i<arr_sz; i++) printf("%d : %d\n",i,my_array[i]); // print array
  free(my_array); 

  return 0;
}

I would like to parallelize the for loop using a user-defined reduction on a dynamically allocated array with a user-supplied reduction function, but I have run into many problems. Here is an example of code to discuss (of course it does not compile):

// Toy example : Parallel Version
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

void my_add(int *res, int arr_sz) {
  for (int i=0; i<arr_sz; i++)
    res[i] += i;
}

int main(void) {

  int arr_sz=10; // Input parameter in real program
  int *my_array;
  my_array = malloc (sizeof(int)*arr_sz);
  for (int i=0; i<arr_sz; i++) my_array[i]=0; // init array
  
  #pragma omp declare reduction (my_omp_red : int* : my_add(omp_out) ) initializer (omp_priv=omp_orig)
  // How to add parameters other than omp_out and omp_in in the reduction function ?
  // How to write the initializer, because my_array is a pointer that has been allocated ?
  #pragma omp parallel 
  {
  #pragma omp for reduction (my_omp_red:my_array)
  for (int iloop=0; iloop<1024; iloop++) {
    my_add(my_array, arr_sz);
  }
  
  } // End of omp parallel
  
  for (int i=0; i<arr_sz; i++) printf("%d : %d\n",i,my_array[i]); // print array
  free(my_array); 

  return 0;
}

The reduction identifier is my_omp_red and the typename-list is int* because I would like to work on a dynamically allocated array.

First, I do not know how to write the combiner, because the function (my_add) takes a parameter (arr_sz) and it seems that only omp_out and omp_in are allowed. This is a toy example, but the question is generic: how do you write a combiner that calls a function with extra parameters?

Also, I do not know how to write the initializer, because my_array is only a pointer, not an array. So I suppose the local copy will be a pointer, and I imagine an allocation is needed in the parallel part of the code? But then how is the pointer freed? Etc., etc. To be honest, I am a little bit lost.

Thanks for your help.

EDIT 1

I have written a code that answers the questions above, but without the reduction directive. Instead, a critical directive is used and the array is allocated separately in each thread.

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

void my_add(int *res, int arr_sz) {
  for (int i=0; i<arr_sz; i++)
    res[i] += i;
}

int main(void) {

  int arr_sz=10; // Input parameter in real program
  int *my_array;
  my_array = malloc (sizeof(int)*arr_sz);
  for (int i=0; i<arr_sz; i++) my_array[i]=0; // init array
  
  #pragma omp parallel 
  {
  int *local_array = malloc (sizeof(int)*arr_sz);
  for (int i=0; i<arr_sz; i++) local_array[i]=0; // init local_array
  
  #pragma omp for
  for (int iloop=0; iloop<1024; iloop++) {
    my_add(local_array, arr_sz);
  }
  
  #pragma omp critical
  {
  for (int i=0; i<arr_sz; i++) my_array[i]+=local_array[i];
  }
  
  free(local_array);
  } // End of omp parallel
  
  for (int i=0; i<arr_sz; i++) printf("%d : %d\n",i,my_array[i]); // print array
  free(my_array); 

  return 0;
}

I wonder if this code could be written with the reduction directive.

2 Answers

Combining dynamically allocated arrays is possible with a normal reduction. A reduction in OpenMP always works element-wise on arrays. A user-defined reduction would be necessary if you want to combine elements which are not base types, or if you want a combine operation that doesn't match the provided operators.
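
For illustration (a sketch, with made-up names cplx and cmul), a user-defined reduction for a type with no built-in operator could be declared like this:

typedef struct { double re, im; } cplx;

/* Combiner: complex multiplication, which no built-in operator provides.
   The initializer sets each private copy to the identity, 1 + 0i. */
#pragma omp declare reduction (cmul : cplx :                               \
    omp_out = (cplx){ omp_out.re * omp_in.re - omp_out.im * omp_in.im,    \
                      omp_out.re * omp_in.im + omp_out.im * omp_in.re })  \
    initializer (omp_priv = (cplx){ 1.0, 0.0 })

/* Usage: #pragma omp parallel for reduction(cmul:prod) */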

Here is the version of your code using reduction:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

void my_add(int *res, int arr_sz) {
  for (int i=0; i<arr_sz; i++)
    res[i] += i;
}

int main(void) {

  int arr_sz=10; // Input parameter in real program
  int *my_array;
  my_array = malloc (sizeof(int)*arr_sz);
  for (int i=0; i<arr_sz; i++) my_array[i]=0; // init array
  
  #pragma omp parallel for reduction(+:my_array[:arr_sz])
  for (int iloop=0; iloop<1024; iloop++) {
    my_add(my_array, arr_sz);
  }
  
  for (int i=0; i<arr_sz; i++) printf("%d : %d\n",i,my_array[i]); // print array
  free(my_array); 

  return 0;
}
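
Note that reductions over array sections such as my_array[:arr_sz] require OpenMP 4.5 or later, and for a pointer the length must be given explicitly. If I am not mistaken, the clause also works with the pragmas split as in your original code:

#pragma omp parallel
{
  /* The implicit barrier at the end of the for completes the reduction. */
  #pragma omp for reduction(+:my_array[:arr_sz])
  for (int iloop = 0; iloop < 1024; iloop++) {
    my_add(my_array, arr_sz);
  }
}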

1 Comment

Thanks a lot for the answer. My toy example is too simple. I will think of another toy example where the built-in + cannot be used, where memory allocation must be taken into account and where the function must have parameters.

My toy example is too simple. I will think of another toy example where the built-in + cannot be used, where memory allocation must be taken into account and where the function must have parameters.

Here is a version that does that.

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

void my_add(int *res, int arr_sz) {
  for (int i=0; i<arr_sz; i++)
    res[i] += i;
}

struct ReductionArgs
{
  int *res;
  int arr_sz;
};
static void init_reduction(struct ReductionArgs* out, struct ReductionArgs in)
{
  out->res = calloc(in.arr_sz, sizeof(int));
  out->arr_sz = in.arr_sz;
}
static void add_reduction(struct ReductionArgs* inout, struct ReductionArgs in)
{
  for(int i = 0; i < in.arr_sz; ++i)
    inout->res[i] += in.res[i];
  free(in.res);
}

#pragma omp declare reduction \
  (my_add_reduction : struct ReductionArgs : (add_reduction(&omp_out, omp_in))) \
  initializer(init_reduction(&omp_priv, omp_orig))

int main(void) {

  int arr_sz=10; // Input parameter in real program
  int *my_array = calloc(arr_sz, sizeof(int));

  struct ReductionArgs reduced = { my_array, arr_sz };
# pragma omp parallel for reduction(my_add_reduction:reduced) 
  for(int iloop=0; iloop<1024; iloop++) {
    my_add(reduced.res, arr_sz);
  }
  
  for (int i=0; i<arr_sz; i++) printf("%d : %d\n",i,my_array[i]); // print array
  free(my_array); 
  
  return 0;
}

Explanation

The order of execution and the life cycle of the variables take a bit of explanation. The standard is not particularly clear to me, so I verified the actual implementations (both gcc with libgomp and clang with libomp):

  1. Every thread, including the master thread, creates a new object struct ReductionArgs omp_priv. This omp_priv becomes the struct ReductionArgs reduced used by each thread inside the parallel section.
  2. If there were no initializer clause, these objects would be initialized like static variables, meaning zero-initialization. Since we have an initializer, each thread instead calls init_reduction(&omp_priv, omp_orig), where omp_orig is the struct ReductionArgs reduced that we declared outside the parallel section.
  3. Each thread runs its share of the loop.
  4. The reduction step add_reduction(&omp_out, omp_in) is called once per thread with the respective local object as omp_in and some object as omp_out. It is not specified which thread(s) call this or whether they all accumulate into the same omp_out. From what I can tell, in libgomp each thread adds its own value to the original. In libomp there is some sort of hierarchical reduction where some threads accumulate values from their neighbors. That should improve parallelism and cache efficiency (an instrumented sketch to observe this follows the list).
  5. If the parallel section continues after the reduction, all threads now share the reduction result in the original variable.
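
One way to observe the combine pattern on a given implementation is to log the calling thread inside the combiner (a sketch with the made-up name add_reduction_verbose; it assumes stdio.h and omp.h are included as above):

static void add_reduction_verbose(struct ReductionArgs* inout, struct ReductionArgs in)
{
  /* Log which thread combines which partial result. Output order
     between threads is not deterministic; for observation only. */
  printf("thread %d combines partial result (res[0]=%d) into %p\n",
         omp_get_thread_num(), in.res[0], (void*) inout->res);
  for (int i = 0; i < in.arr_sz; ++i)
    inout->res[i] += in.res[i];
  free(in.res);
}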

A complication is that while custom reductions can have an initializer, they do not have a destructor to handle deallocation. You have to handle it in the reduction step, as you have no way of accessing the partial results after that.

Interaction with nowait

If you split the parallel and for pragmas in two, as in the original code, you can attach the reduction to either pragma. If you attach it to the for, be wary of the nowait optimization. The standard says:

If nowait is not specified for the construct, the reduction computation will be complete at the end of the region that corresponds to the construct; however, if the reduction clause is used on a construct to which nowait is also applied, accesses to the original list item will create a race and, thus, have unspecified effect unless synchronization ensures that they occur after all threads have executed all of their iterations or section constructs, and the reduction computation has completed and stored the computed value of that list item. This can be ensured simply through a barrier synchronization in most cases.

This also highlights why it's better to have the reduction at the end of the parallel section: there is always an implicit barrier at the end of the parallel section, which can conveniently serve the reduction instead of requiring another barrier at the end of the for. Of course, this assumes that the result is not immediately used by multiple threads.
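
For concreteness, here is a sketch of the split form with nowait and the barrier that the quoted text calls for (it reuses the my_add_reduction declaration from above):

#pragma omp parallel
{
  #pragma omp for reduction(my_add_reduction:reduced) nowait
  for (int iloop = 0; iloop < 1024; iloop++) {
    my_add(reduced.res, arr_sz);
  }
  /* Other work that does not touch `reduced` may overlap here. */

  /* Without this barrier, reading the original `reduced` races with
     the pending reduction, as the quoted standard text warns. */
  #pragma omp barrier
  /* After the barrier, the combined result is visible to all threads. */
}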

Optimizing the first addition

Note that even thread 0 does not use the original object directly but a local instance, which it then adds to the original. This is slightly wasteful in cases like this one, where the original does not contain any values yet. We can avoid one combine step by assigning instead of adding in the first reduction. Change the code like this:

static void add_reduction(struct ReductionArgs* inout, struct ReductionArgs in)
{
  if(! inout->res) {
    /* move content instead of assigning */
    *inout = in;
    return;
  }
  for(int i = 0; i < in.arr_sz; ++i)
    inout->res[i] += in.res[i];
  free(in.res);
}
...
int main(void) {

  int arr_sz=10; // Input parameter in real program

  /* reduced will get array assigned in reduction */
  struct ReductionArgs reduced = { NULL, arr_sz };
# pragma omp parallel for reduction(my_add_reduction:reduced) 
  for(int iloop=0; iloop<1024; iloop++) {
    my_add(reduced.res, arr_sz);
  }
  int *my_array = reduced.res;
  ...
}

5 Comments

If you deallocate after the for, you cannot use for nowait, because the reduction could be executed in the next barrier.
Right, should have mentioned that. I assumed for nowait reduction(…) wouldn't even compile because, well, how is that supposed to work, right? But it does for some reason.
It is valid code. The semantic is that the result is available after the next barrier. In your case that would be the end of the parallel region.
Just to double-check my mental model of this: There will be a barrier (or other form of synchronization) at the end of a for nowait loop that ensures the main thread can access all thread items for the reduction but there will not be a second barrier after the main thread completed that, correct? Interestingly, the standard warns about accessing the main item but is quiet about changing other thread items while the main thread may still run the reduction.
The relevant sentence in OpenMP 6.0 is "To avoid data races, concurrent reads or updates of the original list item must be synchronized with the update of the original list item that occurs as a result of the reduction, which may occur after execution of the construct on which the reduction-scoping clause appears, for example, due to the use of a nowait clause." You cannot assume the reduction to be complete before the next synchronization after the end of the for region. After synchronization, all threads will have the reduced value added to their original value.
