
I am running a simulation in which many random numbers are generated. The RNG is implemented as a C++ object which has a public method returning the random number. In order to use it with OpenMP parallelization, I simply create an array of such RNG objects, one for every thread. Each thread then generates its own random numbers by calling one of the RNGs. E.g.:

  for (int i = 0; i < iTotThreads; i++) {
    aRNG[i] = new RNG();
  }
  // ... stuff here
#pragma omp parallel 
  {
    int iT = omp_get_thread_num();  // declared inside the region, so private to each thread
#pragma omp for
    for ( /* big loop */) {
      // more stuff
      aRNG[iT]->getRandomNumber();
      // more stuff
    }
  }  

Even though each RNG works on its own member variables, and two such RNGs do not fit within a single cache line (I also tried explicitly aligning each of them at creation), there seems to be some false sharing going on, as the code does not scale at all.

If I instantiate the objects within an omp parallel region:

#pragma omp parallel
  { 
    int i = omp_get_thread_num();  // private: each thread allocates its own RNG
    aRNG[i] = new RNG();
  }

the code scales perfectly. Do you have any idea of what I am missing here?

EDIT: by the way, in the second case (the one that scales well), the parallel region in which I create the RNGs is not the same as the one in which I use them. I'm counting on the fact that when I enter the second parallel region every pointer in aRNG[] will still point to one of my objects, but I guess this is bad practice...

Comments
  • Are you using any global variables (or static variables) in your random number generator ? Commented Jan 9, 2014 at 10:10
  • The result is normalized to a static const unsigned long MY_MAX_RAND before being returned, but otherwise each RNG only writes to its own private member variables and arrays. Commented Jan 9, 2014 at 10:14
  • Unrelated, but why are you using pointers here?! Commented Jan 9, 2014 at 13:22
  • The constructor of the RNG class actually takes some arguments so that the RNG of each parallel thread is initialized with a different seed. It was just more convenient for me to have an array of pointers, and then call new RNG() for each of them. Commented Jan 9, 2014 at 13:52
  • Most memory allocators nowadays are thread-aware and use separate per-thread memory arenas. Try adding a dummy padding variable to your PRNG state the size of a cache line and make sure the compiler does not optimise it out (see the sketch after these comments). Commented Jan 9, 2014 at 21:58
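
A minimal sketch of the padding idea from the last comment; the RNG body here is a hypothetical stand-in (the real class takes seed arguments), and 64 bytes is a common but not universal cache-line size:

  #include <omp.h>

  // Hypothetical stand-in for the question's generator.
  struct RNG {
    unsigned long state = 12345;
    unsigned long getRandomNumber() { return state = state * 1103515245UL + 12345UL; }
  };

  // alignas(64) rounds sizeof(PaddedRNG) up to a multiple of 64 bytes,
  // so two adjacent array elements can never share a cache line.
  struct alignas(64) PaddedRNG {
    RNG rng;
  };

  int main() {
    int iTotThreads = omp_get_max_threads();
    // Note: C++17 guarantees over-aligned allocation from new[];
    // older compilers may need posix_memalign or similar.
    PaddedRNG* aRNG = new PaddedRNG[iTotThreads];
  #pragma omp parallel
    {
      int iT = omp_get_thread_num();
      aRNG[iT].rng.getRandomNumber();  // each thread touches only its own line
    }
    delete[] aRNG;
  }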

1 Answer


Although I doubt from your description that false sharing is the cause of your problem, why don't you simplify the code in this way:

  // ... stuff here
#pragma omp parallel 
  {
    RNG rng;
#pragma omp for
    for ( /* big loop */) {
      // more stuff
      rng.getRandomNumber();
      // more stuff
    }
  }

Being declared inside a parallel region, rng will be a private variable with automatic storage duration, so:

  • each thread will have its own private random number generator (no false sharing possible here)
  • you don't have to manage allocation/deallocation of a resource

If this approach is infeasible then, following the suggestion of @HristoIliev, you can always declare a threadprivate variable to hold the pointer to the random number generator:

static std::shared_ptr<RNG> rng;
#pragma omp threadprivate(rng)

and allocate it in the first parallel region:

rng.reset( new RNG );
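
For completeness, a self-contained sketch of this pattern could look as follows; the RNG class is a stand-in modeled on the question (the real one takes seed arguments), and the LCG constants are arbitrary:

  #include <memory>
  #include <omp.h>

  // Stand-in for the question's generator.
  struct RNG {
    explicit RNG(unsigned long seed) : state(seed) {}
    unsigned long getRandomNumber() {
      return state = state * 6364136223846793005UL + 1442695040888963407UL;
    }
    unsigned long state;
  };

  static std::shared_ptr<RNG> rng;
  #pragma omp threadprivate(rng)

  int main() {
    // First parallel region: each thread allocates its own generator.
  #pragma omp parallel
    rng.reset(new RNG(omp_get_thread_num() + 1));

    // A later parallel region reuses the same per-thread copy of rng,
    // provided the persistence conditions quoted below hold.
  #pragma omp parallel
    {
      unsigned long dummy = 0;
      for (int i = 0; i < 1000000; ++i)
        dummy += rng->getRandomNumber();
    }
    // No explicit delete: each thread's shared_ptr releases its RNG.
    return 0;
  }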

In this case, though, there are a few caveats: the value of rng is preserved across parallel regions only under certain conditions (quoting from the OpenMP 4.0 standard):

The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active parallel regions only if all the following conditions hold:

  • Neither parallel region is nested inside another explicit parallel region.
  • The number of threads used to execute both parallel regions is the same.
  • The thread affinity policies used to execute both parallel regions are the same.
  • The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions.

If these conditions all hold, and if a threadprivate variable is referenced in both regions, then threads with the same thread number in their respective regions will reference the same copy of that variable.
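
If you want to make the thread-count and dyn-var conditions explicit, a small fragment along these lines can help (a sketch, not taken from the question's code):

  omp_set_dynamic(0);                    // force dyn-var to false
  int nthreads = omp_get_max_threads();  // one fixed team size throughout

  #pragma omp parallel num_threads(nthreads)
  {
    // allocate the threadprivate rng here
  }

  #pragma omp parallel num_threads(nthreads)
  {
    // same team size, dynamic adjustment off: each thread
    // sees the copy of rng it allocated above
  }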


Comments

It behaves like false sharing, but I agree that it is an unlikely explanation... I just have no idea of what could be the issue. There should be no conflict between the threads. Regarding your suggestion: the code is actually much more structured than in my example. Unfortunately, I cannot declare the RNG within the same parallel region in which I am using it; it would mean creating a new RNG at every simulation step.
@AstralCar, declare static RNG* rng; and, on the next line, #pragma omp threadprivate(rng); allocate it in the first parallel region and delete it in the last parallel region (see the sketch below).
I was looking into using threadprivate but I did not know that it works with static pointers. It works perfectly now, even without resorting to std::shared_ptr. Thanks guys!
@AstralCar Resorting to shared_ptr is just for resource management. In fact, you allocate the thing in the first parallel region and forget about its deallocation: shared_ptr will take care of that in its destructor.
@lorniper, firstprivate variables do not persist across parallel regions. threadprivate variables do. Also, threadprivate variables are not initialised like firstprivate since there is no "parent" variable to get their value from.
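
For reference, a minimal sketch of the raw-pointer variant described in the comments above (the seed argument is an assumption):

  static RNG* rng = nullptr;
  #pragma omp threadprivate(rng)

  // First parallel region: one allocation per thread.
  #pragma omp parallel
  rng = new RNG(omp_get_thread_num() + 1);

  // ... intervening parallel regions call rng->getRandomNumber() ...

  // Last parallel region: matching per-thread cleanup.
  #pragma omp parallel
  delete rng;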
