
I want to effectively parallelize the following sum in C:

  #pragma omp parallel for num_threads(nth)
  for(int i = 0; i < l; ++i) pout[pg[i]] += px[i];

where:

  • px is a pointer to a double array x of size l containing the data,
  • pg is a pointer to an integer array g of size l that assigns each element of x to one of ng groups, which occur in random order, and
  • pout is a pointer to a double array out of size ng, initialized with zeros, that will contain the result of summing x over the grouping defined by g.

The code above works, but the performance is not optimal, so I wonder if there is something I can do in OpenMP (such as a reduction() clause) to improve the execution. The dimensions l and ng of the arrays and the number of threads nth are known and fixed beforehand. I cannot access the arrays directly; only the pointers are passed to the function that does the parallel sum.
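To make the setup concrete, the function receiving the pointers might look roughly like this (a sketch; the name group_sum and the const qualifiers are illustrative, not from the actual software):

  /* hypothetical wrapper: only the pointers and the sizes l, ng, nth are passed in */
  void group_sum(double *pout, const double *px, const int *pg,
                 int l, int ng, int nth)
  {
    /* pout is assumed to be zero-initialized by the caller */
    #pragma omp parallel for num_threads(nth)
    for (int i = 0; i < l; ++i)
      pout[pg[i]] += px[i];
  }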

4 Comments
  • either iterate over groups or apply reduction over whole pout array if OMP in your compiler supports it Commented Jan 7, 2022 at 17:09
  • The kind of indirect addressing you show is about as cache-unfriendly as iterative code gets; then you throw multiple threads at it so they can all contend over the shared cache. I wouldn't be surprised to learn that single-threaded execution is as good as it gets. Commented Jan 7, 2022 at 17:09
  • By "The code above works", you mean the sequential implementation without the OpenMP pragma, don't you? Otherwise, it would be surprising. How big are l and ng in practice? How many threads do you expect to use, approximately? The best resulting algorithm will change a lot based on these values. Commented Jan 7, 2022 at 17:24
  • Thanks for the comments. The code is an interior component of some data manipulation software; arrays can be up to 100 million elements or more. The inputs are as they are, and rewriting the whole thing to execute one group at a time would be tedious and not faster. I am a beginner in C programming @HighPerformanceMark, so if you can show me how to write this loop in a better way I'd highly appreciate it. As far as my benchmarks go, using pointers is not much slower than direct array indexing. Parallel execution gives about a 2x speedup with 4 threads, so not optimal. The software runs on PCs. Commented Jan 7, 2022 at 18:06

2 Answers

  1. Your code has a data race (at pout[pg[i]] += ...); you should fix it first, then worry about performance.

  2. If ng is not too big and your compiler supports OpenMP 4.5+, the most efficient solution is an array-section reduction: #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng]) (see the sketch after this list).

  3. If ng is too big, most probably the best idea is to use a serial version of the program on PCs. Note that your code can also be made correct by adding #pragma omp atomic before pout[pg[i]] += ..., but its performance is questionable.
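For concreteness, here is a rough sketch of both variants, using the pointer and size names from the question (not tested against the asker's actual setup):

/* Variant 1: array-section reduction (requires OpenMP 4.5+).
   Each thread gets a private, zero-initialized copy of pout[0..ng-1],
   and OpenMP combines the copies at the end of the loop. */
#pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
for (int i = 0; i < l; ++i)
  pout[pg[i]] += px[i];

/* Variant 2: atomic updates. Correct, but usually slow because
   all threads contend on the shared pout array. */
#pragma omp parallel for num_threads(nth)
for (int i = 0; i < l; ++i) {
  #pragma omp atomic
  pout[pg[i]] += px[i];
}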


10 Comments

Thanks a lot for this as well! I will also check it out.
So I tried this with 100 million random normal variates and 1 million random groups. Serial execution is 1.3 sec, execution with two threads gives about 1 sec, execution with 4 threads about 0.8 sec. The result is more numerically correct than what I have above (indeed there were data race issues), but small numerical issues remain (mean relative difference 1.4382e-06). Overall I think OpenMP alone does not solve the issue sufficiently well, so I'll go with serial code for the current software release and look into the more elaborate parallel programming answers later.
@Sebastian Small numerical differences are to be expected with parallelization because floating-point addition is not associative and the order of operations matters. That doesn't mean the parallelized version is worse; it may even be more accurate. If it bothers you, you could use Kahan summation for higher accuracy, but that will cost a lot of runtime. See docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html for the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" document.
This only works if pout is statically allocated, right? Not if it's malloced or so?
Thanks @Homer512 for the clarification, then the result is reasonable. @VictorEijkhout pout is filled using memset(pout, 0.0, sizeof(double) * ng);

From your description it sounds like you have a many-to-few mapping. That is a big problem for parallelism because you likely have write conflicts in the target array. Attempts to control with critical sections or locks will probably only slow down the code.

Unless it is prohibitive in memory, I would give each thread a private copy of pout and sum into that, then add those copies together. Now the reading of the source array can be nicely divided up between the threads. If the pout array is not too large, your speedup should be decent.

Here is the crucial bit of code:

#pragma omp parallel shared(sum,threadsum)
  {
    // each thread owns one slice of threadsum, a zero-initialized
    // array of nthreads*ngroups elements
    int thread = omp_get_thread_num(),
      myfirst = thread*ngroups;
    #pragma omp for
    for ( int i=0; i<biglen; i++ )
      threadsum[ myfirst+indexes[i] ] += 1;
    // combine the per-thread partial results, one group per iteration
    #pragma omp for
    for ( int igrp=0; igrp<ngroups; igrp++ )
      for ( int t=0; t<nthreads; t++ )
        sum[igrp] += threadsum[ t*ngroups+igrp ];
  }
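Adapted to the question's actual problem (summing px over the groups in pg), the same pattern would look roughly like this; it is only a sketch, assuming <stdlib.h> and <omp.h> are included and the question's names pout, px, pg, l, ng, nth:

/* per-thread partial sums: nth slices of ng doubles each, zeroed by calloc */
double *threadsum = calloc((size_t)nth * ng, sizeof(double));

#pragma omp parallel num_threads(nth) shared(pout,threadsum)
  {
    int thread = omp_get_thread_num(),
      myfirst = thread*ng;
    #pragma omp for
    for ( int i=0; i<l; i++ )
      threadsum[ myfirst+pg[i] ] += px[i];   /* each thread writes only its own slice */
    #pragma omp for
    for ( int igrp=0; igrp<ng; igrp++ )      /* combine partial sums, one group per iteration */
      for ( int t=0; t<nth; t++ )
        pout[igrp] += threadsum[ t*ng+igrp ];
  }
free(threadsum);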

Now for the tricky bit. I'm using an index array of size 100M, but the number of groups is crucial. With 5000 groups I get good speedup, but with only 50, even though I've eliminated things like false sharing, I get pathetic or no speedup. This is not clear to me yet.

Final word: I also coded @Laci's solution of just using a reduction. Testing with 1M output groups: for 2-8 threads the reduction solution is actually faster, but for higher thread counts I win by almost a factor of 2, because the reduction solution repeatedly adds the whole array while I sum it just once, and in parallel. For smaller numbers of groups the reduction is probably preferred overall.

11 Comments

Thanks! I understand what you are saying and it sounds like a good solution to me. Given that I am new to parallel programming, would it be possible for you to give me a basic template of what that could look like?
Thanks a lot! I will experiment around with this!
You stated that you could have 100M elements in the input. What is a typical value of the number of groups?
I wonder if the poor performance with N=50 is related to store-forwarding penalties. Basically, a 3-5 cycle penalty when reading values that were written very recently. This naturally has a higher chance of occurring with small N, though I would have expected this to happen with much smaller N. What distribution are your input values? It could also relate to cache-line bouncing at the border of the per-thread arrays. That should go away when N is a multiple of 8.
@Homer512 I thought your approach was interesting. And benchmarking depends on the machine. I have 56 core two-socket nodes with tons of bandwidth.
