
My program needs to perform some heavy calculations on all widgets in the box. The calculations are repeated an appreciable number of times, processing multiple variations of each widget.

All of the subsequent variations depend on the base one, but are independent of each other.

Currently the program uses two parallelized loops:

  1. Process each widget's base variation:
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < box.size(); i++) {
    process(base[i]);
}
  2. Process all subsequent variations of each widget:
/* Older OpenMP didn't support nested loops */
#pragma omp parallel for schedule(dynamic)
for (long l = 0; l < box.size() * nVariations; l++) {
    int i = l % box.size();
    int variation = l / box.size();
    process(widgets[i][variation]);
}

This works, but is suboptimal, because the second loop does not start until the first one has completed -- meaning that the processing cores are underutilized while the last widgets are still going through the first loop.

So I think it would need to be a single loop... Is there some way to express this dependency of the subsequent variations on the base one, so that each widget can (depending on processor availability) continue to be processed as soon as its base variation is calculated, but not before then?

Requirement: the program must continue to build and work both on Linux, where we use GNU C++ 14.x, and on Windows, using Visual C++ 2019.

  • How long does the first loop take, compared with the second one? Have you done any timings? Commented Jul 21 at 19:06
  • Well, I am not even sure you will gain all that much by combining the first and second loop, as they work on different data. So this setup might even work more efficiently, especially if base and widgets are contiguous in memory (e.g. they are vectors). So it is time to start measuring :) First profile your current code and get a baseline (check cache misses / branch-prediction misses). Then, if you combine the loops somehow... measure again Commented Jul 21 at 19:09
  • In other words optimization should always be based on measurements (on a release build with all optimizations on). Commented Jul 21 at 19:11
  • @PepijnKramer, consider a 32-core machine processing a box of 100 widgets, with base + 100 subsequent variations on each one. When there are fewer than 32 base variations remaining in the first loop, we begin underutilizing the available cores. The second loop would not begin even though there are 68 widgets ready for it. When the box is smaller than the core-count, we're wasting time through the entire first loop :-( Commented Jul 21 at 19:17
  • @catnip, the variations currently take from 0.5 seconds to 5 seconds of CPU time (measured with rusage()) depending on the widget type. The base variation takes, on average, about 1.5 times longer than each subsequent one. Commented Jul 21 at 19:18

1 Answer


This is a perfect use-case for OpenMP tasks!

You can create some tasks and specify dependencies between them (forming a Directed Acyclic Graph -- a.k.a. DAG). Here is an untested example based on your code:

std::vector<char> deps(box.size()); // one dependence sentinel per widget, shared by all tasks

#pragma omp parallel
#pragma omp single // a single thread generates all tasks, so they are siblings and the depend clauses apply
{
    for (int i = 0; i < box.size(); i++) {
        // the base variation "produces" deps[i]
        #pragma omp task depend(out: deps[i])
        process(base[i]);
    }

    // [...]

    for (long l = 0; l < box.size() * nVariations; l++) {
        int i = l % box.size();
        // each subsequent variation "consumes" deps[i], so it can start as soon as
        // the base task of widget i has completed, regardless of the other widgets
        #pragma omp task depend(in: deps[i])
        {
            int variation = l / box.size();
            process(widgets[i][variation]);
        }
    }

    // You can wait for the tasks explicitly inside the parallel region with
    // #pragma omp taskwait

    // All tasks are implicitly waited for at the end of the parallel region
}

This does not work with the default MSVC 2019 OpenMP runtime, which has been obsolete for about 23 years, so you need to use the (experimental) LLVM one on MSVC 2019.
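For reference, a rough sketch of the compiler invocations (untested; the file names are placeholders):

# Linux, GCC 14.x
g++ -O3 -fopenmp widgets.cpp -o widgets

# Windows, MSVC 2019 -- LLVM OpenMP runtime instead of the legacy one
cl /O2 /openmp:llvm widgets.cpp

Per the comments below, the task-dependence version may additionally need the experimental switch (/openmp:experimental) rather than plain /openmp:llvm.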


Please note that dependencies are quite expensive, so merging the two loops is better if you can, especially on many-core CPUs (since the sequential creation of the tasks can be a bottleneck). Here is an example without dependencies:

#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for (long i = 0; i < box.size(); i++)
{
    process(base[i]);

    #pragma omp taskloop
    for (long v = 0; v < nVariations; v++)
    {
        process(widgets[i][v]);
    }

    // the inner taskloop's child tasks are implicitly awaited here (its implicit taskgroup), unless nogroup is used
}

// same thing here (implicit synchronisation), also due to the end of the parallel region

This typically creates box.size() tasks, where each task recursively creates other tasks. Child tasks of the same parent task can be executed on different threads, so all threads can be active if there are enough available tasks (assuming the runtime does a good job). Please note that creating too many tasks can be expensive. Fortunately, you can easily control the number of created tasks with additional clauses like grainsize or num_tasks. You can control the implicit synchronisation with the nogroup clause.
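For instance, here is a small untested sketch of the same structure with those clauses; the grainsize values are arbitrary and should be tuned for your workload:

#pragma omp parallel
#pragma omp single
#pragma omp taskloop grainsize(1)
for (long i = 0; i < box.size(); i++)
{
    process(base[i]);

    // grainsize(4) packs roughly 4 variations into each task;
    // nogroup removes the implicit taskgroup, so this outer task does not
    // wait for its children -- idle threads can pick them up at any point
    #pragma omp taskloop grainsize(4) nogroup
    for (long v = 0; v < nVariations; v++)
        process(widgets[i][v]);
}
// the implicit barrier at the end of the parallel region still waits for all tasks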


10 Comments

The second one is something I was thinking about, but would it work with VC++ 2019? My own testing happens only on Linux (gcc-14.x) and FreeBSD (clang++ 19.x) -- if I make commits, and then others can't build it on Windows, I'll be in some trouble :-)
The second one should work with the LLVM runtime (i.e. /openmp:llvm, AFAIK no need for the experimental one). For the first code, you need the experimental one (and I advise you to carefully test it). See the linked post in the question for more information about this.
Aside from this, CI is the way to go to check commits before merging to a main branch ;-) . It is now the gold standard of all (big/serious) (open-source) projects, especially on GitHub/GitLab. CI for Windows exists, though it is IMHO quite a pain to set up (still far better than testing things manually every time). Independently of this, reviews are critical to avoid such issues or common mistakes (i.e. your colleagues on Windows can review your changes ;-) ).
Our production servers run Linux and all build this code locally at deploy-time, to take advantage of the -march=native flag. But some of the developers are using Windows, and so the code still needs to build there too -- and we don't have Windows CI-runners even for the projects for which CI is used...
I think the inner taskloop should have nogroup. Otherwise the load balance might be worse than with the initial solution.
Checking again, I realized that you are actually missing the single/master in parallel single/master taskloop. In your current version each thread will create box.size() tasks. Regarding the nogroup, you are right, and the task scheduling constraint will not prevent the other threads from executing the tasks when they arrive at the barrier.
