
My program needs to perform some heavy calculations on all widgets in the box. The calculations are repeated an appreciable number of times, processing multiple variations of each widget.

All of the subsequent variations depend on the base one, but are independent of each other.

Currently the program uses two parallelized loops:

  1. Process each widget's base variation:
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < box.size(); i++) {
    process(base[i]);
}
  2. Process all subsequent variations of each widget:
/* Older OpenMP didn't support nested loops */
#pragma omp parallel for schedule(dynamic)
for (long l = 0; l < box.size() * nVariations; l++) {
    int i = l % box.size();
    int variation = l / box.size();
    process(widgets[i][variation]);
}

This works, but is suboptimal, because the second loop does not start until the first one has completed -- meaning that the processing cores are underutilized while the last widgets are still going through the first loop.

So I think it would need to be a single loop... Is there some way to express this dependency of the subsequent variations on the base one, so that each widget can (depending on processor availability) continue to be processed as soon as its base variation is calculated, but not before then?

Requirement: the program must continue to build and work both on Linux, where we use GNU C++ 14.x, and on Windows, using Visual C++ 2019.

  • How long does the first loop take, compared with the second one? Have you done any timings? Commented Jul 21 at 19:06
  • Well, I am not even sure you will gain all that much by combining the first and second loop, as they work on different data. So this setup might even work more efficiently, especially if base and widgets are contiguous in memory (e.g. they are vectors). So it is time to start measuring :) First profile your current code and get a baseline (check cache misses / branch-prediction misses). Then, if you combine the loops somehow... measure again Commented Jul 21 at 19:09
  • In other words optimization should always be based on measurements (on a release build with all optimizations on). Commented Jul 21 at 19:11
  • @PepijnKramer, consider a 32-core machine processing a box of 100 widgets, with base + 100 subsequent variations on each one. When there are fewer than 32 base variations remaining in the first loop, we begin underutilizing the available cores. The second loop would not begin even though there are 68 widgets ready for it. When the box is smaller than the core-count, we're wasting time through the entire first loop :-( Commented Jul 21 at 19:17
  • @catnip, the variations currently take from 0.5 seconds to 5 seconds of CPU time (measured with rusage()) depending on the widget type. The base variation takes, on average, about 1.5 times longer than each subsequent one. Commented Jul 21 at 19:18

1 Answer


This is a perfect use-case for OpenMP tasks!

You can create some tasks and specify dependencies between them (forming a Directed Acyclic Graph -- a.k.a. DAG). Here is an untested example based on your code:

std::vector<char> deps(box.size()); // one dependence sentinel per widget, shared by all tasks

#pragma omp parallel
#pragma omp single // a single thread generates all tasks, so they are siblings and the depend clauses apply
{
    for (int i = 0; i < box.size(); i++) {
        // the base variation "produces" deps[i]
        #pragma omp task depend(out: deps[i])
        process(base[i]);
    }

    // [...]

    for (long l = 0; l < box.size() * nVariations; l++) {
        int i = l % box.size();
        // each subsequent variation "consumes" deps[i], so it can start as soon as
        // the base task of widget i has completed, regardless of the other widgets
        #pragma omp task depend(in: deps[i])
        {
            int variation = l / box.size();
            process(widgets[i][variation]);
        }
    }

    // You can wait for the tasks explicitly inside the parallel region with
    // #pragma omp taskwait

    // All tasks are implicitly waited for at the end of the parallel region
}

This does not work with the default MSVC 2019 OpenMP runtime, which has been obsolete for about 23 years, so you need to use the (experimental) LLVM one on MSVC 2019.
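For reference, a rough sketch of the compiler invocations (untested; the file names are placeholders):

# Linux, GCC 14.x
g++ -O3 -fopenmp widgets.cpp -o widgets

# Windows, MSVC 2019 -- LLVM OpenMP runtime instead of the legacy one
cl /O2 /openmp:llvm widgets.cpp

Per the comments below, the task-dependence version may additionally need the experimental switch (/openmp:experimental) rather than plain /openmp:llvm.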


Please note that dependencies are quite expensive, so merging the two loops is better if you can, especially on many-core CPUs (since the sequential creation of the tasks can be a bottleneck). Here is an example without dependencies:

#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for (long i = 0; i < box.size(); i++)
{
    process(base[i]);

    #pragma omp taskloop
    for (long v = 0; v < nVariations; v++)
    {
        process(widgets[i][v]);
    }

    // the inner taskloop's child tasks are implicitly awaited here (its implicit taskgroup), unless nogroup is used
}

// same thing here (implicit synchronisation), also due to the end of the parallel region

This typically creates box.size() tasks, where each task recursively creates other tasks. Child tasks of the same parent task can be executed on different threads, so all threads can be active if there are enough available tasks (assuming the runtime does a good job). Please note that creating too many tasks can be expensive. Fortunately, you can easily control the number of created tasks with additional clauses like grainsize or num_tasks. You can control the implicit synchronisation with the nogroup clause.
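For instance, here is a small untested sketch of the same structure with those clauses; the grainsize values are arbitrary and should be tuned for your workload:

#pragma omp parallel
#pragma omp single
#pragma omp taskloop grainsize(1)
for (long i = 0; i < box.size(); i++)
{
    process(base[i]);

    // grainsize(4) packs roughly 4 variations into each task;
    // nogroup removes the implicit taskgroup, so this outer task does not
    // wait for its children -- idle threads can pick them up at any point
    #pragma omp taskloop grainsize(4) nogroup
    for (long v = 0; v < nVariations; v++)
        process(widgets[i][v]);
}
// the implicit barrier at the end of the parallel region still waits for all tasks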


10 Comments

The second one is something I was thinking about, but would it work with VC++ 2019? My own testing happens only on Linux (gcc-14.x) and FreeBSD (clang++ 19.x) -- if I make commits, and then others can't build it on Windows, I'll be in some trouble :-)
The second one should work with the LLVM runtime (i.e. /openmp:llvm, AFAIK no need for the experimental one). For the first code, you need the experimental one (and I advise you to carefully test it). See the linked post in the question for more information about this.
Aside from this, CI is the way to go to check commits before merging to a main branch ;-) . It is now the gold standard of all (big/serious) (open-source) projects, especially on GitHub/GitLab. CI for Windows exists, though it is IMHO quite a pain to set up (still far better than testing things manually every time). Independently of this, reviews are critical to avoid such issues or common mistakes (i.e. your colleagues on Windows can review your changes ;-) ).
Our production servers run Linux and all build this code locally at deploy-time, to take advantage of the -march=native flag. But some of the developers are using Windows, and so the code still needs to build there too -- and we don't have Windows CI-runners even for the projects for which CI is used...
I think the inner taskloop should have nogroup. Otherwise the load balance might be worse than with the initial solution.
Checking again, I realized that you are actually missing the single/master in parallel single/master taskloop. In your current version each thread will create box.size() tasks. Regarding the nogroup, you are right, and the task scheduling constraint will not prevent the other threads from executing the tasks when they arrive at the barrier.
