My program needs to perform some heavy calculations on all widgets in a box. The calculations are repeated an appreciable number of times, processing multiple variations of each widget.
All of the subsequent variations depend on the base one, but are independent of each other.
Currently the program uses two parallelized loops:
- Process each widget's base variation:
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < (int)box.size(); i++) {
    process(base[i]);
}
- Process all subsequent variations of each widget:
/* Older OpenMP didn't support collapsing nested loops, so flatten by hand */
#pragma omp parallel for schedule(dynamic)
for (long l = 0; l < (long)(box.size() * nVariations); l++) {
    int i = (int)(l % box.size());
    int variation = (int)(l / box.size());
    process(widgets[i][variation]);
}
This works, but is suboptimal: the second loop does not start until the first one has completed, meaning that processor cores sit idle while the last widgets are still going through the first loop.
So, I think, it'd need to be a single loop... Is there some way to express this dependency of the subsequent variations on the base one, so that each widget can (depending on processor availability) continue to be processed as soon as its base variation is calculated, but not before then?
Requirement: the program must continue to build and work both on Linux, where we use GNU C++ 14.x, and on Windows, using Visual C++ 2019.
base and widgets are contiguous in memory (e.g. they are vectors). Processing time varies (per rusage()) depending on the widget type. The base variation takes, on average, about 1.5 times longer than each subsequent one.