Why does gcc's implementation of openMP fail to parallelise a recursive function inside another recursive function

Question

I am trying to parallelise these recursive functions with openMP tasks,

when I compile with gcc it runs only on 1 thread. When i compile it with clang it runs on multiple threads

The second function calls the first one which doesn't generate new tasks to stop wasting time.

gcc does work when there is only one function that calls itself.

Why is this?

Am I doing something wrong in the code?

Then why does it work with clang?

I am using gcc 9.3 on windows with Msys2. The code was compiled with -O3 -fopenmp

//the program compiled by gcc only runs on one thread
#include<vector>
#include<omp.h>
#include<iostream>
#include<ctime>

using namespace std;

vector<int> vec;
thread_local double steps;

void excalibur(int current_node, int current_depth) {
    #pragma omp simd
    for( int i = 0 ; i < current_node; i++){
        ++steps;
        excalibur(i, current_depth);
    }
    if(current_depth > 0){
        int new_depth = current_depth - 1;
        #pragma omp simd
        for(int i = current_node;i <= vec[current_node];i++){
            ++steps;
            excalibur(i + 1,new_depth);
        }
    } 
}

void mario( int current_node, int current_depth) {
    #pragma omp task firstprivate(current_node,current_depth)
    {
        if(current_depth > 0){
            int new_depth = current_depth - 1;
            for(int i = current_node;i <= vec[current_node];i++){
                ++steps;
                mario(i + 1,new_depth);
            }
        } 
    } 
    #pragma omp simd
    for( int i = 0 ; i < current_node; i++){
        ++steps;
        excalibur(i, current_depth);
    }
}



int main() {
    double total = 0;
    clock_t tim = clock();
    omp_set_dynamic(0);
    int nodes = 10;
    int timesteps = 3;
    omp_set_num_threads(4);
    vec.assign( nodes, nodes - 2 );
    #pragma omp parallel
    {
        steps = 0;
        #pragma omp single
        {
            mario(nodes - 1, timesteps - 1);
        }
        #pragma omp atomic
            total += steps;
    }
    double time_taken = (double)(tim) / CLOCKS_PER_SEC;
    cout <<fixed<<total<<" steps, "<< fixed << time_taken << " seconds"<<endl;
    return 0;
}

while this works with gcc

#include<vector>
#include<omp.h>
#include<iostream>
#include<ctime>

using namespace std;

vector<int> vec;
thread_local double steps;

void mario( int current_node, int current_depth) {
    #pragma omp task firstprivate(current_node,current_depth)
    {
        if(current_depth > 0){
            int new_depth = current_depth - 1;
            for(int i = current_node;i <= vec[current_node];i++){
                ++steps;
                mario(i + 1,new_depth);
            }
        } 
    } 
    #pragma omp simd
    for( int i = 0 ; i < current_node; i++){
        ++steps;
        mario(i, current_depth);
    }
}



int main() {
    double total = 0;
    clock_t tim = clock();
    omp_set_dynamic(0);
    int nodes = 10;
    int timesteps = 3;
    omp_set_num_threads(4);
    vec.assign( nodes, nodes - 2 );
    #pragma omp parallel
    {
        steps = 0;
        #pragma omp single
        {
            mario(nodes - 1, timesteps - 1);
        }
        #pragma omp atomic
            total += steps;
    }
    double time_taken = (double)(tim) / CLOCKS_PER_SEC;
    cout <<fixed<<total<<" steps, "<< fixed << time_taken << " seconds"<<endl;
    return 0;
}

Please, provide a minimal reproducible example. There are many unknowns here that could cause this behavior (eg. Is there a parallel section? Where are the scheduling points? How your OpenMP environment is configured? How many tasks are created? How long each task last? Which version of GCC/GOMP is used? Which compiler arguments are used? etc.). — Jérôme Richard
– Jérôme Richard, Commented May 5, 2020 at 19:14
Gcc is too shy about removing infinite loops with no side effects, compared to llvm :-( — Marc Glisse
– Marc Glisse, Commented May 5, 2020 at 20:22
I might have phrased it wrong before, gcc does manage to compile the code but it only runs on one thread. — 2wings
– 2wings, Commented May 5, 2020 at 20:42

Hristo Iliev · Accepted Answer · 2020-05-06 22:17:32Z

Your program doesn't run in parallel because there is simply nothing to run in parallel. Upon first entry in mario, current_node is 9 and vec is all 8s, so this loop in the first and only task never executes:

        for(int i = current_node;i <= vec[current_node];i++){
            ++steps;
            mario(i + 1,new_depth);
        }

Hence, no recursive creation of new tasks. How and what runs in parallel when you compile it with Clang is well beyond me, since when I compile it with Clang 9, the executable behaves exactly the same as the one produced by GCC.

The second code runs in parallel because of the recursive call in the loop after the task region. But it also isn't a correct OpenMP program - the specification forbids nesting task regions inside a simd construct (see under Restrictions here):

The only OpenMP constructs that can be encountered during execution of a simd region are the atomic construct, the loop construct, the simd construct and the ordered construct with the simd clause.

None of the two compilers catches that problem when the nesting is in the dynamic and not in the lexical scope of the simd construct though.

Edit: I actually looked it a bit closer into it and I may have a suspicion about what might have caused your confusion. I guess you determine if your program works in parallel or not by looking at the CPU utilisation while it runs. This often leads to confusion. The Intel OpenMP runtime that Clang uses has a very aggressive waiting policy. When the parallel region in the main() function spawns a team of four threads, one of them goes executing mario() and the other three hit the implicit barrier at the end of the region. There they spin, waiting for new tasks to be eventually assigned to them. They never get one, but keep on spinning anyway, and that's what you see in the CPU utilisation. If you want to replicate the same with GCC, set OMP_WAIT_POLICY to ACTIVE and you'll see the CPU usage soar while the program runs. Still, if you profile the program's execution, you'll see that CPU time is spent inside your code in one thread only.

Collectives™ on Stack Overflow

Why does gcc's implementation of openMP fail to parallelise a recursive function inside another recursive function

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related