
I'm using the following code, which contains an OpenMP parallel for loop nested in another for loop. Somehow this code runs 4 times slower than the sequential version (omitting #pragma omp parallel for).

Is it possible that OpenMP has to create threads every time the method is called? In my test it is called 10000 times in a row.

I heard that sometimes OpenMP will keep the threads spinning, so I also tried setting OMP_WAIT_POLICY=active and GOMP_SPINCOUNT=INFINITE. When I remove the OpenMP pragmas, the code is about 10 times faster. Note that the method containing this code will be called 10000 times.

        for (round k = 1; k < processor.max; ++k) {
            initialise_round(k);

            for (std::vector<int> bucket : color_buckets) {
                #pragma omp parallel for schedule(dynamic)
                for (int i = 0; i < (int)bucket.size(); ++i) {
                    if (processor.mark.is_marked_item(bucket[i])) {
                        processor.process(k, bucket[i]);
                    }
                }
            }

            processor.finish_round(k);
        }
  • What are typical sizes of the vectors color_buckets and bucket? Commented Feb 12, 2015 at 19:49
  • Is there a reason you parallelize the inner loop instead of the outer one? Commented Feb 12, 2015 at 19:51
  • Aside from @GuyGreer's question, I also wonder if you actually intend to create copies of color_buckets elements. If not, do const std::vector<int> &bucket or use const auto &bucket. Commented Feb 12, 2015 at 19:52
  • The outer loop must be processed sequentially, as the algorithm builds on top of the results of the previous rounds. The 2nd loop also can't be parallelized. The color buckets are the colors of a greedily colored dependency graph and therefore have to be processed sequentially. Typical size of color_buckets is small, <=50. Typical size of bucket is about 100, but processing takes a comparably long time. Copying was not intended, but & brings no performance gain (probably optimized by the compiler). Commented Feb 12, 2015 at 20:08
  • I think it really boils down to what "comparably long" actually is, because with a parallel loop you have an obvious overhead like synchronisation, creating threads, switching between threads, ... Other than that there can also be a non-obvious overhead related to cache locality. If more data is needed than fits into the cache, you might get a lot of cache misses on thread switches. Cache misses alone can easily make an algorithm infeasible. Commented Feb 13, 2015 at 10:39

3 Answers


You say that your sequential code is much faster, which makes me think that your processor.process function does too little work per call. This leads to the case where handing the data to each thread does not pay off (the data-exchange overhead is simply larger than the actual computation on that thread).

Other than that, I think that parallelizing the middle loop wouldn't affect the algorithm's correctness, and it would increase the amount of work per thread.


3 Comments

This is a lock-free algorithm. The dependency graph is built in order to avoid concurrent writes to the same location. The time required to build the dependency graph and color it is irrelevant, as the query time (the call to the method pictured above) is the only thing that matters. One thing I noticed is that the bucket size varies strongly, between 1 and >130.
@JanB, what does 1 and >130 mean?
The number of elements in one bucket varies between 1 and 130. Parallelizing when only a couple of elements are in the bucket surely makes no sense. To test this, I once duplicated the parallelized for and only parallelized when there were more than x (5, 10, 20, ...) elements in the bucket, but that didn't increase performance.

I think you are creating a team of threads on each iteration of the loop... (although I'm not sure what for alone does - I thought it should be parallel for). In this case it would probably be better to separate the parallel from the for, so the work of forking and creating the threads is done just once rather than being repeated inside the other loops. So you could try to put a parallel pragma before your outermost loop, so the overhead of forking and thread creation is incurred only once.

7 Comments

Actually it is a parallel for; that was just a copy/paste error. Regarding the parallel block before the for loops: wouldn't the complete code then be executed by multiple threads? With #pragma omp parallel, cout << "asd" << endl; is printed omp_get_num_threads() times.
Yes, you're right. You can wrap everything inside the parallel in a directive to ensure it runs on one thread (master or single) - I think the parallel for should work correctly inside, while the other stuff is run once. But I might be wrong anyway... this suggests the compiler is smart enough to save the threads...
This answer suggests the for inside a master is wrong, but maybe the separate parallel and for directives would be OK.
Also here
That was what I was looking for, although I think the environment variables I mentioned in the post take care of this. Your second link seems wrong somehow: the title that appears on hover is correct, but it links to something about C# float-to-int conversion.

The actual problem was not related to OpenMP directly.

As the system has two CPUs, half of the threads were spawned on one CPU and the other half on the other, so there was no shared L3 cache. Combined with the fact that the algorithm doesn't scale well, this led to a performance decrease, especially when using 2-4 threads.

The solution was to use thread pinning, for example via the Linux tool taskset.
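A sketch of what that looks like on the command line (./my_program is a placeholder for the actual binary, and the core numbering 0-3 assumes those cores sit on one socket; check with lscpu):

```shell
# Pin the process and all of its OpenMP threads to cores 0-3,
# i.e. one physical CPU, so the threads share that socket's L3 cache:
taskset -c 0-3 ./my_program

# Alternatively, use OpenMP's own affinity controls (OpenMP 4.0+):
OMP_PLACES=cores OMP_PROC_BIND=close ./my_program
```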
