
I have the following code:

    omp_set_num_threads(10);

    int N = 10000;
    int chunk = 1000;
    int i;
    float a[N];

    for (i = 0; i < N; i++)
    {
        a[i] = i;
    }

    #pragma omp parallel private(i)
    {
        #pragma omp for schedule(dynamic, chunk) nowait
        for (i = 0; i < N; i++) a[i] = a[i] + a[i];
    }

The sequential code is the same, without the two pragma directives.

Sequential: ~150 µs. Parallel: ~1100 µs.

That's a huge gap, and I expected it to be the other way around.

Does anyone have a clue what's wrong, or does OpenMP really have that much to do in the background? And does anyone have an example where the parallelized for loop is actually faster?

Thanks

  • Parallel execution is usually best when the cost of doing one thing is not outweighed by the cost of scheduling things across multiple execution engines and having to coordinate everything. The cost of doubling 10,000 items is pretty low. Commented Aug 28, 2020 at 4:54
  • Thanks, that was the problem! Is there a way to preallocate the threads? Commented Aug 28, 2020 at 5:11
  • @gab: The threads have to be actually launched. They don't spring into existence on their own, and there's no way to make them exist for free through "preallocation". OpenMP hides the complexity of doing so from you, but the work still has to be done. Commented Aug 28, 2020 at 5:15
  • Since all sane OpenMP runtimes keep a thread pool, the cost of creating the threads is only paid once, no matter how many parallel regions you have. Therefore, a better way to measure the cost you'd see in real code that uses OpenMP in many places is to have an untimed, empty parallel region to start the threads before you time the parallel region you are interested in. (And be careful which clock you use. omp_get_wtime() is OK; CPU consumption is not!) Commented Aug 28, 2020 at 7:35

1 Answer


Launching ten threads and managing them costs more than adding 10,000 elements. If you're on a system with fewer than 10 true cores, the threads will compete for time slices and incur context switches and (potentially) extra cache misses. You also chose dynamic scheduling, which requires some synchronization so the threads can figure out which chunks they're going to do (not a lot, but enough to slow things down when the work being distributed is trivial by comparison).

Anecdotally, launching a no-op thread and immediately detaching it cost about 10 µs per thread, and those threads didn't need to do anything at all. In your case, that's ~100 µs just to launch the ten threads, before counting any of the other inefficiencies threading can introduce.

Parallelizing helps for big workloads, or when the workers are used many times for many moderate sized tasks (so the cost of launching the threads is a fraction of the work they do). But performing 10,000 additions is chump change to a CPU; you're just not doing enough to benefit from parallelizing it.
