
I've been trying to parallelize a nested loop as shown here:

http://pastebin.com/nkictYsw

I'm comparing the execution times of a sequential version and a parallelized version of this code, but the sequential version always seems to run faster, across a variety of inputs.

The inputs to the program are:

  • numParticles (bound of the particle loop)
  • timeStep (not important; its value doesn't change)
  • numTimeSteps (bound of the timestep loop)
  • numThreads (number of threads to use)

I've looked around the web and tried a few things out (e.g., the nowait clause), but nothing really changed. I'm fairly sure the parallel code is correct because I checked the outputs. Am I doing something wrong here?

EDIT: Also, it seems that you can't use the reduction clause on C structures?
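In C, the reduction clause only accepts scalar variables (user-defined reductions via declare reduction arrived later, in OpenMP 4.0), so a struct is usually accumulated by hand: each thread sums into a private copy, and the copies are merged in a critical section. A minimal sketch, assuming a hypothetical Vec3 type rather than whatever the pastebin code actually uses:

#include <omp.h>

/* Hypothetical type standing in for whatever the real code accumulates. */
typedef struct { double x, y, z; } Vec3;

Vec3 sum_vectors(const Vec3 *v, int n)
{
    Vec3 total = {0.0, 0.0, 0.0};
    int i;

    #pragma omp parallel private(i)
    {
        /* Each thread accumulates into its own private copy... */
        Vec3 local = {0.0, 0.0, 0.0};

        #pragma omp for
        for (i = 0; i < n; i++) {
            local.x += v[i].x;
            local.y += v[i].y;
            local.z += v[i].z;
        }

        /* ...and the private copies are merged one thread at a time. */
        #pragma omp critical
        {
            total.x += local.x;
            total.y += local.y;
            total.z += local.z;
        }
    }
    return total;
}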

EDIT2: I'm using gcc on Linux with a 2-core CPU. I have tried running this with values as high as numParticles = 40 and numTimeSteps = 100000. Maybe I should try higher?

Thanks

6 Comments
  • What compiler are you using? What is the value of numParticles? How many cores does your CPU have? Commented Aug 21, 2011 at 2:51
  • Using gcc on Linux. numParticles is 40 and numTimeSteps is 100000. The CPU has 2 cores. I tried higher values for those two variables, but the result is still the same. Commented Aug 21, 2011 at 3:03
  • numTimeSteps should have no impact since it is outside the parallel region. Are you enabling OpenMP? (I think it's -fopenmp in gcc.) Commented Aug 21, 2011 at 3:22
  • OK. what happens with numParticles = 200, numTimeSteps = 100? Commented Aug 21, 2011 at 3:47
  • With 2 threads, the non-threaded version took 0 sec and the threaded version 1 sec. I also tried numParticles = 500 and numTimeSteps = 10000, and the non-threaded version is still faster (184 sec threaded vs 155 sec non-threaded). Commented Aug 21, 2011 at 4:06

2 Answers


It is possible that your loops are too small. There is overhead associated with creating a thread to process a portion of the loop, so if the loop is too small, a parallelized version may run slower. Another consideration is the number of cores available.

Your second omp directive is less likely to be useful because that loop does far fewer calculations. I would suggest removing it.

EDIT: I tested your code with numParticles = 1000 and two threads. It ran in 30 seconds; the single-threaded version ran in 57 seconds. Even with numParticles = 40 I see a significant speedup. This was with Visual Studio 2010.


2 Comments

Oh right, good to know that it works at least. I'm running this on Linux with the gcc compiler; does this make any difference?
It does. The overhead of creating a thread may be larger. The OpenMP implementation may be slightly different. Try a larger number of particles (e.g. 500) and see what happens. Make sure you've enabled OpenMP support on the gcc command line.
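For reference, OpenMP support has to be enabled explicitly when compiling with gcc; without the flag the pragmas are ignored and the program runs sequentially. A typical invocation (the file name here is a placeholder):

gcc -fopenmp -O2 particles.c -o particles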

I can think of two possible sources for slowdown: a) compiler made some optimizations (vectorization being first) in sequential version but not in OpenMP version, and b) thread management overhead. Both are easy to check if you also run the OpenMP version with a single thread (i.e. set numThreads to 1). If it is much slower than sequential, then (a) is the most likely reason; if it is similar to sequential and faster than the same code with 2 threads, the most likely reason is (b).
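A minimal harness for that check, with a dummy workload standing in for the actual pastebin code (run_simulation and its placeholder arithmetic are stand-ins, not the OP's function), might look like:

#include <omp.h>
#include <stdio.h>

/* Dummy workload: timestep loop outside, particle loop parallelized
   inside, matching the shape described in the question. */
static double run_simulation(int numParticles, int numTimeSteps,
                             int numThreads)
{
    double sum = 0.0;
    int t, i;
    for (t = 0; t < numTimeSteps; t++) {
        #pragma omp parallel for num_threads(numThreads) reduction(+:sum)
        for (i = 0; i < numParticles; i++)
            sum += (double)i * (double)t; /* placeholder computation */
    }
    return sum;
}

int main(void)
{
    /* If the 1-thread OpenMP run is already much slower than the plain
       sequential build, suspect (a) lost compiler optimizations; if only
       the 2-thread run is slower, suspect (b) thread-management overhead. */
    double t0 = omp_get_wtime();
    double r1 = run_simulation(40, 100000, 1);
    double t1 = omp_get_wtime();
    double r2 = run_simulation(40, 100000, 2);
    double t2 = omp_get_wtime();

    printf("1 thread:  %.3f s (checksum %g)\n", t1 - t0, r1);
    printf("2 threads: %.3f s (checksum %g)\n", t2 - t1, r2);
    return 0;
}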

In the latter case, you may restructure the OpenMP code for less overhead. First, having two parallel regions (#pragma omp parallel) inside a loop is not necessary; you can have a single parallel region and two parallel loops inside it:

for (t = 0; t <= numTimeSteps; t++) {
    #pragma omp parallel num_threads(numThreads)
    {
    #pragma omp for private(j)
    /* The first loop goes here */
    #pragma omp for
    /* The second loop goes here */
    }
}

Then, the parallel region can be started before the timestep loop:

#pragma omp parallel num_threads(numThreads) private(t)
for (t = 0; t <= numTimeSteps; t++) {
    ...
}

Each thread in the region will then run this loop, and at each iteration threads will synchronize at the end of OpenMP loops. This way, you ensure that the same set of threads run through the whole computation, no matter what OpenMP implementation is used.
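Putting both changes together, a self-contained sketch could look like the following; the Particle type and the body of each loop are placeholders rather than the actual pastebin code:

#include <omp.h>

/* Hypothetical particle type; the real layout may differ. */
typedef struct { double pos, vel, force; } Particle;

void simulate(Particle *p, int numParticles, int numTimeSteps,
              double timeStep, int numThreads)
{
    int t, i, j;

    /* One team of threads survives the whole computation. */
    #pragma omp parallel num_threads(numThreads) private(t, i, j)
    for (t = 0; t <= numTimeSteps; t++) {

        /* First loop: accumulate forces. The implicit barrier at the
           end of the omp for keeps the timesteps in lockstep. */
        #pragma omp for
        for (i = 0; i < numParticles; i++) {
            p[i].force = 0.0;
            for (j = 0; j < numParticles; j++)
                if (j != i)
                    p[i].force += p[j].pos - p[i].pos; /* placeholder physics */
        }

        /* Second loop: integrate. It does far less work per iteration,
           so its parallel speedup will be limited. */
        #pragma omp for
        for (i = 0; i < numParticles; i++) {
            p[i].vel += timeStep * p[i].force;
            p[i].pos += timeStep * p[i].vel;
        }
    }
}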

4 Comments

I was also thinking about the optimization... Perhaps auto-vectorization. But you'd think the compiler would do the same for a sub-loop as well?
I'm a bit confused about why the parallel region can be started before the timestep loop. The results of later iterations of the timestep loop depend on the earlier iterations, so wouldn't it be wrong to do that? Also, I have tried comparing the times with only 1 thread for the OpenMP version: for small numParticles the execution times are the same, but as I increase it, the OpenMP version gets slower and slower. With more than 1 thread, the OpenMP version is noticeably slower even for smaller numParticles.
Each timestepping iteration will be synchronized across all threads, because #pragma omp for has an implicit barrier at the end. So a new iteration cannot start before all threads are done with the previous iteration.
@Guy Sirton: Yes, it's reasonable to expect that for the inner loop the compiler will be able to apply the same optimizations. And this makes the hypothesis of compiler optimization impact less likely, which I think is also confirmed by Jason's observations. Thread management overhead seems the most likely reason for the slowdown.
