
I've been trying to parallelize a nested loop as shown here:

http://pastebin.com/nkictYsw

I'm comparing the execution times of a sequential version and a parallelized version of this code, but the sequential version always seems to run faster, across a variety of inputs.

The inputs to the program are:

  • numParticles (bound of the particle loop)
  • timeStep (not important; its value doesn't change)
  • numTimeSteps (bound of the timestep loop)
  • numThreads (number of threads to use)

I've looked around the web and tried a few things out (e.g., the nowait clause), but nothing really changed. I'm fairly sure the parallel code is correct because I checked the outputs. Am I doing something wrong here?

EDIT: Also, it seems that you can't use the reduction clause on C structures?
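In C, the reduction clause only accepts scalar variables (user-defined reductions via declare reduction arrived later, in OpenMP 4.0), so a struct is usually accumulated by hand: each thread sums into a private copy, and the copies are merged in a critical section. A minimal sketch, assuming a hypothetical Vec3 type rather than whatever the pastebin code actually uses:

#include <omp.h>

/* Hypothetical type standing in for whatever the real code accumulates. */
typedef struct { double x, y, z; } Vec3;

Vec3 sum_vectors(const Vec3 *v, int n)
{
    Vec3 total = {0.0, 0.0, 0.0};
    int i;

    #pragma omp parallel private(i)
    {
        /* Each thread accumulates into its own private copy... */
        Vec3 local = {0.0, 0.0, 0.0};

        #pragma omp for
        for (i = 0; i < n; i++) {
            local.x += v[i].x;
            local.y += v[i].y;
            local.z += v[i].z;
        }

        /* ...and the private copies are merged one thread at a time. */
        #pragma omp critical
        {
            total.x += local.x;
            total.y += local.y;
            total.z += local.z;
        }
    }
    return total;
}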

EDIT2: I'm using gcc on Linux with a 2-core CPU. I have tried running this with values as high as numParticles = 40 and numTimeSteps = 100000. Maybe I should try higher?

Thanks

6 Comments
  • What compiler are you using? What is the value of numParticles? How many cores does your CPU have? Commented Aug 21, 2011 at 2:51
  • Using gcc on Linux. numParticles is 40 and numTimeSteps is 100000. The CPU has 2 cores. I tried higher values for those two variables, but the result is still the same. Commented Aug 21, 2011 at 3:03
  • numTimeSteps should have no impact since it is outside the parallel region. Are you enabling OpenMP? (I think it's -fopenmp in gcc.) Commented Aug 21, 2011 at 3:22
  • OK. what happens with numParticles = 200, numTimeSteps = 100? Commented Aug 21, 2011 at 3:47
  • With 2 threads, the non-threaded version took 0 sec and the threaded version 1 sec. I also tried numParticles = 500 and numTimeSteps = 10000, and the non-threaded version is still faster (184 sec threaded vs 155 sec non-threaded). Commented Aug 21, 2011 at 4:06

2 Answers


It is possible that your loops are too small. There is overhead associated with creating a thread to process a portion of the loop, so if the loop is too small, a parallelized version may run slower. Another consideration is the number of cores available.

Your second omp directive is less likely to be useful because that loop does far fewer calculations. I would suggest removing it.

EDIT: I tested your code with numParticles = 1000 and two threads. It ran in 30 seconds; the single-threaded version ran in 57 seconds. Even with numParticles = 40 I see a significant speedup. This was with Visual Studio 2010.


2 Comments

Oh right, good to know that it works at least. I'm running this on Linux with the gcc compiler; does this make any difference?
It does. The overhead of creating a thread may be larger. The OpenMP implementation may be slightly different. Try a larger number of particles (e.g. 500) and see what happens. Make sure you've enabled OpenMP support on the gcc command line.
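For reference, OpenMP support has to be enabled explicitly when compiling with gcc; without the flag the pragmas are ignored and the program runs sequentially. A typical invocation (the file name here is a placeholder):

gcc -fopenmp -O2 particles.c -o particles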

I can think of two possible sources for slowdown: a) compiler made some optimizations (vectorization being first) in sequential version but not in OpenMP version, and b) thread management overhead. Both are easy to check if you also run the OpenMP version with a single thread (i.e. set numThreads to 1). If it is much slower than sequential, then (a) is the most likely reason; if it is similar to sequential and faster than the same code with 2 threads, the most likely reason is (b).
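A minimal harness for that check, with a dummy workload standing in for the actual pastebin code (run_simulation and its placeholder arithmetic are stand-ins, not the OP's function), might look like:

#include <omp.h>
#include <stdio.h>

/* Dummy workload: timestep loop outside, particle loop parallelized
   inside, matching the shape described in the question. */
static double run_simulation(int numParticles, int numTimeSteps,
                             int numThreads)
{
    double sum = 0.0;
    int t, i;
    for (t = 0; t < numTimeSteps; t++) {
        #pragma omp parallel for num_threads(numThreads) reduction(+:sum)
        for (i = 0; i < numParticles; i++)
            sum += (double)i * (double)t; /* placeholder computation */
    }
    return sum;
}

int main(void)
{
    /* If the 1-thread OpenMP run is already much slower than the plain
       sequential build, suspect (a) lost compiler optimizations; if only
       the 2-thread run is slower, suspect (b) thread-management overhead. */
    double t0 = omp_get_wtime();
    double r1 = run_simulation(40, 100000, 1);
    double t1 = omp_get_wtime();
    double r2 = run_simulation(40, 100000, 2);
    double t2 = omp_get_wtime();

    printf("1 thread:  %.3f s (checksum %g)\n", t1 - t0, r1);
    printf("2 threads: %.3f s (checksum %g)\n", t2 - t1, r2);
    return 0;
}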

In the latter case, you may restructure the OpenMP code for less overhead. First, having two parallel regions (#pragma omp parallel) inside a loop is not necessary; you can have a single parallel region and two parallel loops inside it:

for (t = 0; t <= numTimeSteps; t++) {
    #pragma omp parallel num_threads(numThreads)
    {
    #pragma omp for private(j)
    /* The first loop goes here */
    #pragma omp for
    /* The second loop goes here */
    }
}

Then, the parallel region can be started before the timestep loop:

#pragma omp parallel num_threads(numThreads) private(t)
for (t = 0; t <= numTimeSteps; t++) {
    ...
}

Each thread in the region will then run this loop, and at each iteration threads will synchronize at the end of OpenMP loops. This way, you ensure that the same set of threads run through the whole computation, no matter what OpenMP implementation is used.
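Putting both changes together, a self-contained sketch could look like the following; the Particle type and the body of each loop are placeholders rather than the actual pastebin code:

#include <omp.h>

/* Hypothetical particle type; the real layout may differ. */
typedef struct { double pos, vel, force; } Particle;

void simulate(Particle *p, int numParticles, int numTimeSteps,
              double timeStep, int numThreads)
{
    int t, i, j;

    /* One team of threads survives the whole computation. */
    #pragma omp parallel num_threads(numThreads) private(t, i, j)
    for (t = 0; t <= numTimeSteps; t++) {

        /* First loop: accumulate forces. The implicit barrier at the
           end of the omp for keeps the timesteps in lockstep. */
        #pragma omp for
        for (i = 0; i < numParticles; i++) {
            p[i].force = 0.0;
            for (j = 0; j < numParticles; j++)
                if (j != i)
                    p[i].force += p[j].pos - p[i].pos; /* placeholder physics */
        }

        /* Second loop: integrate. It does far less work per iteration,
           so its parallel speedup will be limited. */
        #pragma omp for
        for (i = 0; i < numParticles; i++) {
            p[i].vel += timeStep * p[i].force;
            p[i].pos += timeStep * p[i].vel;
        }
    }
}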

4 Comments

I was also thinking about the optimization... Perhaps auto-vectorization. But you'd think the compiler would do the same for a sub-loop as well?
I'm a bit confused about why the parallel region can be started before the timestep loop. The results of later iterations of the timestep loop depend on the earlier iterations, so wouldn't it be wrong to do that? Also, I have tried comparing the times with only 1 thread for the OpenMP version: for small numParticles the execution times are the same, but as I increase it, the OpenMP version gets slower and slower. With more than 1 thread, the OpenMP version is noticeably slower even for smaller numParticles.
Each timestepping iteration will be synchronized across all threads, because #pragma omp for has an implicit barrier at the end. So a new iteration cannot start before all threads are done with the previous iteration.
@Guy Sirton: Yes, it's reasonable to expect that for the inner loop the compiler will be able to apply the same optimizations. And this makes the hypothesis of compiler optimization impact less likely, which I think is also confirmed by Jason's observations. Thread management overhead seems the most likely reason for the slowdown.
