
I've been playing around with OpenMP, and am trying to see if I can get a speedup in a particular bit of C++ code.

    #pragma omp parallel for
    for (Index j=alignedSize; j<size; ++j)
    {
      res[j] = cj.pmadd(lhs0(j), pfirst(ptmp0), res[j]);
      res[j] = cj.pmadd(lhs1(j), pfirst(ptmp1), res[j]);
      res[j] = cj.pmadd(lhs2(j), pfirst(ptmp2), res[j]);
      res[j] = cj.pmadd(lhs3(j), pfirst(ptmp3), res[j]);
    }

I'm a complete newbie with OpenMP so be gentle with me, but could someone shed some light on why this code ends up doubling the execution time rather than speeding it up?

I'm running with 4 cores, just in case that matters.

  • How did you measure time? What are your specific results? Can you provide the code in form of a minimal reproducible example? What is the specific processor model and memory setup of the system? Commented Dec 17, 2016 at 20:13

2 Answers


What is the size of a res entry? If it's less than the size of a cache line, then it's likely false sharing.


1 Comment

A res entry is 8 bytes long, so assuming a 64-byte cache line, it looks like I would want to assign 8 iterations per thread? Something like #pragma omp parallel for schedule(static,8) ?

A bare minimum for a typical CPU would be chunks of 128 bytes, and then you would need a unified last-level cache.

