
I have the following code which is basically a forward substitution of a lower triangular matrix.

const auto& rowptr = *matrix.get_rowptr();
const auto& diag   = *matrix.get_diagonal_index();
const auto& colidx = *matrix.get_columnindex();
const auto& values = *matrix.get_value();

for (int i = 0; i < static_cast<int>(rowptr.size()) - 1; ++i)
{
    double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
    for (int j = rowptr[i]; j < diag[i]; ++j)
    {
        sum += values[j] * result[colidx[j]];
    }
    result[i] = vector1[i] - sum;
}

The first loop goes over the rows and the second one over the columns. The inner loop does at least 100 iterations on average. I tried to use OpenMP to parallelize the inner loop by simply adding #pragma omp parallel for, but the wall time increased. Is there a way to parallelize this piece of code in a good way?

Thanks in advance. Best regards.

  • Sorry. I have added the code line. Commented Aug 28, 2018 at 21:09
  • 2
    More details: how much work are the loops doing ? How did you time the execution ? What hardware resources do you have ? Commented Aug 28, 2018 at 21:21
  • The outer loop is performed about 50000 times and the inner loop minimum 100 times per outer loop. I measured the time using the high resolution clock before and after calling the function where the loop is called. The hardware is Intel Xeon E5-2640. Commented Aug 29, 2018 at 5:45
  • Usually you'd want it the other way around: you don't want to enter a parallel region many times, and you want each region to be as long as possible. 100 iterations is such a small number that a single thread might be faster, since the overhead of parallelizing can exceed the work itself. On a side note, how are get_rowptr and get_columnindex defined? Commented Aug 29, 2018 at 7:11
  • Ok, I understand. I will try to find another way. The functions get_rowptr and get_columnindex return a pointer to a std::vector. Commented Aug 29, 2018 at 7:17

1 Answer


As explained in the comments, the poor performance is caused by entering a small parallel region once per row of the outer loop, so the threading overhead dominates the ~100 iterations of useful work per row. After rewriting the code so that the parallelization is applied to the outer loop instead, performance improves as the number of threads increases.
