
I have the following code which is basically a forward substitution of a lower triangular matrix.

const auto& rowptr = *matrix.get_rowptr();
const auto& diag   = *matrix.get_diagonal_index();
const auto& colidx = *matrix.get_columnindex();
const auto& values = *matrix.get_value();

for (int i = 0; i < static_cast<int>(rowptr.size()) - 1; ++i)
{
    double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
    for (int j = rowptr[i]; j < diag[i]; ++j)
    {
        sum += values[j] * result[colidx[j]];
    }
    result[i] = vector1[i] - sum;
}

The first loop goes over the rows and the second one over the columns. The inner loop does at least 100 iterations on average. I tried to use OpenMP to parallelize the inner loop by simply adding #pragma omp parallel for, but the wall time increased. Is there a way to parallelize this piece of code in a good way?

Thanks in advance. Best regards.

  • Sorry. I have added the code line. Commented Aug 28, 2018 at 21:09
  • 2
    More details: how much work are the loops doing ? How did you time the execution ? What hardware resources do you have ? Commented Aug 28, 2018 at 21:21
  • The outer loop is performed about 50000 times and the inner loop minimum 100 times per outer loop. I measured the time using the high resolution clock before and after calling the function where the loop is called. The hardware is Intel Xeon E5-2640. Commented Aug 29, 2018 at 5:45
  • Usually you'd want it the other way around: you don't want to enter a parallel region many times, and you want each region to be as long as possible. 100 iterations is such a small number that a single thread might be faster, since the overhead of parallelizing can exceed the work itself. On a side note, how are get_rowptr and get_columnindex defined? Commented Aug 29, 2018 at 7:11
  • Ok, I understand. I will try to find another way. The functions get_rowptr and get_columnindex return a pointer to a std::vector. Commented Aug 29, 2018 at 7:17

1 Answer


As explained in the comments, the poor performance is caused by entering a small parallel region once per row of the outer loop, so the threading overhead dominates the ~100 iterations of useful work per row. After rewriting the code so that the parallelization is applied to the outer loop instead, performance improves as the number of threads increases.
