
I am fairly new to OpenMP, so sorry if the question seems redundant. Here is my sequential code, which applies do_something() to every element in a row and saves the result into the next row:

    for (int i = column; i < row * column; i++)
        A[i] = do_something(A[i - column]);

I tried to parallelize it with a simple parallel for:

for (int i = 1; i < row; i++)
{   
    // I put the column iteration in the inner loop
    // because of cache misses, and I tried to schedule the
    // same thread on the same cache line to avoid false
    // sharing
    #pragma omp parallel for schedule(static, cacheline_size)
    for (int j = 0; j < column; j++)
    {
        A[i * column + j] = do_something(A[(i - 1) * column + j]);
    }
    
}

However, I am only getting around a 1.5x speedup per thread, and the speedup caps at about 2.5x with 4 threads (I have enough cores).

I attempted to put the column loop outside, but it was even slower than the sequential code. I suspect the cause is the thread creation in each iteration of the outer loop. Is there a way to improve this?

Edit: I am using g++ on Ubuntu 20.04

5 Comments
  • Parallel loops introduce a significant overhead. Putting them in a loop that executes many times is not efficient. This is especially true if the amount of work in the parallel loop is very small. Besides this, the serial and parallel code appear to be semantically different... Did you check the results are correct before analysing the performance? Commented Oct 15, 2021 at 16:39
  • @JérômeRichard Yes, the result is correct since it is not a 2D array. I am open to other ideas than parallel for, but so far I haven't thought of any. Commented Oct 15, 2021 at 17:03
  • Ok. You could reverse the loop order, but it may not be great. How expensive is do_something? How many rows and columns are there? What is the type of the items in A? How long does the sequential loop take (1us, 1ms, 1s, 1h)? Commented Oct 15, 2021 at 17:31
  • The matrix is integers, and I tested with 1000000x1000. The sequential version takes about 10s (no optimization). The problem is that the code doesn't scale. I tried to reverse the loops, but it ended up slower than sequential. Commented Oct 15, 2021 at 17:40
  • If do_something is simple, this code is almost certainly memory bandwidth bound, so adding threads won't help. (Note, also, that any sane OpenMP runtime will not be creating and destroying threads here, since it will use a thread-pool.) Commented Oct 18, 2021 at 8:18

2 Answers


The parallel loop over j is problematic. Starting the thread team and synchronizing at the end of the parallel section are quite costly, especially for a large number of threads.
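One way to pay that cost only once (a minimal sketch, not part of the original answer, reusing the question's variables) is to hoist the parallel region out of the i loop and work-share only the j loop; the implicit barrier at the end of each omp for keeps the rows in order:

    #pragma omp parallel
    for (int i = 1; i < row; i++)
    {
        // Every thread runs the i loop; only the j iterations are divided
        // among threads. The implicit barrier at the end of the "for"
        // construct ensures row i is finished before row i+1 starts.
        #pragma omp for schedule(static, cacheline_size)
        for (int j = 0; j < column; j++)
            A[i * column + j] = do_something(A[(i - 1) * column + j]);
    }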

Simply moving the j loop outside breaks locality of reference. The classic solution to this problem is tiling: split the problematic loop into two. One loop steps by cacheline_size over the whole range; the other steps by 1 from j0 to j0 + cacheline_size - 1. The outer loop exploits parallelism and should be moved outside, while the inner loop exploits locality of reference and should stay inside.

#pragma omp parallel for
for (int j0 = 0; j0 < column; j0 += cacheline_size) {
    // MIN is not standard; define it as e.g. #define MIN(a,b) ((a) < (b) ? (a) : (b)).
    // j1 clamps the last tile to the matrix edge.
    int j1 = MIN(j0 + cacheline_size, column);
    for (int i = 1; i < row; i++) {
        for (int j = j0; j < j1; ++j)
            A[i * column + j] = do_something(A[(i - 1) * column + j]);
    }
}

Selecting the best step usually requires some experimentation.


1 Comment

Thanks, this turned out to be scaling much better.

Using the benchmark below, the solution of @tstanisl is indeed much better.

#include <benchmark/benchmark.h>
#include <algorithm> // for std::min used below
#include <memory>
#include <omp.h>

constexpr int cacheline_size{64}; // chunk size in array elements (doubles), i.e. 512 bytes, not 64 bytes

double do_something(const double v)
{
    return v * 2;
}

void do_matrix_calculation(double* A, const int column, const int row)
{
    for (int i = 1; i < row; i++)
    {   
        // I put the column iteration in the inner loop
        // because of cache misses, and I tried to schedule the
        // same thread on the same cache line to avoid false
        // sharing
        #pragma omp parallel for schedule(static, cacheline_size)
        for (int j = 0; j < column; j++)
        {
            A[i * column + j] = do_something(A[(i - 1) * column + j]);
        }
    }
}

void BM_matrix_calculation(benchmark::State& state)
{
    const int n_threads = state.range(0);
    const int cols = state.range(1);
    const int rows = state.range(2);
    std::unique_ptr<double[]> arr = std::make_unique<double[]>(cols * rows);

    omp_set_num_threads(n_threads);

    for (auto _ : state)
    {
        do_matrix_calculation(arr.get(), cols, rows);
    }
}

void do_matrix_calculation2(double* A, const int column, const int row)
{
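    // Note: collapse(2) also distributes the i (row) iterations across threads,
    // so one thread may read row i-1 while another is still writing it; the
    // results can be wrong. This variant is kept for timing comparison only.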
    #pragma omp parallel for schedule(static, cacheline_size) collapse(2)
    for (int i = 1; i < row; i++)
    {   
        for (int j = 0; j < column; j++)
        {
            A[i * column + j] = do_something(A[(i - 1) * column + j]);
        }
    }
}

void BM_matrix_calculation2(benchmark::State& state)
{
    const int n_threads = state.range(0);
    const int cols = state.range(1);
    const int rows = state.range(2);
    std::unique_ptr<double[]> arr = std::make_unique<double[]>(cols * rows);

    omp_set_num_threads(n_threads);

    for (auto _ : state)
    {
        do_matrix_calculation2(arr.get(), cols, rows);
    }
}

void do_matrix_calculation3(double* A, const int column, const int row)
{
    #pragma omp parallel for
    for (int j0 = 0; j0 < column; j0 += cacheline_size) 
    {
        const int j1 = std::min(j0 + cacheline_size, column); // clamp the last tile
        for (int i = 1; i < row; i++) 
        {
            for (int j = j0; j < j1; ++j)
                A[i * column + j] = do_something(A[(i - 1) * column + j]);
        }
    }
}

void BM_matrix_calculation3(benchmark::State& state)
{
    const int n_threads = state.range(0);
    const int cols = state.range(1);
    const int rows = state.range(2);
    std::unique_ptr<double[]> arr = std::make_unique<double[]>(cols * rows);

    omp_set_num_threads(n_threads);

    for (auto _ : state)
    {
        do_matrix_calculation3(arr.get(), cols, rows);
    }
}

BENCHMARK(BM_matrix_calculation)
    ->Args({1, 1024, 1024})
    ->Args({2, 1024, 1024})
    ->Args({4, 1024, 1024})
    ->Args({8, 1024, 1024});

BENCHMARK(BM_matrix_calculation2)
    ->Args({1, 1024, 1024})
    ->Args({2, 1024, 1024})
    ->Args({4, 1024, 1024})
    ->Args({8, 1024, 1024});

BENCHMARK(BM_matrix_calculation3)
    ->Args({1, 1024, 1024})
    ->Args({2, 1024, 1024})
    ->Args({4, 1024, 1024})
    ->Args({8, 1024, 1024});

BENCHMARK_MAIN();

The difference in performance between BM_matrix_calculation and BM_matrix_calculation2 seems to be sensitive to the cacheline_size chunk constant. Of course, this needs to be tuned for your particular machine.
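As a starting point for that tuning, here is a hedged sketch (glibc/Linux-specific, matching the Ubuntu 20.04 setup from the question) that queries the hardware L1 cache-line size at runtime and converts it into a count of doubles:

    #include <unistd.h> // sysconf; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension

    // Returns the L1 data cache line size as a number of doubles,
    // falling back to 64 bytes when sysconf cannot report it.
    int cacheline_elems()
    {
        long bytes = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (bytes <= 0)
            bytes = 64; // common line size on x86-64
        return static_cast<int>(bytes / sizeof(double));
    }

Note that the benchmark's cacheline_size constant counts 64 array elements (512 bytes), so the best chunk may well span several hardware lines; treating the queried value as a lower bound and scanning upward is a reasonable experiment.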
