
After I try to parallelize the code with OpenMP, the elements in the array are wrong; the order of the elements is not important. Or is it more convenient to use a C++ std::vector instead of an array for parallelization? Could you suggest an easy way?

#include <stdio.h>
#include <math.h>

int main()
{
    int n = 100;
    int a[n*(n+1)/2]={0};
    int count=0;

    #pragma omp parallel for reduction(+:a,count)
    for (int i = 1; i <= n; i++) {
        for (int j = i + 1; j <= n; j++) {
            double k = sqrt(i * i + j * j);
            if (fabs(round(k) - k) < 1e-10) {
                a[count++] = i;
                a[count++] = j;
                a[count++] = (int) k;
            }
        }
    }
    
    for(int i=0;i<count;i++)
        printf("%d %s",a[i],(i+1)%3?"":", ");
    printf("\ncount: %d", count);
    return 0;
}

Original output:

3 4 5 , 5 12 13 , 6 8 10 , 7 24 25 , 8 15 17 , 9 12 15 , 9 40 41 , 10 24 26 , 11 60 61 , 12 16 20 , 12 35 37 , 13 84 85 , 14 48 50 , 15 20 25 , 15 36 39 , 16 30 34 , 16 63 65 , 18 24 30 , 18 80 82 , 20 21 29 , 20 48 52 , 20 99 101 , 21 28 35 , 21 72 75 , 24 32 40 , 24 45 51 , 24 70 74 , 25 60 65 , 27 36 45 , 28 45 53 , 28 96 100 , 30 40 50 , 30 72 78 , 32 60 68 , 33 44 55 , 33 56 65 , 35 84 91 , 36 48 60 , 36 77 85 , 39 52 65 , 39 80 89 , 40 42 58 , 40 75 85 , 40 96 104 , 42 56 70 , 45 60 75 , 48 55 73 , 48 64 80 , 48 90 102 , 51 68 85 , 54 72 90 , 56 90 106 , 57 76 95 , 60 63 87 , 60 80 100 , 60 91 109 , 63 84 105 , 65 72 97 , 66 88 110 , 69 92 115 , 72 96 120 , 75 100 125 , 80 84 116 ,
count: 189

After using OpenMP (gcc file.c -fopenmp):

411 538 679 , 344 609 711 , 354 533 649 , 218 387 449 , 225 475 534 , 182 283 339 , 81 161 182 , 74 190 204 , 77 138 159 , 79 176 195 , 18 24 30 , 18 80 82 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 , 0 0 0 ,
count: 189

  • As you are asking about vector, this seems to be C++, not C. You should then use the C++ headers and not tag the question C.
  • Also note that int a[n*(n+1)/2]={0}; is OK in C, but in C++ only as a compiler extension: Why aren't variable-length arrays part of the C++ standard?
  • Wrong in what way? The order will probably vary.
  • By the way, be careful with the size formula: since n is an int, an expression like n*(n+1/2) evaluates the same as n*n because of integer division (while your loop uses far fewer elements).
  • Your reduction clause is not doing what you expect it to do. If I interpret this right, you want to concatenate the private versions of a, not add them together.

3 Answers


Your threads are all accessing the shared count.

You would be better off eliminating count and having each loop iteration determine where to write its output based only on the (per-thread) values of i and j.
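For illustration, here is a minimal sketch of that scheme (my own sketch, not part of the original answer): every pair (i, j) with i < j maps to a fixed slot computed from i and j alone, so no shared counter is needed; empty slots stay 0 and are skipped on output.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    int const n = 100;
    long const pairs = (long)n * (n - 1) / 2;   // one slot per (i, j) pair
    std::vector<int> a(3 * pairs, 0);           // 0 marks an unused slot (i >= 1)

    #pragma omp parallel for schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt((double)(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10) {
                // Unique slot for (i, j): number of pairs in rows 1..i-1,
                // plus the offset of j within row i.
                long const idx = (long)(i - 1) * n - (long)i * (i - 1) / 2
                               + (j - i - 1);
                a[3 * idx]     = i;
                a[3 * idx + 1] = j;
                a[3 * idx + 2] = (int)std::round(k);
            }
        }
    }

    int count = 0;                              // compact on output
    for (long s = 0; s < pairs; ++s) {
        if (a[3 * s] != 0) {
            std::printf("%d %d %d , ", a[3 * s], a[3 * s + 1], a[3 * s + 2]);
            count += 3;
        }
    }
    std::printf("\ncount: %d\n", count);
}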

Alternatively, use a vector to accumulate the results:

#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

#pragma omp declare                                                        \
    reduction(vec_append : std::vector<std::pair<int,int>> :               \
              omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

int main()
{
    constexpr int n = 100'000;
    std::vector<std::pair<int,int>> result;

#pragma omp parallel for \
            reduction(vec_append:result) \
            schedule(dynamic)
    for (int i = 1;  i <= n;  ++i) {
        for (int j = i + 1;  j <= n;  ++j) {
            // widen to 64-bit: i*i + j*j overflows int for n this large
            auto const h2 = 1LL * i * i + 1LL * j * j;  // hypotenuse squared
            auto const h = std::llround(std::sqrt(h2)); // integer square root
            if (h * h == h2) {
                result.emplace_back(i, j);
            }
        }
    }

    // for (auto const& v: result) {
    //     std::cout << v.first << ' '
    //               << v.second << ' '
    //               << std::hypot(v.first, v.second) << ' ';
    // }
    std::cout << "\ncount: " << result.size() << '\n';
}
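The declared vec_append reduction gives every thread its own empty vector and splices them together when the parallel region ends, so the loop body needs no locking. Build with OpenMP enabled, e.g. g++ -O2 -fopenmp file.cpp (flags assume GCC; other compilers use different switches).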

6 Comments

Not sure if this interacts in the right way with OpenMP's reduction clause, but at least in the serial version it would be clever to use std::vector::reserve here to avoid reallocations inside the loops. What certainly should work would be separating the parallel and the for regions and doing the reserve in between with #pragma omp parallel private(result), right? (See the sketch after this comment thread.)
It's hard to know how much to reserve(), unless there's some formula for estimating the number of Pythagorean triples we expect to find. OP's code is very conservative, assuming up to a third of the (i, j) pairs would be matches. And yes, we'd need to reserve both the private and shared versions of result, which is probably even harder to estimate. So I just ignored the issue and let the vector do its own thing.
Fair enough. I guess one could reserve the maximum (n * (n - 1) / 2) for the shared version and divide that number by omp_get_num_threads() for the private ones, but one would have to benchmark, if it makes a significant, positive difference.
While std::vector::reserve() doesn't seem to do much here, there are two other details that improved on your solution by almost a factor of 3 on my system: 1. use schedule(dynamic), as the workload will be very unbalanced with the default static scheduling. 2. Don't use std::hypot here, as it will compute the squares in double format. Multiplying doubles is much more expensive than multiplying ints or even longs (which will be needed to prevent overflow). As this algorithm is not memory-bound (on my system), this kind of stuff is actually significant.
I forgot that std::hypot() isn't overloaded for integers - good call. I was surprised that schedule(dynamic) makes a difference - I forgot that most of the time would be spent inside the if condition. I'll update the code with these changes. Thanks again!
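For reference, a runnable sketch of the reserve-between-regions idea from this thread: the reduction is moved onto the parallel construct, so each thread's private copy of result exists before the loop and can reserve up front. per_thread_guess is a made-up estimate, and vec_append is the reduction declared in the answer above.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

#pragma omp declare                                                        \
    reduction(vec_append : std::vector<std::pair<int,int>> :               \
              omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

int main()
{
    constexpr int n = 100;
    constexpr std::size_t per_thread_guess = 64; // hypothetical estimate

    std::vector<std::pair<int,int>> result;

    #pragma omp parallel reduction(vec_append : result)
    {
        // Each thread reserves its own private copy of `result`.
        result.reserve(per_thread_guess);

        #pragma omp for schedule(dynamic) nowait
        for (int i = 1; i <= n; ++i) {
            for (int j = i + 1; j <= n; ++j) {
                long long const h2 = 1LL * i * i + 1LL * j * j;
                long long const h = std::llround(std::sqrt(h2));
                if (h * h == h2)
                    result.emplace_back(i, j);
            }
        }
    } // private vectors are concatenated into the shared `result` here

    std::cout << "count: " << result.size() << '\n';
}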

As an alternative to using a critical section, this solution uses atomics and could therefore be faster.

The following code might freeze your computer due to memory consumption. Be careful!

#include <cstdio>
#include <cmath>

#include <vector>

int main() {
    int const n = 100;
    // without a better (smaller) upper_bound this is extremely
    // wasteful in terms of memory for big n 
    long const upper_bound = 3L * static_cast<long>(n) *
                             (static_cast<long>(n) - 1L) / 2L;
    std::vector<int> a(upper_bound, 0);
    int count = 0;

    #pragma omp parallel for schedule(dynamic) shared(a, count)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));

            if (std::fabs(std::round(k) - k) < 1e-10) {
                int my_pos;
                #pragma omp atomic capture
                my_pos = count++;

                a[3 * my_pos] = i;
                a[3 * my_pos + 1] = j;
                a[3 * my_pos + 2] = static_cast<int>(std::round(k));
            }
        }
    }
    count *= 3;
    
    for(int i = 0; i < count; ++i) {
        std::printf("%d %s", a[i], (i + 1) % 3 ? "" : ", ");
    }
    printf("\ncount: %d", count);

    return 0;
}

EDIT: My answer was initially a reaction to a by-now-deleted answer that used a critical section in a sub-optimal way. In the following I present another solution, which combines a critical section with std::vector::emplace_back() to avoid the need for upper_bound, similar to Toby Speight's solution. In general, a reduction clause as in Toby Speight's solution should be preferred over critical sections and atomics, as reductions tend to scale better to large numbers of threads. In this particular case (relatively few results are written to a), and without a large number of cores to run on, the following code might still be preferable.

#include <cstdio>
#include <cmath>

#include <tuple>
#include <vector>

int main() {
    int const n = 100;

    std::vector<std::tuple<int, int, int>> a{};
    
    // optional, might reduce number of reallocations
    a.reserve(2 * n); // 2 * n is an arbitrary choice

    #pragma omp parallel for schedule(dynamic) shared(a)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt(static_cast<double>(i * i + j * j));

            if (std::fabs(std::round(k) - k) < 1e-10) {
                #pragma omp critical
                a.emplace_back(i, j, static_cast<int>(std::round(k)));
            }
        }
    }
    long const count = 3L * static_cast<long>(a.size());
    
    for(unsigned long i = 0UL; i < a.size(); ++i) {
        std::printf("%d %d %d\n",
                    std::get<0>(a[i]), std::get<1>(a[i]), std::get<2>(a[i]));
    }
    printf("\ncount: %ld", count);

    return 0;
}

5 Comments

Two corrections: std::vector<int> a(n, 0); should be std::vector<int> a(n*(n-1)/2, 0); and mypos should be my_pos in a[3 * mypos + 2] = static_cast<int>(std::round(k));
This version runs faster than @Toby Speight's one. With n = 20'000, g++ 10.2, -O3 -fopenmp: real 0m0.231s vs. 0m0.267s, user 0m1.060s vs. 0m2.443s.
@Laci Even if this result were independent of hardware and n, it might still be different for the full code (vs minimal example) of the OP or any future reader. So to me it doesn't really matter which one is faster for the minimal example.
OK. But there is a strange issue: if n = 50'000 the counts are different: 226383 vs. 260250 (3*86750). Note that I have changed int const n = 100; to size_t const n = 50000;. Any idea why?
OK, I have figured it out: 50000*50000 is bigger than the maximum value of an int.
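For future readers, a minimal demonstration of the overflow found in the last comment, and the usual fix of widening before multiplying (my illustration, not from the answer):

#include <cstdio>

int main()
{
    int const i = 50'000, j = 50'000;
    // In int arithmetic, i * i alone is 2'500'000'000 > INT_MAX, which is
    // undefined behaviour. Promoting one operand to 64-bit avoids it:
    long long const h2 = 1LL * i * i + 1LL * j * j;
    std::printf("%lld\n", h2); // prints 5000000000
}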

The count variable is an index into a. The reduction(+:a,count) clause is summing the arrays element-wise; it is not the concatenation operation that I think you are looking for.

Accesses to count need to be protected by a mutex, something like #pragma omp critical, but I am not an OpenMP expert.

Alternatively, create int a[n][n], set all of them to -1 (a sentinel value to indicate "invalid") then assign the result of the sqrt() when it is near enough to a whole number.
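A minimal sketch of that sentinel idea (my illustration, using std::vector rather than a VLA since the discussion settled on C++): every iteration writes only to its own cell, so no mutex is needed.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    int const n = 100;
    // c[i][j] holds the integer hypotenuse of (i, j), or -1 ("invalid").
    std::vector<std::vector<int>> c(n + 1, std::vector<int>(n + 1, -1));

    #pragma omp parallel for schedule(dynamic)
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            double const k = std::sqrt((double)(i * i + j * j));
            if (std::fabs(std::round(k) - k) < 1e-10)
                c[i][j] = (int)std::round(k); // each iteration owns its cell
        }
    }

    int count = 0;
    for (int i = 1; i <= n; ++i)
        for (int j = i + 1; j <= n; ++j)
            if (c[i][j] != -1) {
                std::printf("%d %d %d , ", i, j, c[i][j]);
                count += 3;
            }
    std::printf("\ncount: %d\n", count);
}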

