
Context

Hello!

I was finishing up some hands-on experience with POSIX pthreads and OpenMP, comparing both against a serial implementation of Insertion Sort to see which one performs well on which kinds of input.

The results I got were weird, to say the least.

Maybe my code is buggy. My pthread implementation certainly is and needs a lot more work, since it's a very rough implementation. However, looking at the times taken by OpenMP and Serial, I saw very strange trends.

Code

Before I show you the graph of the time taken that I got, here is the code for each.

For Serial,


static inline void *
insertionSort(void *arrayMetaDataToUnpack)
{
    // UNPACK START
    ull *_Array = ((InsertionSortPacking *)arrayMetaDataToUnpack)->Array;
    size_t sizeOfArray = ((InsertionSortPacking *)arrayMetaDataToUnpack)->ArraySize;
    // UNPACK END

    // Compulsory use of reinterpret_cast to keep everything consistent
    ull *__tempArray = reinterpret_cast<ull *>(_Array);

    for (int i = 1; i < sizeOfArray; i++)
    {
        int __temp = __tempArray[i];
        int j = i - 1;

        while (j >= 0 && __temp <= __tempArray[j])
        {
            __tempArray[j + 1] = __tempArray[j];
            j = j - 1;
        }

        {
            __tempArray[j + 1] = __temp;
        }
    }

    return __tempArray;
}

The OpenMP implementation,

static inline void *
openmp_insertionSort(void *arrayMetaDataToUnpack)
{
    // UNPACK START
    ull *_Array = ((InsertionSortPacking *)arrayMetaDataToUnpack)->Array;
    size_t sizeOfArray = ((InsertionSortPacking *)arrayMetaDataToUnpack)->ArraySize;
    // UNPACK END

    // Compulsory use of reinterpret_cast to keep everything consistent
    ull *__tempArray = reinterpret_cast<ull *>(_Array);

#pragma omp for
    for (int i = 1; i < sizeOfArray; i++)
    {
        int __temp = __tempArray[i];
        int j = i - 1;

        while (j >= 0 && __temp <= __tempArray[j])
        {
            __tempArray[j + 1] = __tempArray[j];
            j = j - 1;
        }

#pragma omp critical
        {
            __tempArray[j + 1] = __temp;
        }
    }

    return __tempArray;
}

Again, as a disclaimer: I'm new to OpenMP and may not have parallelized the code well, but this is all I know how to use at a beginner stage. It does do the sorting well, without race conditions.

Inputs

So the input sizes were:

    const int INPUTSIZE = 25;

    int inputSizes[INPUTSIZE] = {1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 
                                5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 
                                10000, 10500, 11000, 11500, 12000, 12500, 13000}; 

My program uses clock_t to measure the time taken by each method for each input size and writes the results to a CSV file.

Graph

The final graph was generated from the CSV file mentioned above; it looks like this:

Graph here

Where,

  • X axis is the input size
  • Y axis is the time taken in seconds
  • Serial is the serial implementation
  • Pthread is the pthread implementation
  • OpenMP is the OpenMP implementation
  • Time Taken ... is the time taken for my program to generate an inputSize[i] sized array

Machine Details

  • Measured time using the logic below,
    clock_t __start = clock();

    Array = functionName((void *)&packingMetadata);

    clock_t __end = clock();

    double __inputTimeUsed = ((double)(__end - __start)) / CLOCKS_PER_SEC;
    std::cout << "Overall time taken, using " << printStatement << __inputTimeUsed << ".\n";

Here functionName is a function pointer to the serial, pthread, or OpenMP version, and packingMetadata is a struct passed to it, consisting of a pointer to an array created in main() and the array size.

  • Measurements to reduce noise ... not sure what exactly that means (a minimal sketch of one way to do this is shown right after this list)

  • Compilation command is g++ -std=gnu++17 -std=c++17 insertion.cpp -o insertion -lpthread

  • Compiler is g++

  • Machine info: Linux kernel 5.4.0-47-generic, 64-bit, 8 × Intel® Core™ i7-8665U CPU @ 1.90GHz, 15.5 GiB of RAM
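
For reference, here is a minimal, self-contained sketch of a repeated wall-clock measurement. workUnderTest is a hypothetical placeholder standing in for the sort being timed, not a function from the code above; the point is that clock() measures CPU time (summed over all threads), so a wall-clock timer such as std::chrono::steady_clock, plus a few repetitions, is usually the safer way to compare serial and parallel runs.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <limits>
#include <vector>

// Hypothetical placeholder for the sort under test.
static void workUnderTest(std::vector<unsigned long long> &v)
{
    std::sort(v.begin(), v.end());
}

int main()
{
    // Build one fixed input so every run sorts the same data.
    std::vector<unsigned long long> input(10000);
    for (std::size_t i = 0; i < input.size(); i++)
        input[i] = input.size() - i;                      // reverse-sorted

    const int runs = 5;
    double best = std::numeric_limits<double>::max();
    for (int run = 0; run < runs; run++)                  // repeat to reduce noise
    {
        std::vector<unsigned long long> copy = input;     // fresh copy per run
        auto start = std::chrono::steady_clock::now();
        workUnderTest(copy);
        auto end = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(end - start).count());
    }
    std::cout << "Best of " << runs << " runs: " << best << " s\n";
}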

Problems / Questions

My pthread implementation is bad, got it; that's my fault and can be improved. But what about the OpenMP implementation? Why does that happen? Why are the timings so zig-zagged?

So the things I wanted to ask were many, but mainly:

  1. Why the spikes and randomness?
  2. How do I go about seeing how a parallel implementation can be improved further?
  • Measuring runtime requires paying even more attention to details than usual. How did you measure the time? Did you take enough measurements to reduce noise? How did you compile? Optimizations turned on? Which compiler? What machine is this running on? Commented Sep 17, 2020 at 18:59
  • Frankly, the most confusing thing about the graph is that x and y are flipped ;) Commented Sep 17, 2020 at 19:00
  • I have to admit only now I looked closely. It seems that you also got confused by the flipped x/y. In your graph, pthread consistently takes less time for the same input size than the other implementations. 1) Your data is not particularly noisy. Any measurement is influenced by noise, and you can reduce noise by repeating the measurement. 2) Assuming your code is correct and you can provide a minimal reproducible example, maybe you could try codereview.stackexchange.com Commented Sep 17, 2020 at 19:34
  • Hey, @idclev463035818, thanks for your suggestions! Just changed the graph (flipped the axes properly), please check it out. Added machine info. I was unsure what you meant by noise. I'm very sorry, I'm new to all this - I don't exactly know what you meant. I'll try to make a minimal reproducible example sometime later on; I have some other things on my hands right now. :/ Btw, thank you for the remarks! Appreciated. Commented Sep 17, 2020 at 19:43
  • You have to turn on optimizations (-O2 / -O3) plus whatever flags play well with OpenMP. You measured the runtime of a debug build, which is not representative of real runtimes. Commented Sep 17, 2020 at 19:50

1 Answer

Maybe it could be that my code is buggy. [...]. It does do the sorting well, without race conditions.

Actually, yes, you should check the results more closely: an insertion sort cannot be parallelized by just adding an omp for and an omp critical. The current code is not working correctly, as the insertions can produce wrong results.

Let us take a step back to understand why it is not so simple. Indeed, we are trying to insert multiple elements in parallel. However, an insertion requires moving the elements of the array that are greater, and moving those elements can hardly be done in parallel (at least not efficiently). Putting the while loop in a critical section would make the whole sort sequential. So we need to find another way to parallelize the code; more precisely, we need to find independent work that can run in parallel.

One solution is to look for the positions of the elements to insert in parallel, then perform a barrier, and then move the elements in parallel. However, this solution is not very efficient, especially for small arrays (due to over-synchronization and small work granularity).

A much better solution is to divide (virtually) the array into N parts, perform an independent insertion sort on each part in parallel, then make a barrier, and finally merge the sorted parts.
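
For illustration, here is a minimal sketch of that idea, assuming a helper that insertion-sorts a sub-range and using std::inplace_merge for the merge step. This is an illustrative sketch (compiled with -fopenmp), not the answerer's implementation, and not a drop-in replacement for the code in the question:

#include <algorithm>
#include <cstddef>
#include <vector>

using ull = unsigned long long;

// Plain serial insertion sort on the half-open range [lo, hi).
void insertionSortRange(ull *a, std::size_t lo, std::size_t hi)
{
    for (std::size_t i = lo + 1; i < hi; i++)
    {
        ull key = a[i];
        std::size_t j = i;
        while (j > lo && a[j - 1] > key)
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

void parallelInsertionSort(ull *a, std::size_t n, int chunks)
{
    // Chunk boundaries: chunk c covers [bounds[c], bounds[c + 1]).
    std::vector<std::size_t> bounds(chunks + 1);
    for (int c = 0; c <= chunks; c++)
        bounds[c] = n * c / chunks;

    // Sort each chunk independently; the implicit barrier at the end of the
    // parallel for guarantees all chunks are sorted before the merge starts.
#pragma omp parallel for schedule(static)
    for (int c = 0; c < chunks; c++)
        insertionSortRange(a, bounds[c], bounds[c + 1]);

    // Merge the sorted chunks one by one (kept serial here for simplicity).
    for (int c = 1; c < chunks; c++)
        std::inplace_merge(a, a + bounds[c], a + bounds[c + 1]);
}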

Note that insertion sort is generally not efficient compared to other algorithms like quick sort or merge sort on reasonably large data. If you aim for high performance, it is probably wise to directly implement a parallel quick/merge sort (or even better: a parallel introsort).


2 Comments

Yes! I don't think I mentioned it in my original post, but the approach I went for in the pthread implementation (instead of OpenMP) was to break the array down into N parts, sort each part in parallel, join the threads, and then merge the arrays to give back a completely sorted array. How would you suggest I do this in OpenMP?
One way is to make a parallel loop iterating over 0..threadCount with a static schedule and call a function performing the insertion sort over a range in each loop iteration (#pragma omp parallel for schedule(static)). An implicit synchronization is performed after the loop. Another way (a bit more complex) is to recursively create OpenMP tasks up to a given limit and then merge the results (in the recursive calls). In such a case, tasks should be carefully awaited (using taskgroup or taskwait).
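
To make the second (task-based) suggestion concrete, here is a rough sketch that reuses the hypothetical insertionSortRange helper from the earlier sketch. Again, this is an illustrative sketch under those assumptions, not the commenter's implementation:

#include <algorithm>
#include <cstddef>

using ull = unsigned long long;

// Serial insertion sort over [lo, hi), as defined in the earlier sketch.
void insertionSortRange(ull *a, std::size_t lo, std::size_t hi);

static void sortTask(ull *a, std::size_t lo, std::size_t hi, std::size_t cutoff)
{
    if (hi - lo <= cutoff)
    {
        insertionSortRange(a, lo, hi);         // small range: sort it directly
        return;
    }

    std::size_t mid = lo + (hi - lo) / 2;

    // Sort the two halves as independent tasks ...
#pragma omp task
    sortTask(a, lo, mid, cutoff);
#pragma omp task
    sortTask(a, mid, hi, cutoff);
    // ... wait for both of them, then merge the sorted halves.
#pragma omp taskwait
    std::inplace_merge(a + lo, a + mid, a + hi);
}

void parallelSort(ull *a, std::size_t n)
{
#pragma omp parallel                 // create the thread team
#pragma omp single                   // a single thread spawns the root task
    sortTask(a, 0, n, 2048);         // cutoff chosen arbitrarily for the sketch
}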
