
Context

Hello!

I was finishing up some hands-on experience with POSIX pthreads and OpenMP, comparing both against a serial implementation of Insertion Sort to see which one performs well on which kinds of input.

The results I got were weird, to say the least.

Maybe my code is buggy. My pthread implementation certainly is and needs a lot more work, since it's a very rough implementation. However, looking at the times taken by OpenMP and Serial, I saw very strange trends.

Code

Before I show you the graph of the time taken that I got, here is the code for each.

For Serial,


static inline void *
insertionSort(void *arrayMetaDataToUnpack)
{
    // UNPACK START
    ull *_Array = ((InsertionSortPacking *)arrayMetaDataToUnpack)->Array;
    size_t sizeOfArray = ((InsertionSortPacking *)arrayMetaDataToUnpack)->ArraySize;
    // UNPACK END

    // Compulsory use of reinterpret_cast to keep everything consistent
    ull *__tempArray = reinterpret_cast<ull *>(_Array);

    for (int i = 1; i < sizeOfArray; i++)
    {
        int __temp = __tempArray[i];
        int j = i - 1;

        while (j >= 0 && __temp <= __tempArray[j])
        {
            __tempArray[j + 1] = __tempArray[j];
            j = j - 1;
        }

        {
            __tempArray[j + 1] = __temp;
        }
    }

    return __tempArray;
}

The OpenMP implementation,

static inline void *
openmp_insertionSort(void *arrayMetaDataToUnpack)
{
    // UNPACK START
    ull *_Array = ((InsertionSortPacking *)arrayMetaDataToUnpack)->Array;
    size_t sizeOfArray = ((InsertionSortPacking *)arrayMetaDataToUnpack)->ArraySize;
    // UNPACK END

    // Compulsory use of reinterpret_cast to keep everything consistent
    ull *__tempArray = reinterpret_cast<ull *>(_Array);

#pragma omp for
    for (int i = 1; i < sizeOfArray; i++)
    {
        int __temp = __tempArray[i];
        int j = i - 1;

        while (j >= 0 && __temp <= __tempArray[j])
        {
            __tempArray[j + 1] = __tempArray[j];
            j = j - 1;
        }

#pragma omp critical
        {
            __tempArray[j + 1] = __temp;
        }
    }

    return __tempArray;
}

Again, as a disclaimer: I'm new to OpenMP and may not have parallelized the code well, but this is all I know how to use at a beginner stage. It does do the sorting well, without race conditions.

Inputs

So the input sizes were:

    const int INPUTSIZE = 25;

    int inputSizes[INPUTSIZE] = {1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 
                                5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 
                                10000, 10500, 11000, 11500, 12000, 12500, 13000}; 

My program uses clock_t to measure the time taken by each method for each input size and writes the results to a CSV file.

Graph

The final graph was generated from the CSV file mentioned above; it looks like this:

Graph here

Where,

  • X axis is the input size
  • Y axis is the time taken in seconds
  • Serial is the serial implementation
  • Pthread is the pthread implementation
  • OpenMP is the OpenMP implementation
  • Time Taken ... is the time taken for my program to generate an inputSize[i] sized array

Machine Details

  • Measured time using the logic below,
    clock_t __start = clock();

    Array = functionName((void *)&packingMetadata);

    clock_t __end = clock();

    double __inputTimeUsed = ((double)(__end - __start)) / CLOCKS_PER_SEC;
    std::cout << "Overall time taken, using " << printStatement << __inputTimeUsed << ".\n";

Here functionName is a function pointer to the serial, pthread, or OpenMP version, and packingMetadata is a struct passed to it, consisting of a pointer to an array created in main() and the array size.

  • Measurements to reduce noise ... not sure what exactly that means (a minimal sketch of one way to do this is shown right after this list)

  • Compilation command is g++ -std=gnu++17 -std=c++17 insertion.cpp -o insertion -lpthread

  • Compiler is g++

  • Machine info: Linux kernel 5.4.0-47-generic, 64-bit, 8 × Intel® Core™ i7-8665U CPU @ 1.90GHz, 15.5 GiB of RAM
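
For reference, here is a minimal, self-contained sketch of a repeated wall-clock measurement. workUnderTest is a hypothetical placeholder standing in for the sort being timed, not a function from the code above; the point is that clock() measures CPU time (summed over all threads), so a wall-clock timer such as std::chrono::steady_clock, plus a few repetitions, is usually the safer way to compare serial and parallel runs.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <limits>
#include <vector>

// Hypothetical placeholder for the sort under test.
static void workUnderTest(std::vector<unsigned long long> &v)
{
    std::sort(v.begin(), v.end());
}

int main()
{
    // Build one fixed input so every run sorts the same data.
    std::vector<unsigned long long> input(10000);
    for (std::size_t i = 0; i < input.size(); i++)
        input[i] = input.size() - i;                      // reverse-sorted

    const int runs = 5;
    double best = std::numeric_limits<double>::max();
    for (int run = 0; run < runs; run++)                  // repeat to reduce noise
    {
        std::vector<unsigned long long> copy = input;     // fresh copy per run
        auto start = std::chrono::steady_clock::now();
        workUnderTest(copy);
        auto end = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(end - start).count());
    }
    std::cout << "Best of " << runs << " runs: " << best << " s\n";
}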

Problems / Questions

My pthread implementation is bad, got it; that's my fault and can be improved. But what about the OpenMP implementation? Why does that happen? Why are the timings so zig-zagged?

So the things I wanted to ask were many, but mainly:

  1. Why the spikes and randomness?
  2. How do I go about seeing how a parallel implementation can be improved further?
  • Measuring runtime requires paying even more attention to details than usual. How did you measure the time? Did you take enough measurements to reduce noise? How did you compile? Optimizations turned on? Which compiler? What machine is this running on? Commented Sep 17, 2020 at 18:59
  • Frankly, the most confusing thing about the graph is that x and y are flipped ;) Commented Sep 17, 2020 at 19:00
  • I have to admit only now I looked closely. It seems that you also got confused by the flipped x/y. In your graph, pthread consistently takes less time for the same input size than the other implementations. 1) Your data is not particularly noisy. Any measurement is influenced by noise, and you can reduce noise by repeating the measurement. 2) Assuming your code is correct and you can provide a minimal reproducible example, maybe you could try codereview.stackexchange.com Commented Sep 17, 2020 at 19:34
  • Hey, @idclev463035818, thanks for your suggestions! Just changed the graph (flipped the axes properly), please check it out. Added machine info. I was unsure what you meant by noise. I'm very sorry, I'm new to all this - I don't exactly know what you meant. I'll try to make a minimal reproducible example sometime later on; I have some other things on my hands right now. :/ Btw, thank you for the remarks! Appreciated. Commented Sep 17, 2020 at 19:43
  • You have to turn on optimizations (-O2 / -O3) plus whatever flags play well with OpenMP. You measured the runtime of a debug build, which is not representative of real runtimes. Commented Sep 17, 2020 at 19:50

1 Answer

Maybe it could be that my code is buggy. [...]. It does do the sorting well, without race conditions.

Actually, yes, you should check the results more closely: an insertion sort cannot be parallelized by just adding an omp for and an omp critical. The current code is not working correctly, as the insertions can produce wrong results.

Let us take a step back to understand why it is not so simple. Indeed, we are trying to insert multiple elements in parallel. However, an insertion requires moving the elements of the array that are greater, and moving those elements can hardly be done in parallel (at least not efficiently). Putting the while loop in a critical section would make the whole sort sequential. So we need to find another way to parallelize the code; more precisely, we need to find independent work that can run in parallel.

One solution is to look for the positions of the elements to insert in parallel, then perform a barrier, and then move the elements in parallel. However, this solution is not very efficient, especially for small arrays (due to over-synchronization and small work granularity).

A much better solution is to divide (virtually) the array into N parts, perform an independent insertion sort on each part in parallel, then make a barrier, and finally merge the sorted parts.
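
For illustration, here is a minimal sketch of that idea, assuming a helper that insertion-sorts a sub-range and using std::inplace_merge for the merge step. This is an illustrative sketch (compiled with -fopenmp), not the answerer's implementation, and not a drop-in replacement for the code in the question:

#include <algorithm>
#include <cstddef>
#include <vector>

using ull = unsigned long long;

// Plain serial insertion sort on the half-open range [lo, hi).
void insertionSortRange(ull *a, std::size_t lo, std::size_t hi)
{
    for (std::size_t i = lo + 1; i < hi; i++)
    {
        ull key = a[i];
        std::size_t j = i;
        while (j > lo && a[j - 1] > key)
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

void parallelInsertionSort(ull *a, std::size_t n, int chunks)
{
    // Chunk boundaries: chunk c covers [bounds[c], bounds[c + 1]).
    std::vector<std::size_t> bounds(chunks + 1);
    for (int c = 0; c <= chunks; c++)
        bounds[c] = n * c / chunks;

    // Sort each chunk independently; the implicit barrier at the end of the
    // parallel for guarantees all chunks are sorted before the merge starts.
#pragma omp parallel for schedule(static)
    for (int c = 0; c < chunks; c++)
        insertionSortRange(a, bounds[c], bounds[c + 1]);

    // Merge the sorted chunks one by one (kept serial here for simplicity).
    for (int c = 1; c < chunks; c++)
        std::inplace_merge(a, a + bounds[c], a + bounds[c + 1]);
}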

Note that insertion sort is generally not efficient compared to other algorithms like quick sort or merge sort on reasonably large data. If you aim for high performance, it is probably wise to directly implement a parallel quick/merge sort (or even better: a parallel introsort).


2 Comments

Yes! I don't think I mentioned it in my original post, but the approach I went for in the pthread implementation (instead of OpenMP) was to break the array down into N parts, sort each part in parallel, join the threads, and then merge the arrays to give back a completely sorted array. How would you suggest I do this in OpenMP?
One way is to make a parallel loop iterating over 0..threadCount with a static schedule and call a function performing the insertion sort over a range in each loop iteration (#pragma omp parallel for schedule(static)). An implicit synchronization is performed after the loop. Another way (a bit more complex) is to recursively create OpenMP tasks up to a given limit and then merge the results (in the recursive calls). In such a case, tasks should be carefully awaited (using taskgroup or taskwait).
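
To make the second (task-based) suggestion concrete, here is a rough sketch that reuses the hypothetical insertionSortRange helper from the earlier sketch. Again, this is an illustrative sketch under those assumptions, not the commenter's implementation:

#include <algorithm>
#include <cstddef>

using ull = unsigned long long;

// Serial insertion sort over [lo, hi), as defined in the earlier sketch.
void insertionSortRange(ull *a, std::size_t lo, std::size_t hi);

static void sortTask(ull *a, std::size_t lo, std::size_t hi, std::size_t cutoff)
{
    if (hi - lo <= cutoff)
    {
        insertionSortRange(a, lo, hi);         // small range: sort it directly
        return;
    }

    std::size_t mid = lo + (hi - lo) / 2;

    // Sort the two halves as independent tasks ...
#pragma omp task
    sortTask(a, lo, mid, cutoff);
#pragma omp task
    sortTask(a, mid, hi, cutoff);
    // ... wait for both of them, then merge the sorted halves.
#pragma omp taskwait
    std::inplace_merge(a + lo, a + mid, a + hi);
}

void parallelSort(ull *a, std::size_t n)
{
#pragma omp parallel                 // create the thread team
#pragma omp single                   // a single thread spawns the root task
    sortTask(a, 0, n, 2048);         // cutoff chosen arbitrarily for the sketch
}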
