Context
Hello!
I was finishing up a short hands-on exercise with POSIX pthreads and OpenMP, comparing both against a serial implementation of Insertion Sort to see which one performs well on what kind of inputs.
The results I got were weird, to say the least.
Maybe my code is buggy. My pthread implementation certainly is and needs a lot more work, since it's a very lousy implementation. However, looking at the times for OpenMP and Serial, I saw very weird trends.
Code
Before I show you the graph of the times I got, here is the code for each.
For Serial,
static inline void *
insertionSort(void *arrayMetaDataToUnpack)
{
    // UNPACK START
    ull *_Array = ((InsertionSortPacking *)arrayMetaDataToUnpack)->Array;
    size_t sizeOfArray = ((InsertionSortPacking *)arrayMetaDataToUnpack)->ArraySize;
    // UNPACK END
    // Compulsory use of reinterpret_cast to keep everything consistent
    ull *__tempArray = reinterpret_cast<ull *>(_Array);
    for (int i = 1; i < sizeOfArray; i++)
    {
        // hold the element as ull (not int) so large values aren't truncated
        ull __temp = __tempArray[i];
        int j = i - 1;
        while (j >= 0 && __temp <= __tempArray[j])
        {
            __tempArray[j + 1] = __tempArray[j];
            j = j - 1;
        }
        __tempArray[j + 1] = __temp;
    }
    return __tempArray;
}
The OpenMP implementation,
static inline void *
openmp_insertionSort(void *arrayMetaDataToUnpack)
{
    // UNPACK START
    ull *_Array = ((InsertionSortPacking *)arrayMetaDataToUnpack)->Array;
    size_t sizeOfArray = ((InsertionSortPacking *)arrayMetaDataToUnpack)->ArraySize;
    // UNPACK END
    // Compulsory use of reinterpret_cast to keep everything consistent
    ull *__tempArray = reinterpret_cast<ull *>(_Array);
#pragma omp for
    for (int i = 1; i < sizeOfArray; i++)
    {
        // hold the element as ull (not int) so large values aren't truncated
        ull __temp = __tempArray[i];
        int j = i - 1;
        while (j >= 0 && __temp <= __tempArray[j])
        {
            __tempArray[j + 1] = __tempArray[j];
            j = j - 1;
        }
#pragma omp critical
        {
            __tempArray[j + 1] = __temp;
        }
    }
    return __tempArray;
}
Again, as a disclaimer, I'm new to OpenMP, and I might not have parallelized the code well, but this is all I know how to use at a beginner stage. It does sort correctly, without race conditions.
Inputs
So for the inputs, they were,
const int INPUTSIZE = 25;
int inputSizes[INPUTSIZE] = {1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000,
5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500,
10000, 10500, 11000, 11500, 12000, 12500, 13000};
My program uses clock_t to measure the time taken by each method for each input and writes the results to a CSV file.
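The CSV-writing code itself isn't shown here; a minimal sketch of how one row per input size might be appended (the column layout, helper name, and stream variable are my assumptions, not the actual program) would be:
#include <fstream>

// Hypothetical helper: append one CSV row with the measured times for one input size.
// The column order (size, serial, pthread, openmp, generation) is assumed.
void appendCsvRow(std::ofstream &csv, int inputSize, double serialTime,
                  double pthreadTime, double openmpTime, double generationTime)
{
    csv << inputSize << ',' << serialTime << ',' << pthreadTime << ','
        << openmpTime << ',' << generationTime << '\n';
}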
Graph
The final graph was generated from the CSV file mentioned above; it looks like this:
Where,
- X axis is the input size
- Y axis is the time taken in seconds
- Serial is the serial implementation
- Pthread is the pthread implementation
- OpenMP is the openMP implementation
- Time Taken ... is the time taken for my program to generate an inputSizes[i]-sized array
Machine Details
- Measured time using the logic below:
clock_t __start = clock();
Array = functionName((void *)&packingMetadata);
clock_t __end = clock();
double __inputTimeUsed = ((double)(__end - __start)) / CLOCKS_PER_SEC;
std::cout << "Overall time taken, using " << printStatement << __inputTimeUsed << ".\n";
Here functionName is a function pointer to either the serial, pthread, or OpenMP version, and packingMetadata is the packing struct passed to it, which consists of a pointer to an array created in main() and the array size.
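The InsertionSortPacking struct definition isn't shown in the post; judging from the members accessed in the code above, a minimal sketch (the typedef for ull is my assumption) would be:
#include <cstddef>

typedef unsigned long long ull; // assumed; the post never shows this typedef

// Packing struct inferred from the member accesses in the sort functions above.
struct InsertionSortPacking
{
    ull *Array;       // pointer to the array created in main()
    size_t ArraySize; // number of elements in that array
};

// Dispatch as described: functionName points at the serial, pthread, or OpenMP variant.
// void *(*functionName)(void *) = insertionSort;
// ull *Array = (ull *)functionName((void *)&packingMetadata);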
Measurements to reduce noise ... I'm not sure what exactly that means.
- Compilation command is g++ -std=gnu++17 -std=c++17 insertion.cpp -o insertion -lpthread
- Compiler is g++
- Machine info: Linux Kernel Version: 5.4.0-47-generic, OS Type: 64-bit, Processors: 8 × Intel® Core™ i7-8665U CPU @ 1.90GHz, Memory: 15.5 GiB of RAM
Problems / Questions
My pthread implementation is bad, got it; that's my fault and can be improved. But what about the OpenMP implementation? Why does that happen? Why are the timings so zigzagged?
I have many questions, but mainly:
- Why the spikes and randomness? And
- How do I go about seeing how a parallel implementation can be improved further?

-O2/-O3 plus I-don't-know-what flags that play well with OpenMP. You measured the runtime of a debug build, which is not representative of real runtimes.
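For illustration, an optimized build could be compiled along these lines (the exact flag choice is a suggestion; -fopenmp is the GCC switch that actually enables the OpenMP pragmas):
g++ -std=c++17 -O2 -fopenmp insertion.cpp -o insertion -lpthread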