
I'm tinkering with OpenMP and trying to implement a parallelized version of quicksort.
I have implemented four versions: one that always takes the first element as pivot, a parallelized version of it, one that randomizes the pivot by choosing the median of three random elements, and a parallelized version of that.
What bothers me is that I get a good speedup with the first parallelization, whereas the second (despite being parallelized in the same way) is slower than its sequential counterpart.
In both cases I parallelize only the first invocation of the function. I know that I could parallelize deeper in the tree of recursions, but the point is that I expected the same speedup from the two parallelizations.

Here's the code snippet for the "naive" (no randomization) partitioning function:

int partition(vector<int>& v, int p, int q){
  int x = v[p];                 // pivot: always the first element
  int i = p;                    // end of the "<= pivot" region
  for(int j = p+1; j <= q; j++){
    if(v[j] <= x){
      i++;
      swap(v[i], v[j]);         // grow the "<= pivot" region
    }
  }
  swap(v[i], v[p]);             // place the pivot at its final position
  return i;
}

This is the randomized partitioning function:

int rand_median(const vector<int>& v, int p, int q){
  // Sample three random indices in the inclusive range [p, q].
  int n1 = (rand() % (q - p + 1)) + p;
  int n2 = (rand() % (q - p + 1)) + p;
  int n3 = (rand() % (q - p + 1)) + p;
  // Return the index whose value is the median of the three.
  if((v[n1] <= v[n2] && v[n1] >= v[n3]) || (v[n1] <= v[n3] && v[n1] >= v[n2])) return n1;
  else if ((v[n2] <= v[n1] && v[n2] >= v[n3]) || (v[n2] <= v[n3] && v[n2] >= v[n1])) return n2;
  else return n3;
}

int rand_partition(vector<int>& v, int p, int q){
  int pivot = rand_median(v,p,q);
  swap(v[p], v[pivot]);         // move the randomized pivot to the front
  int x = v[p];                 // then partition exactly as before
  int i = p;
  for(int j = p+1; j <= q; j++){
    if(v[j] <= x){
      i++;
      swap(v[i], v[j]);
    }
  }
  swap(v[i], v[p]);
  return i;
}
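
Since everything after the initial swap is identical to partition, the function could equally well delegate to it; a minimal equivalent sketch:

int rand_partition(vector<int>& v, int p, int q){
  swap(v[p], v[rand_median(v,p,q)]);  // randomized pivot to the front
  return partition(v,p,q);            // then reuse the naive partition
}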

Naive quicksort:

void quicksort(vector<int>& v, int s, int e){
  if(s >= e) return;
  int p = partition(v,s,e);
  quicksort(v,s,p-1);
  quicksort(v,p+1,e);
}

Parallelized naive quicksort:

void quick_par(vector<int>& v, int s, int e){
  if(s >= e) return;
  int p = partition(v,s,e);
  omp_set_num_threads(2);
#pragma omp parallel sections
  {
    #pragma omp section
    quicksort(v,s,p-1);
    #pragma omp section
    quicksort(v,p+1,e);
  }
}
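
For reference, the deeper parallelization mentioned above might use OpenMP tasks rather than sections; this is only a sketch (quick_task and the 1000-element cut-off are illustrative choices, not part of my benchmarked code):

void quick_task(vector<int>& v, int s, int e){
  if(s >= e) return;
  if(e - s < 1000){             // cut-off: avoid task overhead on small ranges
    quicksort(v,s,e);
    return;
  }
  int p = partition(v,s,e);
#pragma omp task shared(v)      // sort each half in its own task
  quick_task(v,s,p-1);
#pragma omp task shared(v)
  quick_task(v,p+1,e);
#pragma omp taskwait            // wait for both halves before returning
}

// Invoked once from a parallel region:
// #pragma omp parallel
// #pragma omp single
// quick_task(v, 0, (int)v.size() - 1);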

Randomized quicksort:

void quick_rand(vector<int>& v, int s, int e){
  if(s >= e) return;
  int p = rand_partition(v,s,e);
  quick_rand(v,s,p-1);
  quick_rand(v,p+1,e);
}

Parallelized randomized quicksort:

void quick_par_rand(vector<int>& v, int s, int e){
  if(s >= e) return;
  int p = rand_partition(v,s,e);
  omp_set_num_threads(2);

#pragma omp parallel sections
  {
    #pragma omp section
    quick_rand(v,s,p-1);
    #pragma omp section
    quick_rand(v,p+1,e);
  }
}

And here are the results obtained with Google Benchmark:

bench_ser       887282457 ns    887038659 ns           10 //naive quicksort
bench_par        10738723 ns     10734826 ns           70 //parallelized naive
bench_rand         613904 ns       613686 ns         1039 //randomized quicksort
bench_par_rand    3249751 ns      3248460 ns          213 //parallelized randomized
bench_sort         106110 ns       106074 ns         5952 //std::sort

As you can see, the parallelized randomized version is slower than the sequential one. Here's the pastebin of the whole code I've used.

1 Answer

The parallel version bench_par_rand is incorrect: it uses rand, which is not thread-safe. Calling it from several threads is a race condition. As a result, the values may not actually be random (a critical property for a fast quicksort), and the code gets much slower since the threads constantly contend to modify the shared internal state (the iterative seed) of the rand function. Consider using the C++11 random number generators if possible (one per thread).

A quick workaround is to combine thread_local storage with the C++11 random number generators and replace every rand() call with safe_rand(). Here is an example:

#include <cstdlib>     // RAND_MAX
#include <functional>  // std::bind
#include <random>      // std::mt19937, std::uniform_int_distribution

thread_local std::uniform_int_distribution<int> distrib(0, RAND_MAX);
thread_local std::mt19937 rdGen;
thread_local auto safe_rand = std::bind(distrib, rdGen);

Using local variables rather than global thread_local ones (and a uniform_int_distribution over the exact range rather than a modulus) improves performance further, although it is a bit more tedious to do.
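
For illustration, the local-variable variant might look like the sketch below (the extra generator parameter and the seeding are my own illustrative choices):

#include <random>
#include <vector>

// Sketch: each thread creates its own generator and passes it down the
// recursion, so there is no shared state and no modulo bias.
int rand_median(const vector<int>& v, int p, int q, std::mt19937& gen){
  std::uniform_int_distribution<int> dist(p, q);   // inclusive range [p, q]
  int n1 = dist(gen), n2 = dist(gen), n3 = dist(gen);
  if((v[n1] <= v[n2] && v[n1] >= v[n3]) || (v[n1] <= v[n3] && v[n1] >= v[n2])) return n1;
  else if((v[n2] <= v[n1] && v[n2] >= v[n3]) || (v[n2] <= v[n3] && v[n2] >= v[n1])) return n2;
  else return n3;
}

// Each OpenMP section then seeds its own generator, e.g.:
// std::mt19937 gen(std::random_device{}());
// quick_rand(v, s, p-1, gen);   // quick_rand extended to forward the generator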

Here are the timings:

Base version:
 - bench_rand: 582499 ns
 - bench_par_rand: 2765300 ns

Fixed version with thread_local:
 - bench_rand: 1109800 ns
 - bench_par_rand: 737799 ns

Fixed version with local variables:
 - bench_rand: 798699 ns
 - bench_par_rand: 572300 ns

The last fixed parallel version is much faster! However, the fixed sequential versions are also slower than before; I think this is due to the slower random generator. Finally, there is no cut-off in your code (switching from quicksort to a faster algorithm for small arrays), so the cost of the random generator is heavily exposed.
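
For reference, such a cut-off is only a few lines; a sketch (quick_rand_cutoff and the threshold of 32 are illustrative, not tuned):

#include <algorithm>

// Sketch: below a small threshold, a simple sort is cheaper than further
// partitioning, and the random generator is not called at all.
void quick_rand_cutoff(vector<int>& v, int s, int e){
  if(s >= e) return;
  if(e - s < 32){                                   // illustrative threshold
    std::sort(v.begin() + s, v.begin() + e + 1);    // or an insertion sort
    return;
  }
  int p = rand_partition(v,s,e);
  quick_rand_cutoff(v,s,p-1);
  quick_rand_cutoff(v,p+1,e);
}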
