
I've been testing parallel sorting using OpenMP. I implemented an odd-even sort that is 3x faster than the version without OpenMP. However, std::sort is still far faster: sequential - 100 s, parallel - 20 s, std::sort - 0.05 s.

I tried moving the #pragma omp parallel for to the i-loop, but that performed even worse and did not even sort the vector correctly:

for (int i = 0; i < 100000; i++) {
    // Alternate between odd and even phases; compare and swap adjacent pairs.
    #pragma omp parallel for
    for (int j = (i % 2) ? 0 : 1; j < 100000 - 1; j += 2) {
        if (vec_[j] > vec_[j + 1]) {
            std::swap(vec_[j], vec_[j + 1]);
        }
    }
}

Tbh, I expected the parallel odd-even sort to be the fastest, but now I wonder: am I doing something wrong, or is std::sort just that efficient?

  • How many threads? If the sequential version takes only 0.05 seconds I would not expect any speedup from adding more threads; I'd rather expect to see the overhead coming from using threads.
  • Is the complexity of your algorithm logarithmic? From the surface, it looks like a plain old (slow, O(n*n)) bubble sort with a few tweaks. std::sort is O(n log n) in nature.
  • If your O(n^2) algorithm takes 100 [s] when run single-threaded, you would need 2000 threads to reduce it to 0.05 [s]. In the ideal case with perfect linear speedup.
  • Also the fact that you're testing with 100k elements... Maybe bump it up to 100M elements and you'll have a significant enough workload to see a difference.
  • Here is a plot that shows the difference in behaviour between n*n/8 and n*log(n).

1 Answer


Your algorithm is O(n^2) in total work done. Using k threads, that at best becomes O(n^2/k).

std::sort is O(n lg n). Unless k is on the order of n / lg n, adding more threads won't keep up.
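For scale, some back-of-envelope numbers I'm adding here (not from the question): with n = 100,000, n^2 is roughly 10^10 element comparisons, while n lg n is roughly 1.7 million, a ratio of a few thousand to one. That is the same ballpark as the observed 100 s vs 0.05 s gap, with constant factors accounting for the rest.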

And even if you did have piles of threads, NUMA on most mega-thread systems means that your memory is going to be a serious pain. The threads don't access the same memory in each cycle; in fact, they constantly pass data back and forth.

One approach that might actually beat a plain single-threaded std::sort is to split the data into K pieces, std::sort each piece on its own thread, then do a parallel merge of the sorted pieces:

// Needs <algorithm> and <omp.h>; vec_ is the std::vector from the question.
int data_count = 100000;
// Boundary of block i when the data is split into n roughly equal blocks.
auto get_block_edge = [data_count](int i, int n) {
  return data_count * i / n;
};
int blocks = 0;
#pragma omp parallel
{
  blocks = omp_get_num_threads();   // every thread writes the same value
  int block = omp_get_thread_num();
  int start = get_block_edge( block, blocks );
  int finish = get_block_edge( block+1, blocks );
  // Each thread sorts its own contiguous block independently.
  std::sort( std::begin(vec_)+start, std::begin(vec_)+finish );
}

Now we have a bunch of sorted blocks; you just need to merge them:

// Repeatedly merge pairs of adjacent sorted runs, doubling the run width
// each pass until a single sorted range remains.
for (int merge_step = 1; merge_step < blocks; merge_step *= 2 )
{
  // Round up so a trailing group is still merged when blocks is not a
  // power of two.
  int groups = (blocks + 2*merge_step - 1) / (2*merge_step);
  #pragma omp parallel for
  for (int i = 0; i < groups; ++i) {
    // Clamp the block indices so the last group never runs past the end.
    int start = get_block_edge(i*2*merge_step, blocks);
    int mid = get_block_edge(std::min(i*2*merge_step+merge_step, blocks), blocks);
    int finish = get_block_edge(std::min(i*2*merge_step+2*merge_step, blocks), blocks);
    auto b = std::begin(vec_);
    std::inplace_merge( b+start, b+mid, b+finish );
  }
}

I think that should be a highly parallel merge. Or, more likely, it will segfault because I have an off-by-one error somewhere.

Now, this is still O(n) even with an infinite number of threads, as the very last merge has to be single-threaded. Getting around that is, to put it mildly, tricky.
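The usual trick for splitting a single merge across threads is to bisect the left run, binary-search where that value lands in the right run, and merge the two halves concurrently. A rough, untested sketch of that idea for two threads follows; the name two_way_parallel_merge and the auxiliary out buffer are just for illustration (it merges into a scratch buffer rather than in place):

#include <algorithm>
#include <omp.h>
#include <vector>

// Merge the sorted runs [start, mid) and [mid, finish) of v into out, using
// two threads: split the left run at its midpoint, find the matching split
// point in the right run, and merge the two halves independently.
void two_way_parallel_merge(const std::vector<int>& v,
                            int start, int mid, int finish,
                            std::vector<int>& out) {
  int left_split = start + (mid - start) / 2;
  int right_split = std::lower_bound(v.begin() + mid, v.begin() + finish,
                                     v[left_split]) - v.begin();
  // Everything before the two split points is <= everything after them,
  // so the two half-merges can run concurrently.
  int out_split = (left_split - start) + (right_split - mid);
  #pragma omp parallel sections
  {
    #pragma omp section
    std::merge(v.begin() + start, v.begin() + left_split,
               v.begin() + mid, v.begin() + right_split,
               out.begin() + start);
    #pragma omp section
    std::merge(v.begin() + left_split, v.begin() + mid,
               v.begin() + right_split, v.begin() + finish,
               out.begin() + start + out_split);
  }
}

Generalising to more threads repeats the splitting recursively, which is where it gets fiddly.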


7 Comments

For this you will probably have to increase the size of the dataset to see the parallel gain since 0.05 seconds (time to sort the whole thing using std::sort) is already small.
@drescherjm 1/20th of a second is an eon for pure data manipulation. I'd be surprised if we couldn't make it faster with careful use of threads. Maybe not the code above, but still.
I was thinking of the cost to create the threads. Although this (at least some of the answers) tells me I don't have to worry so much about that: stackoverflow.com/questions/3929774/…
@drescherjm There is nothing requiring threads to be created there. Idle threads in a thread pool could be used (if not by OpenMP (I cannot guarantee what OpenMP does), then by a different threaded sort implementation). There is a required inter-thread communication overhead, but that is small on the scale of 0.05 seconds.
@nowox yep - n and i were backwards. Swapped them. Dyslexic math. Regardless, there are likely to be other problems, as I wrote a parallel sorting algorithm in my head using unfamiliar threading primitives without compiling or testing it. Do not treat it as safe code! It is pseudo-code that looks like C++ OMP code.