
I have an implementation of the parallel bubble sort algorithm (odd-even transposition sort) in C, using OpenMP. However, when I tested it, it was slower than the serial version (by about 10%), even though I have a 4-core processor (2 physical cores x 2 due to Intel hyperthreading). I checked whether the cores are actually used, and I can see each of them at 100% while the program runs. Therefore I think I made a mistake in the implementation of the algorithm.

I am using Linux with kernel 2.6.38-8-generic.

This is how I compile:

gcc -o bubble-sort bubble-sort.c -Wall -fopenmp

or, for the serial version:

gcc -o bubble-sort bubble-sort.c -Wall
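
As a sanity check (this snippet is separate from the program above; the printout is just one way the build could be verified), it is easy to confirm at run time that a binary was really built with OpenMP enabled, since the pragmas are silently ignored otherwise:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    /* _OPENMP is only defined when compiling with -fopenmp */
    printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
#else
    printf("OpenMP disabled: all omp pragmas are ignored\n");
#endif
    return 0;
}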

This is how I run it:

./bubble-sort < in_10000 > out_10000

#include <omp.h>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main()
{
    int i, n, tmp, *x, changes;
    int chunk;

    scanf("%d ", &n);
    chunk = n / 4;                    /* currently unused */
    x = (int*) malloc(n * sizeof(int));
    for(i = 0; i < n; ++i)
        scanf("%d ", &x[i]);

    changes = 1;
    int nr = 0;                       /* counts the passes over the array */
    while(changes)
    {
        #pragma omp parallel private(tmp)
        {
            /* reset the bookkeeping once per pass, not once per thread,
               to avoid racing with the reductions below */
            #pragma omp single
            {
                nr++;
                changes = 0;
            }

            /* even phase: compare/swap the pairs (0,1), (2,3), ... */
            #pragma omp for reduction(+:changes)
            for(i = 0; i < n - 1; i = i + 2)
            {
                if(x[i] > x[i+1])
                {
                    tmp = x[i];
                    x[i] = x[i+1];
                    x[i+1] = tmp;
                    ++changes;
                }
            }

            /* odd phase: compare/swap the pairs (1,2), (3,4), ... */
            #pragma omp for reduction(+:changes)
            for(i = 1; i < n - 1; i = i + 2)
            {
                if(x[i] > x[i+1])
                {
                    tmp = x[i];
                    x[i] = x[i+1];
                    x[i+1] = tmp;
                    ++changes;
                }
            }
        }
    }

    free(x);
    return 0;
}

Later edit:

It seems to work well now, after I made the changes you suggested. It also scales pretty well (I tested it on 8 physical cores too; it took 21 s for a set of 150k numbers, which is far less than on one core). However, if I set the OMP_SCHEDULE environment variable myself, the performance decreases...
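
(A note on that last point, added here rather than taken from the original post: OMP_SCHEDULE normally only affects loops that use schedule(runtime), so one way to take the environment variable out of the picture is to fix the schedule in the source. The choice of static below is an assumption, not a measured recommendation.)

/* Sketch: pin the schedule explicitly so that OMP_SCHEDULE no longer
 * changes how iterations are distributed. schedule(static) hands each
 * thread one contiguous block of iterations, which suits uniform work
 * like this swap loop. */
#pragma omp for schedule(static) reduction(+:changes)
for(i = 0; i < n - 1; i = i + 2)
{
    if(x[i] > x[i+1])
    {
        tmp = x[i];
        x[i] = x[i+1];
        x[i+1] = tmp;
        ++changes;
    }
}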

  • It's pretty solidly on 1 core with my quad-core CPU. Also, what were you thinking, doing scanf in a parallel for loop? Commented Nov 3, 2011 at 15:47
  • Sigh. Forget about the 1-core thing: it's due to pragmas being silently ignored without the -fopenmp compile flag. Silly me. Commented Nov 3, 2011 at 15:54
  • I forgot to remove the parallel section around the scanf loop (I used an older version of the code). It's not how I tested. Commented Nov 3, 2011 at 16:14

1 Answer


You should profile it and check where threads spend time.
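
A quick first pass, assuming wall-clock numbers are enough before reaching for a real profiler, is to bracket just the sorting loop with omp_get_wtime() (a sketch against the code in the question, not something from the original answer):

/* In main(), right after the input has been read.
 * omp_get_wtime() returns elapsed seconds and needs -fopenmp;
 * a serial build could use clock_gettime(CLOCK_MONOTONIC, ...) instead. */
double t_start = omp_get_wtime();

/* ... the while(changes) { ... } odd-even passes from the question ... */

double t_end = omp_get_wtime();
fprintf(stderr, "sorting took %.3f s\n", t_end - t_start);

Printing to stderr keeps the timing out of the redirected output file.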

One possible reason is that parallel regions are constantly created and destroyed; depending on OpenMP implementation, it could lead to re-creation of the thread pool, though good implementations should probably handle this case.
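
If region re-creation does turn out to matter, one option (a sketch written against the code in the question, not benchmarked) is to enter the parallel region once, before the while loop, and keep the per-pass bookkeeping inside a single construct plus an explicit barrier:

#pragma omp parallel private(tmp)
{
    int more = 1;                 /* per-thread copy of the loop condition */
    while(more)
    {
        /* every thread must have taken its snapshot of 'changes' from the
           previous pass before one thread resets it */
        #pragma omp barrier

        #pragma omp single
        changes = 0;              /* implicit barrier at the end of single */

        #pragma omp for reduction(+:changes)
        for(i = 0; i < n - 1; i = i + 2)      /* even phase */
            if(x[i] > x[i+1]) {
                tmp = x[i]; x[i] = x[i+1]; x[i+1] = tmp;
                ++changes;
            }

        #pragma omp for reduction(+:changes)
        for(i = 1; i < n - 1; i = i + 2)      /* odd phase */
            if(x[i] > x[i+1]) {
                tmp = x[i]; x[i] = x[i+1]; x[i+1] = tmp;
                ++changes;
            }

        more = changes;           /* the reduction is complete after the loop's barrier */
    }
}

Since every thread computes the same value of more, they all leave the while loop in the same pass, so the barriers stay matched.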

Some small things to shave off:
- ok seems completely unnecessary; you can just change the loop exit condition to i < n-1;
- the explicit barrier is unnecessary: first, you put it outside the parallel regions, so it has no effect there; and second, OpenMP parallel regions and worksharing loops already have an implicit barrier at the end;
- combine at least the two consecutive parallel regions inside the while loop:

#pragma omp parallel private(tmp)
{
    #pragma omp for bla-bla
    for (i=0; i<n-1; i+=2 ) {...}

    #pragma omp for bla-bla
    for (i=1; i<n-1; i+=2 ) {...}
}