
I have some code that is parallelized using OpenMP (on a for loop). I now want to repeat the functionality several times and use MPI to submit to a cluster of machines, keeping all of the intra-node work as OpenMP.

When I only use OpenMP, I get the speedup I expect (using twice the number of processors/cores finishes in half the time). When I add in MPI and submit to only one MPI process, I do not get this speedup. I created a toy problem to check this and still have the same issue. Here is the code:

#include <iostream>
#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

#include <omp.h>


int main(int argc, char *argv[]) {
    int iam=0, np = 1;
    long i;
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    double t1 = MPI_Wtime();
    std::cout << "!!!Hello World!!!" << std::endl; // prints !!!Hello World!!!

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    int nThread = omp_get_num_procs();//omp_get_num_threads here returns 1??
    printf("nThread = %d\n", nThread);

    int *total = new int[nThread];
    for (int j=0;j<nThread;j++) {
        total[j]=0;
    }
#pragma omp parallel num_threads(nThread) default(shared) private(iam, i)
    {
        np = omp_get_num_threads();

#pragma omp for schedule(dynamic, 1)
        for (i=0; i<10000000; i++) {
            iam = omp_get_thread_num();
            total[iam]++;
        }
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
                iam, np, rank, numprocs,processor_name);
    }

    int grandTotal=0;
    for (int j=0;j<nThread;j++) {
        printf("Total=%d\n",total[j]);
        grandTotal += total[j];
    }
    printf("GrandTotal= %d\n", grandTotal);

    MPI_Finalize();

    double t2 = MPI_Wtime();

    printf("time elapsed with MPI clock=%f\n", t2-t1);
    return 0;
}

I am compiling with openmpi-1.8/bin/mpic++, using the -fopenmp flag. Here is my PBS script:

#PBS -l select=1:ncpus=12

setenv OMP_NUM_THREADS 12

/util/mpi/openmpi-1.8/bin/mpirun -np 1 -hostfile $PBS_NODEFILE --map-by node:pe=$OMP_NUM_THREADS /workspace/HelloWorldMPI/HelloWorldMPI

I have also tried with #PBS -l nodes=1:ppn=12 and get the same results.

When using half the cores, the program is actually faster (twice as fast!). When I reduce the number of cores, I change both ncpus and OMP_NUM_THREADS. I have tried increasing the actual work (adding 10^10 numbers instead of the 10^7 shown here in the code). I have tried removing the printf statements, wondering if they were somehow slowing things down, and still have the same problem. top shows that I am using all the CPUs (as set in ncpus) at close to 100%. If I submit with -np 2, it parallelizes beautifully on two machines, so MPI seems to be working as expected, but OpenMP is broken.

I am out of ideas now; is there anything else I can try? What am I doing wrong?

6 Comments
  • I am surprised that you are getting any speedup at all. schedule(dynamic,1) is the worst possible choice of loop schedule in that case, and constantly writing to neighbouring elements of an array from multiple threads results in lots of cache thrashing due to false sharing. That's probably why it runs faster on 6 cores than on 12 (note: one cache line on x86 fits 16 integers). Commented Dec 4, 2014 at 19:13
  • Also, having num_threads(omp_get_num_procs()) effectively makes setting OMP_NUM_THREADS useless, as your parallel regions will always run with as many threads as there are logical CPUs (e.g. cores, possibly times two with hyperthreading). You won't actually see this in top since --map-by node:pe=X binds your MPI process to the first X CPUs on the node. Commented Dec 4, 2014 at 19:19
  • Each thread should be writing to its own array, not to the same array. This is what I want to do; please point out where I am missing the logic. Make as many arrays as there are logical CPUs (omp_get_num_procs()). Then, on each thread, do some work (adding, in this case) on an array specific to that thread (total[ithread]). Commented Dec 9, 2014 at 13:53
  • I would like to submit more than one MPI job to a single machine. Let us say I have 16 cores; then I would like to have 2 MPI jobs, each using 8 cores, and each job should parallelize over 8 OMP threads. Commented Dec 9, 2014 at 13:54
  • You want two MPI processes per node or two MPI jobs per node? Those are two pretty different things. Also, I didn't quite get your first comment about the arrays. Can't you simply use private arrays? Commented Dec 9, 2014 at 22:03

1 Answer


I hate to say it, but there's a lot wrong and you should probably just familiarize yourself more with OpenMP and MPI. Nevertheless, I'll try to go through your code and point out the errors I saw.

double t1 = MPI_Wtime();

Starting out: calling MPI_Wtime() before MPI_Init() is undefined. I'll also add that if you want to do this sort of benchmark with MPI, a good idea is to put an MPI_Barrier() before the call to MPI_Wtime() so that all the tasks enter the timed section at the same time.
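
As a minimal sketch of what I mean (my own example, not your code), the timing skeleton would look like this:

#include <cstdio>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);            // MPI_Wtime() is only well-defined after this

    MPI_Barrier(MPI_COMM_WORLD);       // line all ranks up before starting the clock
    double t_start = MPI_Wtime();

    // ... the work being timed goes here ...

    MPI_Barrier(MPI_COMM_WORLD);       // wait for the slowest rank before stopping
    double t_end = MPI_Wtime();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("elapsed: %f s\n", t_end - t_start);

    MPI_Finalize();                    // and no MPI_Wtime() after this
    return 0;
}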

//omp_get_num_threads here returns 1??

The reason why omp_get_num_threads() returns 1 is that you are not in a parallel region.
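
As a quick sanity check (a separate toy snippet, not taken from your code): omp_get_num_threads() reports the size of the current team, which is 1 outside a parallel region, while omp_get_max_threads() is the call that reflects OMP_NUM_THREADS:

#include <cstdio>
#include <omp.h>

int main()
{
    // outside any parallel region the "team" is just the initial thread
    printf("outside: num_threads = %d, max_threads = %d\n",
           omp_get_num_threads(), omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("inside:  num_threads = %d\n", omp_get_num_threads());
    }
    return 0;
}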

#pragma omp parallel num_threads(nThread)

You set num_threads to nThread here, which, as Hristo Iliev mentioned, effectively ignores any input through the OMP_NUM_THREADS environment variable. You can usually just leave num_threads out and be OK for this sort of simplified problem.

default(shared)

The default data-sharing attribute for variables in a parallel region is shared, so there's no reason to have default(shared) here.

private(iam, i)

I guess it's your coding style, but instead of making iam and i private, you could just declare them within the parallel region, which will automatically make them private (and considering you don't really use them outside of it, there's not much reason not to).
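
A minimal sketch of that alternative (again, my own example rather than your code):

#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        // declared inside the region, so each thread automatically gets
        // its own copy; no private(iam) clause is needed
        int iam = omp_get_thread_num();
        printf("thread %d has its own iam\n", iam);
    }
    return 0;
}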

#pragma omp for schedule(dynamic, 1)

Also, as Hristo Iliev mentioned, using schedule(dynamic, 1) for this problem in particular is not the best of ideas, since each iteration of your loop takes virtually no time and the total problem size is fixed.

int grandTotal=0;
for (int j=0;j<nThread;j++) {
    printf("Total=%d\n",total[j]);
    grandTotal += total[j];
}

Not necessarily an error, but your allocation of the total array and the summation at the end are better accomplished using an OpenMP reduction clause.
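
As a standalone sketch of what that looks like (my own example): each thread works on a private copy of total, and the private copies are summed when the loop ends, so no per-thread array is needed.

#include <cstdio>
#include <omp.h>

int main()
{
    long total = 0;

    // each thread gets a private total (initialized to 0 for +), and the
    // private copies are added together at the end of the loop
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < 10000000; ++i)
        total++;

    printf("total = %ld\n", total);   // always 10000000, no shared-array updates
    return 0;
}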

double t2 = MPI_Wtime();

Similar to the call before MPI_Init(), calling MPI_Wtime() after you've called MPI_Finalize() is undefined and should be avoided if possible.

Note: If you are somewhat familiar with what OpenMP is, this is a good reference and basically everything I explained here about OpenMP is in there.

With that out of the way, I have to note you didn't actually do anything with MPI besides output the rank and comm size. Which is to say, all the MPI tasks do a fixed amount of work each, regardless of the number of tasks. Since there's no decrease in work-per-task for an increasing number of MPI tasks, you wouldn't expect any scaling, would you? (Note: this is actually what's called weak scaling, but since you have no communication via MPI, there's no reason to expect it not to scale perfectly.)

Here's your code rewritten with some of the changes I mentioned:

#include <iostream>
#include <cstdio>
#include <cstdlib>

#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_size,
        world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int name_len;
    char proc_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Get_processor_name(proc_name, &name_len);

    MPI_Barrier(MPI_COMM_WORLD);
    double t_start = MPI_Wtime();

    // we need to scale the work per task by the number of MPI tasks,
    // otherwise we actually do more work the more tasks we have
    const int n_iterations = 1e7 / world_size;

    // actually we also need some dummy data to add so the compiler doesn't just
    // optimize out the work loop with -O3 on
    int data[16];
    for (int i = 0; i < 16; ++i)
        data[i] = rand() % 16;

    // reduction(+:total) means that all threads will make a private
    // copy of total at the beginning of this construct and then
    // do a reduction operation with the + operator at the end (aka sum them
    // all together)
    unsigned int total = 0;
    #pragma omp parallel reduction(+:total)
    {
        // both of these calls will execute properly since we
        // are in an omp parallel region
        int n_threads = omp_get_num_threads(),
            thread_id = omp_get_thread_num();

        // note: this code will only execute on a single thread (per mpi task)
        #pragma omp master
        {
            printf("nThread = %d\n", n_threads);
        }

        #pragma omp for
        for (int i = 0; i < n_iterations; i++)
            total += data[i % 16];

        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
                thread_id, n_threads, world_rank, world_size, proc_name);
    }

    // do a reduction with MPI, otherwise the data we just calculated is useless
    unsigned int grand_total;
    MPI_Allreduce(&total, &grand_total, 1, MPI_UNSIGNED, MPI_SUM, MPI_COMM_WORLD);

    // another barrier to make sure we wait for the slowest task
    MPI_Barrier(MPI_COMM_WORLD);
    double t_end = MPI_Wtime();

    // output each MPI task's total (summed over its threads)
    printf("Task total = %u\n", total);

    // output results from a single thread
    if (world_rank == 0)
    {
        printf("Grand Total = %d\n", grand_total);
        printf("Time elapsed with MPI clock = %f\n", t_end - t_start);
    }

    MPI_Finalize();
    return 0;
}

Another thing to note: my version of your code executed 22 times slower with schedule(dynamic, 1) added, just to show how much it can impact performance when used incorrectly.

Unfortunately I'm not too familiar with PBS, as the clusters I use run SLURM, but an example sbatch file for a job running on 3 nodes, on a system with two 6-core processors per node, might look something like this:

#!/bin/bash
#SBATCH --job-name=TestOrSomething
#SBATCH --export=ALL
#SBATCH --partition=debug
#SBATCH --nodes=3
#SBATCH --ntasks-per-socket=1

# use 6 OpenMP threads per MPI task here
export OMP_NUM_THREADS=6

# note that this will end up running 3 * (however many sockets
#   are on a single node) MPI tasks, not just 3. Additionally,
#   the line below might use `mpirun` instead depending on the
#   cluster
srun ./a.out
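
Since you're on PBS, a rough translation of the same idea might look like the script below. I haven't tested this (I mostly use SLURM), so treat it as a sketch: it assumes PBS Pro select-style resource requests and the Open MPI mpirun from your question, and the exact directives may differ on your cluster.

#!/bin/bash
#PBS -N TestOrSomething
# 3 chunks, 12 cores each, 2 MPI ranks per chunk (one rank per 6-core socket)
#PBS -l select=3:ncpus=12:mpiprocs=2

# 6 OpenMP threads per MPI rank
export OMP_NUM_THREADS=6

cd $PBS_O_WORKDIR
# 6 ranks in total, each bound to the 6 cores of one socket for its OpenMP threads
/util/mpi/openmpi-1.8/bin/mpirun -np 6 --map-by socket:pe=$OMP_NUM_THREADS ./a.out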

For fun, I also just ran my version on a cluster to test the scaling for MPI and OMP, and got the following (note the log scales):

[Figure: scaling (time) for the example code; it's basically perfect.]

As you can see, it's basically perfect. Actually, 1-16 is 1 MPI task with 1-16 OMP threads, and 16-256 is 1-16 MPI tasks with 16 threads per task, so you can also see that there's no change in behavior between the MPI scaling and the OMP scaling.


2 Comments

Thanks, I will try this out. Clearly this is a toy problem, and I need the private(iam, i) for a different application which uses iam and i outside the OpenMP loop and does things a lot more complicated than adding up a bunch of numbers. Similar reason for not using an omp parallel reduction. But I will try the other suggestions.
@Anu If none of the suggestions end up working on the actual problem, I'd try to simplify the real code as much as you can and post another question; sometimes it's just hard to reproduce the issue in a toy problem.
