I hate to say it, but there's a lot wrong and you should probably just familiarize
yourself more with OpenMP and MPI. Nevertheless, I'll try to go through your code
and point out the errors I saw.
double t1 = MPI_Wtime();
Starting out: calling MPI_Wtime() before MPI_Init() is undefined. I'll also add that if you
want to do this sort of benchmark with MPI, a good idea is to put an MPI_Barrier() before
the call to MPI_Wtime() so that all the tasks enter the timed section at the same time.
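In skeleton form, the ordering I mean looks like this (a minimal sketch, not your exact code):
#include <cstdio>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);                 // init first...
    MPI_Barrier(MPI_COMM_WORLD);            // ...then sync so every task starts the clock together
    double t_start = MPI_Wtime();

    // (the work you want to time goes here)

    MPI_Barrier(MPI_COMM_WORLD);            // wait for the slowest task
    double t_end = MPI_Wtime();             // stop the clock before MPI_Finalize()
    printf("elapsed: %f s\n", t_end - t_start);
    MPI_Finalize();
    return 0;
}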
//omp_get_num_threads here returns 1??
The reason why omp_get_num_threads() returns 1 is that you are not in a
parallel region.
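You can see this for yourself with a minimal sketch like the following (assuming OMP_NUM_THREADS is set to something greater than 1):
#include <cstdio>
#include <omp.h>

int main()
{
    // only one thread is executing out here, so this prints 1
    printf("outside: %d\n", omp_get_num_threads());
    #pragma omp parallel
    {
        // inside the region this prints the actual team size
        #pragma omp single
        printf("inside: %d\n", omp_get_num_threads());
    }
    return 0;
}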
#pragma omp parallel num_threads(nThread)
You set num_threads to nThread here, which, as Hristo Iliev mentioned, effectively
ignores any input through the OMP_NUM_THREADS environment variable. You can usually just
leave num_threads out and be fine for this sort of simplified problem.
default(shared)
Variables in a parallel region are shared by default, so there's
no reason to have default(shared) here.
private(iam, i)
I guess it's your coding style, but instead of making iam and i private, you could
just declare them within the parallel region, which will automatically make them private
(and considering you don't really use them outside of it, there's not much reason not to).
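Putting those three points together, the opening of the parallel region can be as simple as this sketch:
#pragma omp parallel   // no num_threads(...), no default(shared), no private(...) needed
{
    // declared inside the region, so automatically private to each thread
    int iam = omp_get_thread_num();
    int nThreads = omp_get_num_threads();
    printf("thread %d of %d\n", iam, nThreads);
}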
#pragma omp for schedule(dynamic, 1)
Also as Hristo Iliev mentioned, using schedule(dynamic, 1) for this problem set in particular
is not the best of ideas, since each iteration of your loop takes virtually no time
and the total problem size is fixed.
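If you want to spell out a schedule at all for a loop like this, static is the natural choice; a sketch, reusing the names from my rewrite further down:
// static hands each thread one big contiguous chunk up front, instead of
// paying scheduling overhead on every single (trivially cheap) iteration
#pragma omp parallel for schedule(static) reduction(+:total)
for (int i = 0; i < n_iterations; i++)
    total += data[i % 16];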
int grandTotal=0;
for (int j=0;j<nThread;j++) {
    printf("Total=%d\n",total[j]);
    grandTotal += total[j];
}
Not necessarily an error, but your allocation of the total array and the summation at the end
are better accomplished using OpenMP's reduction clause.
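As a sketch (again reusing the dummy-work loop from my rewrite below), the per-thread total array plus the final summation collapses into:
int grandTotal = 0;
// every thread works on a private copy of grandTotal; the copies are summed
// back into the shared variable when the loop construct ends
#pragma omp parallel for reduction(+:grandTotal)
for (int i = 0; i < n_iterations; i++)
    grandTotal += data[i % 16];
printf("Grand Total = %d\n", grandTotal);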
double t2 = MPI_Wtime();
Similar to the MPI_Init() issue above, calling MPI_Wtime() after you've
called MPI_Finalize() is undefined, and should be avoided if possible.
Note: If you are somewhat familiar with what OpenMP is, this
is a good reference and basically everything I explained here about OpenMP is in there.
With that out of the way, I have to note that you didn't actually do anything with MPI
besides output the rank and comm size. Which is to say, all the MPI tasks
do a fixed amount of work each, regardless of the number of tasks. Since there's
no decrease in work-per-task for an increasing number of MPI tasks, you wouldn't expect
to have any scaling, would you? (Note: this is actually what's called weak scaling, but since you have no communication via MPI, there's no reason to expect it to not
scale perfectly.)
Here's your code rewritten with some of the changes I mentioned:
#include <iostream>
#include <cstdio>
#include <cstdlib>

#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_size,
        world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int name_len;
    char proc_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Get_processor_name(proc_name, &name_len);

    MPI_Barrier(MPI_COMM_WORLD);
    double t_start = MPI_Wtime();

    // we need to scale the work per task by the number of MPI tasks,
    // otherwise we actually do more total work the more tasks we have
    const int n_iterations = 1e7 / world_size;

    // actually we also need some dummy data to add so the compiler doesn't just
    // optimize out the work loop with -O3 on
    int data[16];
    for (int i = 0; i < 16; ++i)
        data[i] = rand() % 16;

    // reduction(+:total) means that all threads will make a private
    // copy of total at the beginning of this construct and then
    // do a reduction operation with the + operator at the end (aka sum them
    // all together)
    unsigned int total = 0;
    #pragma omp parallel reduction(+:total)
    {
        // both of these calls will execute properly since we
        // are in an omp parallel region
        int n_threads = omp_get_num_threads(),
            thread_id = omp_get_thread_num();

        // note: this code will only execute on a single thread (per MPI task)
        #pragma omp master
        {
            printf("nThread = %d\n", n_threads);
        }

        #pragma omp for
        for (int i = 0; i < n_iterations; i++)
            total += data[i % 16];

        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               thread_id, n_threads, world_rank, world_size, proc_name);
    }

    // do a reduction with MPI, otherwise the data we just calculated is useless
    unsigned int grand_total;
    MPI_Allreduce(&total, &grand_total, 1, MPI_UNSIGNED, MPI_SUM, MPI_COMM_WORLD);

    // another barrier to make sure we wait for the slowest task
    MPI_Barrier(MPI_COMM_WORLD);
    double t_end = MPI_Wtime();

    // output each MPI task's total (after the OpenMP reduction)
    printf("Task total = %u\n", total);

    // output the combined results from a single task (rank 0)
    if (world_rank == 0)
    {
        printf("Grand Total = %u\n", grand_total);
        printf("Time elapsed with MPI clock = %f\n", t_end - t_start);
    }

    MPI_Finalize();
    return 0;
}
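For reference, I compiled and ran it along these lines (the file name is a placeholder, and wrapper/flag names can differ between MPI installations and compilers):
mpicxx -O3 -fopenmp hybrid.cpp -o a.out
export OMP_NUM_THREADS=4
mpirun -np 2 ./a.out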
Another thing to note: my version of your code executed 22 times slower with schedule(dynamic, 1) added, just to show how it can impact performance when used incorrectly.
Unfortunately I'm not too familiar with PBS, as the clusters I use run SLURM, but an example sbatch file for a job running on 3 nodes, on a system with two 6-core processors per node, might look something like this:
#!/bin/bash
#SBATCH --job-name=TestOrSomething
#SBATCH --export=ALL
#SBATCH --partition=debug
#SBATCH --nodes=3
#SBATCH --ntasks-per-socket=1
# set 6 OpenMP threads per MPI process here
export OMP_NUM_THREADS=6
# note that this will end up running 3 * (however many CPUs
# are on a single node) MPI tasks, not just 3. Additionally,
# the line below might use `mpirun` instead, depending on the
# cluster
srun ./a.out
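If your cluster runs Torque/PBS instead, an untested rough equivalent might be something like the following (the exact resource syntax varies between PBS flavours, so treat it as a sketch):
#!/bin/bash
#PBS -N TestOrSomething
#PBS -q debug
#PBS -l nodes=3:ppn=12
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=6

# 2 MPI tasks per node (one per socket), 6 OpenMP threads each;
# --map-by is Open MPI syntax, other MPI implementations have their own binding flags
mpirun -np 6 --map-by socket:pe=6 ./a.out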
For fun, I also just ran my version on a cluster to test the scaling for MPI and OMP, and got the following (note the log scales):
[scaling plot: log-log axes, 1 to 256 total threads]
As you can see, it's basically perfect. Actually, 1-16 is 1 MPI task with 1-16 OMP threads, and 16-256 is 1-16 MPI tasks with 16 threads per task, so you can also see that there's no change in behavior between the MPI scaling and the OMP scaling.
For reference, here is the comment from Hristo Iliev that I mentioned above:
schedule(dynamic,1) is the worst possible choice of loop schedule in that case, and constantly writing to neighbouring elements of an array from multiple threads results in lots of cache thrashing due to false sharing. That's probably why it runs faster on 6 cores than on 12 (note: one cache line on x86 fits 16 integers). num_threads(omp_get_num_procs()) effectively makes setting OMP_NUM_THREADS useless, as your parallel regions will always run with as many threads as the number of logical CPUs (e.g. cores, eventually times two with hyperthreading). You won't actually see this in top, since --map-by node:pe=X binds your MPI process to the first X CPUs on the node.