Below is a parallel task model using OpenMP. Each of proc1, ..., proc4 takes less than 1 ms to finish. In the development environment (2 physical CPUs, 16 cores in total) the job finishes without any problem.
But in the production environment (same machine, with about half of the cores running at 100% load), the code hits a large latency and consistently takes more than 10 ms.
It is definitely not an occasional phenomenon: the main program calls this code 100 times, and every time it takes more than 10 ms to finish.
const size_t ALIGNMENT = 8; // Align to 8 bytes
const size_t SIZE = num_feature * sizeof(union Entry);
union Entry* inst = aligned_alloc(ALIGNMENT, SIZE);
memset(inst, 0, SIZE); // Clear memory
omp_set_num_threads(4);
#pragma omp parallel
{
    /* init timer */
    #pragma omp single
    {
        #pragma omp task
        score1 = proc1(inst, 0); /* proc1 timer */
        #pragma omp task
        score2 = proc2(inst, 0); /* proc2 timer */
        #pragma omp task
        score3 = proc3(inst, 0); /* proc3 timer */
        #pragma omp task
        score4 = proc4(inst, 0); /* proc4 timer */
    }
} /* hanging for no reason? */
free(inst);
/* total timer */
I did thorough logging, and it shows that proc1, ..., proc4 each finish on time, within 1 ms, but the program then hangs at the end of the parallel region, possibly because the threads have trouble synchronizing at the barrier.
Many thanks in advance. I am totally lost, and my only fallback is the old sequential computation. Here is some logging.
In the production environment:
init time(ms): 0.016000
init from thread 0, pid 1038162, thread id 139639529867008, cpu 4
proc1 from thread 3, pid 1038162, thread id 139638764283648, cpu 12
proc2 from thread 4, pid 1038162, thread id 139638755890944, cpu 14
proc3 from thread 2, pid 1038162, thread id 139638772676352, cpu 13
proc4 from thread 15, pid 1038162, thread id 139637775398656, cpu 2
proc4 time(ms): 0.383000
proc3 time(ms): 0.501000
proc2 time(ms): 0.513000
proc1 time(ms): 0.582000
total time(ms): 11.867000
In the development environment:
inst time(ms): 0.004000
init from thread 0, pid 1059897, tid 140013452008448, cpu 5
proc1 from thread 2, pid 1059897, tid 140013211875072, cpu 3
proc2 from thread 3, pid 1059897, tid 140013203482368, cpu 0
proc4 from thread 0, pid 1059897, tid 140013452008448, cpu 5
proc3 from thread 1, pid 1059897, tid 140013220267776, cpu 10
proc4 time(ms): 0.095000
proc3 time(ms): 0.107000
proc2 time(ms): 0.115000
proc1 time(ms): 0.112000
total time(ms): 0.163000
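To separate the time spent in the tasks from the time spent waiting at the end of the parallel region, the measurement can be sketched roughly as below. This is only a minimal sketch under assumptions: dummy_proc is a hypothetical stand-in for proc1, ..., proc4, and now_ms, struct region_timing and run_region are names made up for illustration, not part of the real program. It records when each task body returns and when the parallel region is left, so the gap after the last task can be read off directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Monotonic wall-clock time in milliseconds (shared by all threads). */
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Hypothetical stand-in for proc1..proc4: a short CPU-bound loop. */
static double dummy_proc(int seed) {
    volatile double acc = 0.0;
    for (int i = 0; i < 100000; ++i)
        acc += (double)((i ^ seed) & 0xff);
    return acc;
}

struct region_timing {
    double start_ms;       /* just before the parallel region     */
    double task_end_ms[4]; /* when each task body returned         */
    double region_end_ms;  /* just after leaving the parallel region */
};

/* Runs four tasks and returns the time (ms) between the end of the
 * last task and the exit from the parallel region, i.e. the part
 * that is spent at the closing barrier rather than in the tasks. */
static double run_region(struct region_timing *t, double score[4]) {
    assert(t != NULL && score != NULL);
    t->start_ms = now_ms();
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        {
            for (int k = 0; k < 4; ++k) {
                #pragma omp task firstprivate(k)
                {
                    score[k] = dummy_proc(k);
                    t->task_end_ms[k] = now_ms();
                }
            }
        }
    } /* implicit barrier: all tasks are complete past this point */
    t->region_end_ms = now_ms();

    double last = t->task_end_ms[0];
    for (int k = 1; k < 4; ++k)
        if (t->task_end_ms[k] > last)
            last = t->task_end_ms[k];
    return t->region_end_ms - last;
}
```

Calling run_region repeatedly and logging the returned value next to region_end_ms - start_ms should show whether the extra 10 ms falls after the last task has returned. The sketch is meant to be built with OpenMP enabled (for example -fopenmp with GCC); without it the pragmas are ignored and the four calls simply run one after another.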