
I need to divide a nested loop evenly between MPI processes. I know how to do this when each iteration of i does roughly the same amount of work:

    mystart = (n / numproc) * myid;
    if (n % numproc > myid) {
        /* the first (n % numproc) ranks take one extra iteration */
        mystart += myid;
        myend = mystart + (n / numproc) + 1;
    } else {
        mystart += n % numproc;
        myend = mystart + (n / numproc);
    }

    for (i = mystart; i < myend; ++i)

But I need to calculate distances between each particle so the loops look like this:

* * * * * * * * * * * * * 
* * * * * * * * * * * * *
* * * * * * * * * * * *
* * * * * * * * * * *
* * * * * * * * * *
* * * * * * * * *
* * * * * * * *
* * * * * * *
* * * * * *
* * * * *
* * * *
* * * 
* * 
* 

But bigger. Each * is an operation. So if I were to use the previous algorithm, the last processes would finish their work while the first processes wouldn't even be halfway done, which isn't efficient. Is there some kind of algorithm I could use with MPI?

I thought that there may be a way to assign each iteration to the next process, like this: (i - iteration, p - process (out of 3))

 i1 -> p1, i2 -> p2, i3 -> p3, i4 -> p1, i5 -> p2 (and so on)

But I'm not sure whether it would work or how to implement it. If you have some ideas, I'd love to see them with an example of usage in MPI.

EDIT: @Zulan It'd be easier to explain with code. My code generates random particles (x, y, z) inside a box. All the information about those particles is known only to the process with rank == 0. I have to implement methods (links to code below) using MPI to make those computations faster.

I have to find the pair of particles with the smallest distance between them (so n*(n-1) ops), then remove that pair, and I have to do this NumberofParticlestoRemove times. This method is my main concern, since I'm not sure how to divide the loop.

Here's the source code: http://pastebin.com/XxBGd50j And here's my attempt (it's unfinished): http://pastebin.com/Yd8DwwfX

After computing the distances I figured I'd use MPI_Reduce to get the minimum distance, but I don't know how to tie the pair's positions to that minimum value. There are three vectors x, y, z, so if the closest distance were between, say, particles 5 and 10, the pair would be x[5], y[5], z[5] and x[10], y[10], z[10]. So even though I can extract the minimum distance, I don't know how to save the indices 5 and 10 in order to later eliminate those particles. I say eliminate, but I actually have to create a new particle out of that pair, positioned exactly at the midpoint of their distance (function in Helper).

  • How is the data you are working on distributed? How expensive is each operation? Commented Feb 16, 2016 at 9:36
  • @Zulan I added an edit that explains my problem, and added the source code Commented Feb 16, 2016 at 21:21

1 Answer


You are approaching this from the algorithmic side with a very naive algorithm. For a scalable solution, you never want to do O(n^2) operations in the first place.

Let's think about the data instead. How should you distribute it? This is a case for 3D domain decomposition. For simplicity, I'll make up some numbers: you divide your space into 100*100*100 cells, and you have 1000 MPI processes, each of which gets assigned a 10*10*10 block of cells. Each MPI rank then holds the particles of its own cells plus a halo of one cell in each direction, adding 10*10*6 + 10*12 + 8 cells [1][2].

How do you then compute the minimum distance? Each rank computes it locally for the particles within its own cells and its halo. The global minimum distance is then the minimum of all local minimum distances (MPI_Reduce).

Why is this correct? It is correct only if the true minimal distance is smaller than the edge length of a cell. You can guarantee that by choosing the cell size such that, even under ideal packing of the particles, the minimal distance stays smaller than the cell width.

Good news: finding the closest pair locally is a well-known problem, so you don't have to do a quadratic amount of operations locally either.

Now the interesting question is: what if the particles are not evenly distributed? Then you get load imbalance: some ranks have more work to do than others. This opens up a whole field of research. You can do load balancing on the regular grid by handing out different numbers of cells to each rank, or you can even go for unstructured/adaptive meshes. There are many papers and software libraries that can help you there.

Now this is a more complex answer than your direct question calls for. You could of course quite easily parallelize the brute-force algorithm, a very common thing to do with inefficient algorithms. Just hand out the rows cyclically, one to each rank in turn; given enough particles, the balance will be asymptotically good.

for (particle_a_index = my_rank; particle_a_index < num_particles; particle_a_index += num_procs) {
    for (particle_b_index = particle_a_index + 1; particle_b_index < num_particles; particle_b_index++) {
        /* compute the distance between particles a and b and track the minimum */
    }
}

But instead of doing that, you are better off implementing an efficient serial algorithm first.

[1] Except at the boundaries.

[2] You actually only need half of the halo for this specific problem.
