
I have a large array of primitive value types. The array is in fact one-dimensional, but logically represents a 2-dimensional field. As you read from left to right, each value needs to become (the original value of the current cell) + (the result calculated in the cell to the left) — with the obvious exception of the first element of each row, which keeps its original value.

I already have an implementation which accomplishes this, but it iterates over the entire array and is extremely slow for large (1M+ element) arrays.

Given the following example array,

0 0 1 0 0
2 0 0 0 3
0 4 1 1 0
0 1 0 4 1

Becomes

0 0 1 1 1
2 2 2 2 5
0 4 5 6 6
0 1 1 5 6

And so forth to the right, up to problematic sizes (1024x1024).

The array needs to be updated (ideally), but another array can be used if necessary. Memory footprint isn't much of an issue here, but performance is critical as these arrays have millions of elements and must be processed hundreds of times per second.
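The current implementation is essentially this row-wise running sum (a minimal sketch; `AccumulateRows`, `width`, and `height` are placeholder names for what my real code does):

```csharp
static class Baseline
{
    // Row-wise running sum over a flat array (the slow, straightforward version).
    // Each cell becomes its own value plus the accumulated value to its left;
    // the first cell of each row keeps its original value.
    public static void AccumulateRows(int[] data, int width, int height)
    {
        for (int row = 0; row < height; row++)
        {
            int sum = 0;
            for (int col = 0; col < width; col++)
            {
                sum += data[row * width + col];
                data[row * width + col] = sum;
            }
        }
    }
}
```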

The individual cell calculations do not appear to be parallelizable given their dependence on values starting from the left, so GPU acceleration seems impossible. I have investigated PLINQ, but its requirement for indices makes it very difficult to implement.

Is there another way to structure the data to make it faster to process?

If efficient GPU processing is feasible using an innovative technique, that would be vastly preferable, as this is currently texture data which is having to be pulled from and pushed back to the video card.

  • The output more precisely describes what I intend to do. It is a procedural calculation within a row, but whole rows (ranges of elements in the larger array) can be processed in parallel. Commented Dec 10, 2014 at 18:53
  • I have added an additional element to each row in the example. As you read from left to right, you add the original value of the current cell to the result calculated in the cell to the left. Commented Dec 10, 2014 at 18:56

4 Answers


Proper coding and a bit of insight into how .NET works helps as well :-)

Some rules of thumb that apply in this case:

  1. If you can hint to the JIT that the indexing will never go out of bounds of the array, it will remove the extra branch.
  2. You should only parallelize across multiple threads if it's really slow (e.g. >1 second). Otherwise task switching, cache flushes, etc. will probably just eat up the added speed and you'll end up worse off.
  3. If possible, make memory access predictable, even sequential. If you need another array, so be it; if you don't, prefer updating in place.
  4. Use as few IL instructions as possible if you want speed. Generally this seems to work.
  5. Test multiple iterations. A single iteration might not be good enough.

Using these rules, you can make a small test case as follows. Note that I've upped the stakes to 4Kx4K since 1K is just so fast you cannot measure it :-)

public static void Main(string[] args)
{
    int width = 4096;
    int height = 4096;

    int[] ar = new int[width * height];
    Random rnd = new Random(213);
    for (int i = 0; i < ar.Length; ++i)
    {
        ar[i] = rnd.Next(0, 120);
    }

    // (5)...
    for (int j = 0; j < 10; ++j)
    {
        Stopwatch sw = Stopwatch.StartNew();

        int sum = 0;
        for (int i = 0; i < ar.Length; ++i)  // (3) sequential access
        {
            if ((i % width) == 0)
            {
                sum = 0;
            }

            // (1) --> the JIT will notice this won't go out of bounds because [0<=i<ar.Length]
            // (4) --> '+=' is an expression generating a 'dup'; this creates less IL.
            ar[i] = (sum += ar[i]); 
        }

        Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
    }
    Console.ReadLine();
}

One of these iterations will take roughly 0.0174 sec here, and since this is about 16x the worst-case scenario you describe, I suppose your performance problem is solved.

If you really want to parallelize it to make it faster, I suppose that is possible, even though you will lose some of the optimizations in the JIT (specifically: (1)). However, if you have a multi-core system like most people, the benefits might outweigh the losses:

for (int j = 0; j < 10; ++j)
{
    Stopwatch sw = Stopwatch.StartNew();
    Parallel.For(0, height, (a) =>
    {
        int sum = ar[width * a]; // seed with the row's first element so it isn't skipped
        for (var i = width * a + 1; i < width * (a + 1); i++)
        {
            ar[i] = (sum += ar[i]);
        }
    });
    Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
}

If you really, really need performance, you can implement it in C++ and call it via P/Invoke. Even if you don't use the GPU, I suppose the SSE/AVX instructions might already give you a significant performance boost that you won't get with .NET/C#. Also I'd like to point out that the Intel C++ compiler can automatically vectorize your code — even for Xeon Phis. Without a lot of effort, this might give you a nice boost in performance.


2 Comments

I especially like your second point. Ultimately the length of the row will define the worth, good or bad, of parallelization.
Thanks. On that note, I usually don't even bother for a few million integer operations... After all, making your application single threaded has the nice feature that the rest of your system stays more responsive. :-)

Well, I don't know too much about GPU, but I see no reason why you can't parallelize it as the dependencies are only from left to right.

There are no dependencies between rows.

0 0 1 0 0  - process on core1 |
2 0 0 0 3  - process on core1 |
-------------------------------
0 4 1 1 0  - process on core2 |
0 1 0 4 1  - process on core2 |

Although the above statement is not completely true: there are still hidden dependencies between rows when it comes to the memory cache.

It's possible that there's going to be cache thrashing. You can read about "false sharing" to understand the problem and see how to overcome it.
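As a sketch of one way to limit that effect, each worker can be handed a contiguous range of rows instead of interleaved ones, so its writes stay in one contiguous region of the array (assuming a flat `int[]` of `width * height` as in the question; `ChunkedRowSum` is just an illustrative name):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class ChunkedRowSum
{
    // Partitioner.Create(0, height) hands out contiguous row ranges, so
    // writes from different workers only meet near the chunk boundaries,
    // which limits false sharing between cores.
    public static void Run(int[] data, int width, int height)
    {
        Parallel.ForEach(Partitioner.Create(0, height), range =>
        {
            for (int row = range.Item1; row < range.Item2; row++)
            {
                int sum = 0;
                for (int col = 0; col < width; col++)
                {
                    sum += data[row * width + col];
                    data[row * width + col] = sum;
                }
            }
        });
    }
}
```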



As @ChrisEelmaa said, it is possible to execute this in parallel by row. Using Parallel.For, it could be rewritten like this:

static int[,] values = new int[,]{
    {0, 0, 1, 0, 0},
    {2, 0, 0, 0, 3},
    {0, 4, 1, 1, 0},
    {0, 1, 0, 4, 1}};
static void Main(string[] args)
{
    int rows=values.GetLength(0);
    int columns=values.GetLength(1);
    Parallel.For(0, rows, (row) =>
    {
        for (var column = 1; column < columns; column++)
        {
            values[row, column] += values[row, column - 1];
        }
    });

    for (var row = 0; row < rows; row++)
    {
        for (var column = 0; column < columns; column++)
        {
            Console.Write("{0} ", values[row, column]);
        }
        Console.WriteLine();
    }
}

Since, as stated in your question, you actually have a one-dimensional array, the code would be a bit faster:

static void Main(string[] args)
{
    var values = new int[1024 * 1024];
    Random r = new Random();
    for (int i = 0; i < 1024; i++)
    {
        for (int j = 0; j < 1024; j++)
        {
            values[i * 1024 + j] = r.Next(25);
        }
    }

    int rows = 1024;
    int columns = 1024;

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        Parallel.For(0, rows, (row) =>
        {
            for (var column = 1; column < columns; column++)
            {
                values[(row * columns) + column] += values[(row * columns) + column - 1];
            }
        });
    }

    Console.WriteLine(sw.Elapsed);
}

But not as fast as a GPU. To use parallel GPU processing you would have to rewrite it in C++ AMP, or take a look at how to port this parallel for to CUDAfy: http://w8isms.blogspot.com.es/2012/09/cudafy-me-part-3-of-4.html



You may as well store the array as a jagged array; the memory layout within each row will be the same. So, instead of,

int[] texture;

you have,

int[][] texture;

Isolate the row operation as,

private static Task ProcessRow(int[] row)
{
    // Run the row on the thread pool so rows actually execute in parallel.
    return Task.Run(() =>
    {
        var v = row[0];
        for (var i = 1; i < row.Length; i++)
        {
            v = row[i] += v;
        }
    });
}

then you can write a function that does,

Task.WhenAll(texture.Select(ProcessRow)).Wait();

If you want to remain with a 1-dimensional array, a similar approach will work, just change ProcessRow.

private static Task ProcessRow(int[] texture, int start, int limit)
{
    // Run the row's range on the thread pool so rows actually execute in parallel.
    return Task.Run(() =>
    {
        var v = texture[start];
        for (var i = start + 1; i < limit; i++)
        {
            v = texture[i] += v;
        }
    });
}

then once,

var rowSize = 1024;
var rows =
    Enumerable.Range(0, texture.Length / rowSize)
    .Select(i => Tuple.Create(i * rowSize, (i * rowSize) + rowSize))
    .ToArray();

then, on each cycle,

Task.WhenAll(rows.Select(t => ProcessRow(texture, t.Item1, t.Item2))).Wait();

Either way, each row is processed in parallel.

