1

I have a gridded rectangular file that I have read into an array. This gridded file contains data values and NODATA values; the data values make up a continuous odd shape inside of the array, with NODATA values filling in the rest to keep the gridded file rectangular. I perform operations on the data values and skip the NODATA values.

The operations I perform on the data values consist of examining the 8 surrounding neighbors (the current cell is the center of a 3x3 grid). I can handle when any of the eight neighbors are NODATA values, but when actual data values fall in the first or last row/column, I trigger an error by trying to access an array value that doesn't exist.

To get around this I have considered three options:

  1. Add a new first and last row/column with NODATA values, and adjust my code accordingly - I can cycle through the internal 'original' array and handle the new NODATA values like the edges I'm already handling that don't fall in the first and last row/column.

  2. I can create specific processes for handling the cells in first and last row/column that have data - modified for loops (a for loop that steps through a specific sequence/range) that only examine the surrounding cells that exist, though since I still need 8 neighboring values (NODATA/non-existent cells are given the same value as the central cell) I would have to copy blank/NODATA values to a secondary 3x3 grid. Though there maybe a way to avoid the secondary grid. This solution is annoying as I have to code up specialized routines to all corner cells (4 different for loops) and any cell in the 1st or last row/column (another 4 different for loops). With a single for loop for any non-edge cell.

  3. Use a map, which based on my reading, appears capable of storing the original array while letting me search for locations outside the array without triggering an error. In this case, I still have to give these non-existent cells a value (equal to the center of the array) and so may or may not have to set up a secondary 3x3 grid as well; once again there maybe a way to avoid the secondary grid.

Solution 1 seems the simplest, solution 3 the most clever, and 2 the most annoying. Are there any solutions I'm missing? Or does one of these solutions deserve to be the clear winner?

6
  • 2
    Too much text, too little (none) code. See How do I ask a good question?. Commented Feb 10, 2015 at 6:07
  • Agree needs condensing and clarification, but maybe not code. Appears to be asking for an algorithm. Based on what I can ascertain, 1 and 2 are typical solutions. 3 would work too, but I'd probably use it in the context of 2. Commented Feb 10, 2015 at 6:09
  • 1
    It would be stupid, in my opinion, to add code to this since this question would grow by 4-5 times its current length and wouldn't answer the questions I am asking. Which is, are any of these solutions clearly better? Is there a better one I'm missing? I know how to implement any of these three solutions, but don't want to waste time implementing the two that aren't the best. Commented Feb 10, 2015 at 6:10
  • For instance, this question alludes to solution 3: stackoverflow.com/questions/256807/check-if-array-index-exists, but unless there are other benefits to using a map with std::pair<int, int> as the key, solution could be better. But, because I know so little, I figured I'd share my quandary with the wider programming world. Commented Feb 10, 2015 at 6:18
  • Sounds like the problem is your algorithm, and not out of bounds array access. Commented Feb 10, 2015 at 6:25

5 Answers 5

3

My advice is to replace all read accesses to the array by a function. For example, arr[i][j] by getarr(i,j). That way, all your algorithmic code stays more or less unchanged and you can easily return NODATA for indices outside bounds.

But I must admit that it is only my opinion.

Sign up to request clarification or add additional context in comments.

1 Comment

I like this because I could return my NODATA value if the location isn't present and then default to code I already have that sets NODATA values equal to the current cell's value. May think on this more before selecting as solution. I was really intrigued by the map option.....oh well, doesn't always make sense to use some 'cool' feature simply because I can.
2

I've had to do this before and the fastest solution was to expand the region with NODATA values and iterate over the interior. This way the core loop is simple for the compiler to optimize.

If this is not a computational hot-spot in the code, I'd go with Serge's approach instead though.

To minimize rippling effects I used an array structure with explicit row/column strides, something like this:

class Grid {
  private:
    shared_ptr<vector<double>> data;
    int origin;
    int xStride;
    int yStride;

  public:
    Grid(int nx, int ny) : 
      data( new vector<double>(nx*ny) ),
      origin(0),
      xStride(1), 
      yStride(nx) {
    }

    Grid(int nx, int ny, int padx, int pady) :
      data( new vector<double>((nx+2*padx)*(ny+2*pady));
      xStride(1), 
      yStride(nx+2*padx),
      origin(nx+3*padx) {
    }

  double& operator()(int x, int y) {
     return (*data)[origin + x*xStride + y*yStride];
  }
}

Now you can do

  Grid g(5,5,1,1);
  Grid g2(5,5);
  //Initialise
  for(int i=0; i<5; ++i) {
    for(int j=0; j<5; ++j) {
     g(i,j)=i+j;
    }
  }
  // Convolve (note we don't care about going outside the
  // range, and our indices are unchanged between the two
  // grids.
  for(int i=0; i<5; ++i) {
    for(int j=0; j<5; ++j) {
      g2(i,j)=0;
      g2(i,j)+=g(i-1,j);
      g2(i,j)+=g(i+1,j);
      g2(i,j)+=g(i,j-1);
      g2(i,j)+=g(i,j+1);
    }
  }

Aside: This data structure is awesome for working with transposes, and sub-matrices. Each of those is just an adjustment of the offset and stride values.

9 Comments

I do agree with you for the fastest way.
It is a computational hot spot; however, I may test out @SergeBallesta's solution as changes I make to this grid may ripple throughout other dependent grids, and any additional NODATA rows/columns I add will have to be removed (from all grids) later on (before output) to ensure that embedded locational coding in the header remains accurate. If the 'rippling' isn't too bad, I will test both.
The problem is a well-known problem in image processing, and this answer is the canonical solution. That said, the implementation is not impressive. static Grid* create ?! This isn't 1992. We've had constructors for decades. Similarly, new double[ ].
@MichaelAnderson, shouldn't this "origin = 2*yStride + padx;" be simply "origin = yStride + padx" in the padded grid? Thanks.
@MSalters I hear what you're saying, and agree - this was never intended as a complete drop in solution... But I've changed the code to be more resource safe and address what I think are the majority of your concerns. For a complete solution it should really contain nx and ny and do bounds checking and parameter checking in the constructor etc, but I'll leave that as an exercise for the reader.
|
1

Solution 1 is the standard solution. It takes maximum advantage of modern computer architectures, where a few bytes of memory are no big deal, and correct instruction prediction accelerates performance. As you keep accessing memory in a predictable pattern (with fixed strides), the CPU prefetcher will successfully read ahead.

Solution 2 saves a small amount of memory, but the special handling of the edges incurs a real slowdown. Still, the large chunk in the middle benefits from the prefetcher.

Solution 3 is horrible. Map access is O(log N) instead of O(1), and in practice it can be 10-20 times slower. Maps have poor locality of reference; the CPU prefetcher will not kick in.

2 Comments

Do the benefits of solution 1 still exist if instead of adding a buffer the array values are accessed through a function as @SergeBallesta suggests?
@traggatmot: The branch in the access function slows the code down, although the branch predictor will predict it reasonably well. However, even a correct prediction has non-zero costs.
0

If simple means "easy to read" I'd recommend you declare a class with an overloaded [] operator. Use it like a regular array but it'll have bounds checking to handle NODATA.

If simple means "high performance" and you have sparse grid with isolated DATA consider implementing linked lists to the DATA values and implement optimal operators that go directly to tge DATA values.

Comments

0

1 wastes memory proportional to your overall rectangle size, 3/maps are clumsy here, 2 is actually very easy to do:

T d[X][Y] = ...;

for (int x = 0; x < X; ++x)
    for (int y = 0; y < Y; ++y)  // move over d[x][y] centres
    {
        T r[3][3] = { { d[i,j], d[i,j], d[i,j] },
                        d[i,j], d[i,j], d[i,j] },
                        d[i,j], d[i,j], d[i,j] } };

        for (int i = std::min(0, x-1); i < std::max(X-1, x+1); ++i)
            for (int j = std::min(0, y-1); j < std::max(Y-1, y+1); ++j)
                if (d[i][j] != NoData)
                    r[i-x][j-y] = d[i][j];

        // use r for whatever...
    }

Note that I'm using signed int very deliberately so x-1 and y-1 don't become huge positive numbers (as they would with say size_t) and break the std::min logic... but you could express it differently if you had some reason to prefer size_t (e.g. x == 0 ? 0 : x - 1).

3 Comments

This will take me a little while to digest. As of right now, I have never written any code this complicated and so will probably stay away from this solution for the time being til I fully understand it.
@traggatmot: sure - no worries. You can maybe copy/paste it into a small program, flesh it out, step through it in a debugger to see what it's doing....
@traggatmot: actually noticed a bug left when I'd switched approaches while writing the answer, so had to hack this around further.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.