
I have a 3D numpy array filled with integers from 1 to 7.
I want to count the number of unique elements in the neighbouring cells of each cell. For example, in a 2D array:

a=[[1,1,1,7,4],
   [1,1,1,3,2],
   [1,1,1,2,2],
   [1,3,1,4,2],
   [1,1,1,4,2]]  

would yield a result of:

[[1,1,2,3,2],
 [1,1,2,3,3],
 [1,2,2,4,1],
 [2,1,3,3,2],
 [1,2,2,3,2]]  

I am currently going through every cell in the array and checking its neighbours one by one.

# inside a triple loop over every cell (x, y, z) of `model`
temp = np.zeros(6)
if x > 0:
    temp[0] = model[x-1, y, z]
if x < x_len - 1:
    temp[1] = model[x+1, y, z]
if y > 0:
    temp[2] = model[x, y-1, z]
if y < y_len - 1:
    temp[3] = model[x, y+1, z]
if z > 0:
    temp[4] = model[x, y, z-1]
if z < z_len - 1:
    temp[5] = model[x, y, z+1]
result[x, y, z] = np.count_nonzero(np.unique(temp))

I have found this to be quite slow and inefficient. Is there a more efficient/quicker way to do this?

Thanks.

4 Comments
  • This is something that is rather awkward to vectorise, but not impossible. How big is your array?
  • It is a 384*384*100 array.
  • My answer below is vectorized (but not necessarily very efficient); it takes approximately 2.5 seconds on my machine. The algorithm can naturally be made parallel quite efficiently. (Almost any algorithm solving this problem can.)
  • I checked, and yes, these solutions scale to your problem. See the example below, which is also slightly more efficient than the first example I gave.

2 Answers


Well, there might be a way:

  • create 6 offset arrays (left, right, up, down, front, back)
  • combine these arrays into a (R-2, C-2, D-2, 6) 4D array
  • sort the 4D array by the last dimension (the dimension with size 6)

Now you have a 4D array where you can pick a sorted vector of neighbours for each cell. After that you may count the different neighbours by:

  • apply np.diff along the 4th axis (the sorted one)
  • count the non-zero differences along the 4th axis

This will give you the number of different neighbours - 1.

The first part is probably rather clear. If a cell has neighbours (1, 2, 4, 2, 2, 3), the neighbour vector is sorted into (1, 2, 2, 2, 3, 4). The difference vector is then (1, 0, 0, 1, 1), and the number of non-zero elements ((np.diff(v) != 0).sum()) is 3. So there are 3 + 1 = 4 unique neighbours.

Of course, this method does not take the edges into account. You can solve that by padding the initial array by one cell in each direction with numpy.pad and mode='reflect'. (That mode is actually the only one that is guaranteed not to introduce any new values into the neighbourhood; try it with a two-dimensional array to understand why.)
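A quick one-dimensional sanity check (the values here are arbitrary) shows what reflect does at the edges:

import numpy as np

row = np.array([5, 1, 2, 3, 9])
print(np.pad(row, 1, mode='reflect'))
# -> [1 5 1 2 3 9 3]

The padded value on each side duplicates a value that is already a neighbour of the edge cell, so no new value can enter any neighbourhood.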

For example:

import numpy as np

# create some fictional data
dat = np.random.randint(1, 8, (6, 7, 8))

# pad the data by 1
datp = np.pad(dat, 1, mode='reflect')

# create the neighbouring 4D array
neigh = np.concatenate((
    datp[2:,1:-1,1:-1,None], datp[:-2,1:-1,1:-1,None], 
    datp[1:-1,2:,1:-1,None], datp[1:-1,:-2,1:-1,None],
    datp[1:-1,1:-1,2:,None], datp[1:-1,1:-1,:-2,None]), axis=3)

# sort the 4D array
neigh.sort(axis=3)

# calculate the number of unique samples
usamples = (np.diff(neigh, axis=3) != 0).sum(axis=3) + 1
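As a sanity check on small inputs, the result can be compared against a straightforward (and slow) loop. The helper below is hypothetical, not part of the original answer; it reuses dat and usamples from above:

# slow brute-force reference (hypothetical helper, for small test arrays only)
def unique_neighbours_brute(dat):
    datp = np.pad(dat, 1, mode='reflect')          # same edge handling as above
    out = np.empty(dat.shape, dtype=int)
    for x in range(dat.shape[0]):
        for y in range(dat.shape[1]):
            for z in range(dat.shape[2]):
                i, j, k = x + 1, y + 1, z + 1      # indices into the padded array
                out[x, y, z] = len({datp[i-1, j, k], datp[i+1, j, k],
                                    datp[i, j-1, k], datp[i, j+1, k],
                                    datp[i, j, k-1], datp[i, j, k+1]})
    return out

assert (usamples == unique_neighbours_brute(dat)).all()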

The solution above is quite universal, it works with anything sortable. However, it consumes a lot of memory (6 copies of the array) and is not a high-performance solution. If we are satisfied with a solution that only works with this special case (values are very small integers), we can do some bit magic.

  • create an array where every value is represented as a single-bit mask (1 = 00000010, 2 = 00000100, 3 = 00001000, etc., matching the 1 << dat shift below)
  • OR the neighbouring arrays together
  • count the number of bits in the ORed result by using a look-up table

For example:

import numpy as np

# create a "number of ones" lookup table
no_ones = np.array([bin(i).count("1") for i in range(256)], dtype='uint8')

# create some fictional data
dat = np.random.randint(1, 8, (6, 7, 8))

# create a bit mask of the cells
datb = 1 << dat.astype('uint8')

# pad the data by 1
datb = np.pad(datb, 1, mode='reflect')

# or the padded data together
ored = (datb[ 2:, 1:-1, 1:-1] |
        datb[:-2, 1:-1, 1:-1] |
        datb[1:-1,  2:, 1:-1] |
        datb[1:-1, :-2, 1:-1] |
        datb[1:-1, 1:-1,  2:] |
        datb[1:-1, 1:-1, :-2])

# get the number of neighbours from the LUT
usamples = no_ones[ored]

The performance impact is rather significant. The first version takes 2.57 s and the second version 283 ms on my machine with a 384 x 384 x 100 table (excluding creating the random data). This translates into 174 ns and 19 ns per cell, respectively.

This solution is, however, limited to the case where there is a reasonable number of different (and known) values. If the number of different possible values grows above 64, the bit magic loses its charm. (Also, at around 20 different values the look-up part has to be split into more than one operation due to the memory consumption of the LUT. The LUT should fit into the CPU cache, otherwise it becomes slow.)

On the other hand, expanding the solution to use the full 26-neighbourhood is simple and quite fast; see the sketch below.
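A minimal sketch of that extension, assuming the same bit-mask and LUT setup as above (the offset loop is my own generalisation, so it is worth verifying the reflect padding at the corners on a small array before trusting it):

import itertools
import numpy as np

no_ones = np.array([bin(i).count("1") for i in range(256)], dtype='uint8')
dat = np.random.randint(1, 8, (6, 7, 8))
datb = np.pad(1 << dat.astype('uint8'), 1, mode='reflect')

# OR together all 26 offset views of the padded bit-mask array
ored = np.zeros(dat.shape, dtype='uint8')
for offset in itertools.product((-1, 0, 1), repeat=3):
    if offset == (0, 0, 0):
        continue  # skip the centre cell itself
    view = tuple(slice(1 + o, (o - 1) or None) for o in offset)
    ored |= datb[view]

usamples = no_ones[ored]

The Python-level loop runs only 26 times; all the per-cell work stays vectorised in the OR.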


3 Comments

I like the use of diff on the sorted array!
1 << dat.astype('uint8') is awesome. If you make the neighbor pattern more generic you have the perfect answer. But +1 already!
that's a brilliant (and beautiful) solution. Thanks :)

You could try the following. It is not necessarily optimal, and it will cause problems if your data are too large, but here goes:

import numpy as np
from sklearn.feature_extraction.image import extract_patches

a = np.array([[1,1,1,7,4],
              [1,1,1,3,2],
              [1,1,1,2,2],
              [1,3,1,4,2],
              [1,1,1,4,2]])

patches = extract_patches(a, patch_shape=(3, 3), extraction_step=(1, 1))

neighbor_template = np.array([[0, 1, 0],
                              [1, 0, 1],
                              [0, 1, 0]]).astype(bool)
centers = patches[:, :, 1, 1]
neighbors = patches[:, :, neighbor_template]

possible_values = np.arange(1, 8)
counts = (neighbors[..., np.newaxis] ==
          possible_values[np.newaxis, np.newaxis, np.newaxis]).sum(2)

nonzero_counts = counts > 0
unique_counter = nonzero_counts.sum(-1)

print(unique_counter)

yields

[[1 2 3]
 [2 2 4]
 [1 3 3]]

This is the middle of the array you are expecting as a result. In order to obtain the full array including the borders, the input needs to be padded first: with numpy 1.8 you can use np.pad with mode='reflect' to pad by one cell, which also completes the border correctly (see the sketch below).
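A minimal sketch of that padded variant, reusing a and extract_patches from above:

# pad by one cell first so every original cell gets a full 3x3 patch
a_padded = np.pad(a, 1, mode='reflect')
patches = extract_patches(a_padded, patch_shape=(3, 3), extraction_step=(1, 1))
# patches now has shape (5, 5, 3, 3): one patch centred on each original cell,
# so the counting steps above produce the full 5x5 result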

Now let's move to 3D and make sure we don't use too much memory.

# first we generate a neighbors template
from scipy.ndimage import generate_binary_structure

neighbors = generate_binary_structure(3, 1)
neighbors[1, 1, 1] = False
neighbor_coords = np.array(np.where(neighbors)).T

data = np.random.randint(1, 8, (384, 384, 100))
data_neighbors = np.zeros((neighbors.sum(),) + tuple(np.array(data.shape) - 2), dtype=np.uint8)

# extract_patches only generates a strided view
data_view = extract_patches(data, patch_shape=(3, 3, 3), extraction_step=(1, 1, 1))

for neigh_coord, data_neigh in zip(neighbor_coords, data_neighbors):
    sl = tuple([slice(None)] * 3 + list(neigh_coord))
    data_neigh[:] = data_view[sl]

indicator = (data_neighbors[np.newaxis] ==
             possible_values[:, np.newaxis, np.newaxis, np.newaxis, np.newaxis]).sum(1) > 0

uniques = indicator.sum(0)

As before, you find the number of unique entries in uniques. Using methods like generate_binary_structure from scipy and the sliding window from extract_patches makes this approach general: if you wanted the full 26-neighborhood instead of a 6-neighborhood, you would only have to change generate_binary_structure(3, 1) to generate_binary_structure(3, 3) (see the check below). It also generalizes straightforwardly to extra dimensions, provided the amount of data generated fits in the memory of your machine.
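If in doubt about the connectivity argument, a quick check of the neighbour counts makes the options clear:

from scipy.ndimage import generate_binary_structure

for connectivity in (1, 2, 3):
    s = generate_binary_structure(3, connectivity)
    print(connectivity, int(s.sum()) - 1)  # 1 -> 6, 2 -> 18, 3 -> 26 neighbours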

8 Comments

Padding with mode='median' is not safe because it may introduce new elements into the neighbourhood. The pad modes are actually rather logical once you grasp where the median is calculated from. In this case reflect works, the others do not. Even symmetric is bad, but I needed to check that one with a 2D array to see the difference...
I realized that 'median' doesn't work just after submitting it :). Why does reflect work? Because it doesn't repeat the border and symmetric does?
Unfortunately np.bincount only works in 1D. This would be a perfect application if it generalized.
Exactly. The difference between reflect and symmetric is that symmetric repeats the edge points. For example, reflect reflects column 1 to the left hand side (new column) of column 0. This is very lucky, because then the neighbour on the right hand side gets reflected onto the left hand side, and thus no unique values are introduced.
With bincount you would have a bin vector and you'd need to count the nonzeros in that. Also, bincount is extremely efficient with large data (millions or billions of points), not so useful with small data (here 6 points). There is a reason why bincount is a 1D function.
