Parallelize zonal computation on a NumPy array

Question

I am trying to compute the mode over all cells that belong to the same zone (cells sharing the same value in a zone array) of a NumPy array. Example code is below. In this example the sequential approach works fine, but the multiprocessing approach does nothing, and I cannot find my mistake.

Does anyone see my error?

I would like to parallelize the computation because my real array is a 10k * 10k array with 1M zones.

import numpy as np
import scipy.stats as ss
import multiprocessing as mp

def zone_mode(i, a, b, output):
    to_extract = np.where(a == i)
    val = b[to_extract]
    output[to_extract] = ss.mode(val)[0][0]
    return output

def zone_mode0(i, a, b):
    to_extract = np.where(a == i)
    val = b[to_extract]
    output = ss.mode(val)[0][0]
    return output

np.random.seed(1)

zone = np.array([[1, 1, 1, 2, 3],
                 [1, 1, 2, 2, 3],
                 [4, 2, 2, 3, 3],
                 [4, 4, 5, 5, 3],
                 [4, 6, 6, 5, 5],
                 [6, 6, 6, 5, 5]])
values = np.random.randint(8, size=zone.shape)

output = np.zeros_like(zone, dtype=float)  # np.float was removed in NumPy 1.24

for i in np.unique(zone):
    output = zone_mode(i, zone, values, output)

# for multiprocessing    
zone0 = zone - 1

pool = mp.Pool(mp.cpu_count() - 1)
results = [pool.apply(zone_mode0, args=(u, zone0, values)) for u in np.unique(zone0)]
pool.close()
output = results[zone0]

The range for zone is 1 to 1 460 548, with missing values in it, but I can remap the range to 0 to 1 020 089. The range for values varies with my case study: [1, 2, 3, 5, 7, 8, 9, 61] or [111, 112, 212, 213, 411, 311, 312, 313, ...]
– user7017404, Commented Oct 1, 2019 at 5:52
I am preparing my data to test it. I will come back with feedback asap. Thank you for your help.
– user7017404, Commented Oct 1, 2019 at 6:39

Divakar · Accepted Answer · 2019-09-30 21:28:30Z

For non-negative integers in both arrays, zone and values, we can use np.bincount. The basic idea is to treat zone and values as the row and column indices of a 2D grid, so that each (zone, value) pair maps to its equivalent linear index. Those linear indices are used as bins for binned counting with np.bincount. After reshaping the counts back to 2D, the argmax along each row is the mode for that zone. The modes are then mapped back to the zone grid by indexing with zone.

Hence, the solution would be -

m = zone.max()+1
n = values.max()+1
ids = zone*n + values
c = np.bincount(ids.ravel(),minlength=m*n).reshape(-1,n).argmax(1)
out = c[zone]
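As a rough check of the steps above, here is a self-contained sketch that wraps them in a function and compares the result with a plain per-zone loop on the question's sample data. The function names zonal_mode_bincount and zonal_mode_loop are made up for this illustration, and the sketch assumes both arrays hold non-negative integers.

```python
import numpy as np

# Sample data from the question.
zone = np.array([[1, 1, 1, 2, 3],
                 [1, 1, 2, 2, 3],
                 [4, 2, 2, 3, 3],
                 [4, 4, 5, 5, 3],
                 [4, 6, 6, 5, 5],
                 [6, 6, 6, 5, 5]])
np.random.seed(1)
values = np.random.randint(8, size=zone.shape)


def zonal_mode_bincount(zone, values):
    """Mode of `values` per zone, broadcast back to the zone grid.

    Hypothetical helper; assumes non-negative integer inputs.
    """
    m = zone.max() + 1        # rows of the (zone, value) count grid
    n = values.max() + 1      # columns of the count grid
    ids = zone * n + values   # linear index of each (zone, value) pair
    counts = np.bincount(ids.ravel(), minlength=m * n).reshape(m, n)
    mode_per_zone = counts.argmax(axis=1)  # first maximum wins on ties
    return mode_per_zone[zone]


def zonal_mode_loop(zone, values):
    """Reference implementation: one pass per zone."""
    out = np.empty_like(values)
    for z in np.unique(zone):
        mask = zone == z
        out[mask] = np.bincount(values[mask]).argmax()
    return out


out = zonal_mode_bincount(zone, values)
assert np.array_equal(out, zonal_mode_loop(zone, values))
```

Note that the dense count grid has m*n entries, so memory grows with the number of zones times the largest value.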

For sparse data (integers spread widely across the input arrays), we can use a sparse matrix to get the argmax IDs c. With SciPy's sparse matrix -

from scipy.sparse import coo_matrix

data = np.ones(zone.size,dtype=int)
r,c = zone.ravel(),values.ravel()
c = coo_matrix((data,(r,c))).argmax(1).A1

For a slight performance boost, specify the shape -

c = coo_matrix((data,(r,c)),shape=(m,n)).argmax(1).A1
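Putting the sparse variant together, a minimal sketch might look like the following. The name zonal_mode_sparse is hypothetical, the explicit tocsr() conversion and the np.asarray(...).ravel() call (used here in place of .A1) are choices made for this illustration, and the inputs are again assumed to be non-negative integers.

```python
import numpy as np
from scipy.sparse import coo_matrix


def zonal_mode_sparse(zone, values):
    """Mode of `values` per zone using a sparse (zone, value) count matrix.

    Hypothetical helper; assumes non-negative integer inputs.
    """
    r = zone.ravel()                        # row index: zone id
    c = values.ravel()                      # column index: value
    data = np.ones(r.size, dtype=np.int64)  # one count per cell
    shape = (r.max() + 1, c.max() + 1)
    # Converting COO to CSR sums duplicate (row, col) entries,
    # which turns the ones into per-pair counts.
    counts = coo_matrix((data, (r, c)), shape=shape).tocsr()
    mode_per_zone = np.asarray(counts.argmax(axis=1)).ravel()
    return mode_per_zone[zone]


# Widely spread ids and values: only the occupied cells are stored.
zone = np.array([[0, 0, 0],
                 [3, 3, 3]])
values = np.array([[111, 111, 212],
                   [313, 212, 313]])
out = zonal_mode_sparse(zone, values)
```

Zone ids that never occur (1 and 2 here) produce empty rows, which is harmless because those rows are never looked up when indexing with zone.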

Solving for generic values

We will make use of pandas.factorize, like so -

import pandas as pd

ids,unq = pd.factorize(values.ravel())  # pass an ndarray; newer pandas warns on other inputs
v = ids.reshape(values.shape)
# .. same steps as earlier with bincount, using v in place of values
out = unq[c[zone]]

Note that in tie cases the pick depends on the order in which the values first appear in the array, so it can look arbitrary. To get the smallest of the tied values, use pd.factorize(values.ravel(), sort=True).
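For completeness, the generic-values route can be sketched end to end as below. The name zonal_mode_generic is invented for this example, and the sketch assumes that zone holds non-negative integers and that values contains no missing entries (pd.factorize would code those as -1, which np.bincount rejects).

```python
import numpy as np
import pandas as pd


def zonal_mode_generic(zone, values):
    """Mode per zone for arbitrary (negative, float, ...) values.

    Hypothetical helper; assumes non-negative integer zone ids
    and no missing entries in `values`.
    """
    # Replace each value by a small integer code; sort=True makes
    # code order follow value order, so ties go to the smallest value.
    codes, uniques = pd.factorize(values.ravel(), sort=True)
    v = codes.reshape(values.shape)

    m = zone.max() + 1
    n = len(uniques)
    ids = zone * n + v  # linear index of each (zone, code) pair
    counts = np.bincount(ids.ravel(), minlength=m * n).reshape(m, n)
    mode_code = counts.argmax(axis=1)

    # Map codes back to the original values, then onto the zone grid.
    return uniques[mode_code[zone]]


zone = np.array([[0, 0, 1],
                 [1, 1, 0]])
values = np.array([[-7.0, -7.0, 2.5],
                   [2.5, 100.0, 3.0]])
out = zonal_mode_generic(zone, values)
```

Because the count grid is sized by the number of distinct values rather than by the largest value, this also keeps the grid compact when the values are few but far apart.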

edited Sep 30, 2019 at 21:28

answered Sep 30, 2019 at 20:38



1 Comment

user7017404 Over a year ago

Your solution (with bincount) exceeded my expectations: less than 10 s on my data. Thank you so much.
