2

I try to compute mode on all cells of the same zone (same value) on a numpy array. I give you an example of code below. In this example sequential approach works fine but multiprocessed approach does nothing. I do not find my mistake.

Does someone see my error ?

I would like to parallelize the computation because my real array is a 10k * 10k array with 1M zones.

import numpy as np
import scipy.stats as ss
import multiprocessing as mp

def zone_mode(i, a, b, output):
    to_extract = np.where(a == i)
    val = b[to_extract]
    output[to_extract] = ss.mode(val)[0][0]
    return output

def zone_mode0(i, a, b):
    to_extract = np.where(a == i)
    val = b[to_extract]
    output = ss.mode(val)[0][0]
    return output

np.random.seed(1)

zone = np.array([[1, 1, 1, 2, 3],
                 [1, 1, 2, 2, 3],
                 [4, 2, 2, 3, 3],
                 [4, 4, 5, 5, 3],
                 [4, 6, 6, 5, 5],
                 [6, 6, 6, 5, 5]])
values = np.random.randint(8, size=zone.shape)

output = np.zeros_like(zone).astype(np.float)

for i in np.unique(zone):
    output = zone_mode(i, zone, values, output)

# for multiprocessing    
zone0 = zone - 1

pool = mp.Pool(mp.cpu_count() - 1)
results = [pool.apply(zone_mode0, args=(u, zone0, values)) for u in np.unique(zone0)]
pool.close()
output = results[zone0]
4
  • what is the range for your values? Commented Sep 30, 2019 at 20:55
  • the range for zone is 1 to 1 460 548 with missing values in. But I can update the range from 0 to 1 020 089. The range for values can varry accordig to my case study : [1, 2, 3, 5, 7, 8, 9, 61] or [111, 112, 212, 213, 411, 311, 312, 313, ...] Commented Oct 1, 2019 at 5:52
  • @user7017404 Any feedback on the posted solution? Commented Oct 1, 2019 at 6:34
  • I am preparing my data to test it. I come back for the feedback asap. Thank you for your help. Commented Oct 1, 2019 at 6:39

1 Answer 1

1

For positve integers in the arrays - zone and values, we can use np.bincount. The basic idea is that we will consider zone and values as row and cols on a 2D grid. So, can map those to their linear index equivalent numbers. Those would be used as bins for binned summation with np.bincount. Their argmax IDs would be the mode numbers. They are mapped back to zone-grid with indexing into zone.

Hence, the solution would be -

m = zone.max()+1
n = values.max()+1
ids = zone*n + values
c = np.bincount(ids.ravel(),minlength=m*n).reshape(-1,n).argmax(1)
out = c[zone]

For sparsey data (well spread integers in the input arrays), we can look into sparse-matrix to get the argmax IDs c. Hence, with SciPy's sparse-matrix -

from scipy.sparse import coo_matrix

data = np.ones(zone.size,dtype=int)
r,c = zone.ravel(),values.ravel()
c = coo_matrix((data,(r,c))).argmax(1).A1

For slight perf. boost, specify the shape -

c = coo_matrix((data,(r,c)),shape=(m,n)).argmax(1).A1

Solving for generic values

We will make use of pandas.factorize, like so -

import pandas as pd

ids,unq = pd.factorize(values.flat)
v = ids.reshape(values.shape)
# .. same steps as earlier with bincount, using v in place of values
out = unq[c[zone]]

Note that for tie-cases, it would pick random element off values. If you want to pick the first one, use pd.factorize(values.flat, sort=True).

Sign up to request clarification or add additional context in comments.

1 Comment

Your solution (with bincount) is beyond my expectation. Less than 10s on my data. Thank you so much.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.