Most frequent occurrence (mode) of numpy array values based on IDs in another array

Question

I have a 2-D array containing values and I would like to calculate the most frequent entry (i.e., the mode) from this data according to IDs in a second array.

data = np.array([[ 0, 10, 50, 80, 80],
                 [10, 10, 50, 80, 90],
                 [10, 10, 50, 80, 90],
                 [50, 50, 80, 80, 80]])


ID = np.array([[1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 2, 2, 2, 3]])


#Expected Result is:

[10 50 80]

The most frequent value in data array for ID=1 is 10, ID=2 is 50 and ID=3 is 80. I've been playing around with np.unique and combinations of np.bincount and np.argmax but I can't figure out how to get the result. Any help?

javidcf · Accepted Answer · 2020-01-24 16:53:08Z

This is one possible vectorized way to do it, if you have integer data and the number of distinct values and groups is not too large (it builds a counts table with one entry per group/value pair).

import numpy as np

# Input data
data = np.array([[ 0, 10, 50, 80, 80],
                 [10, 10, 50, 80, 90],
                 [10, 10, 50, 80, 90],
                 [50, 50, 80, 80, 80]])
ID = np.array([[1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 2, 2, 2, 3]])
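# Find unique data values and group ids with reverse indexing
# (ravel the inverse indices: recent NumPy versions return them in the input's shape)
data_uniq, data_idx = np.unique(data, return_inverse=True)
id_uniq, id_idx = np.unique(ID, return_inverse=True)
data_idx, id_idx = data_idx.ravel(), id_idx.ravel()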
# Number of unique data values
n = len(data_uniq)
# Number of ids
m = len(id_uniq)
# Change indices so values of each group are within separate intervals
grouped = data_idx + (n * np.arange(m))[id_idx]
# Count repetitions and reshape
# counts[i, j] has the number of occurrences of the j-th value in the i-th group
counts = np.bincount(grouped, minlength=n * m).reshape(m, n)
# Get the modes from the counts
modes = data_uniq[counts.argmax(1)]
# Print result
for group, mode in zip(id_uniq, modes):
    print(f'Mode of {group}: {mode}')

Output:

Mode of 1: 10
Mode of 2: 50
Mode of 3: 80
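
To make the offset trick more concrete, here is a small sketch that rebuilds the counts table for the example arrays and prints it. It assumes the 2-D data and ID arrays from the question; the .ravel() calls are an addition so that the inverse indices are 1-D regardless of NumPy version.

```python
import numpy as np

data = np.array([[ 0, 10, 50, 80, 80],
                 [10, 10, 50, 80, 90],
                 [10, 10, 50, 80, 90],
                 [50, 50, 80, 80, 80]])
ID = np.array([[1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 2, 2, 2, 3]])

data_uniq, data_idx = np.unique(data, return_inverse=True)
id_uniq, id_idx = np.unique(ID, return_inverse=True)
data_idx, id_idx = data_idx.ravel(), id_idx.ravel()
n, m = len(data_uniq), len(id_uniq)

# Each (group, value) pair gets its own slot: slot = group_index * n + value_index
grouped = id_idx * n + data_idx
counts = np.bincount(grouped, minlength=n * m).reshape(m, n)

print(data_uniq)  # column labels: the distinct data values
print(counts)     # one row per ID, one column per distinct value
```

Each row of counts is the histogram of one group, so taking argmax along axis 1 picks the column (value) that occurs most often in that group.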

A quick benchmark for a particular problem size:

import numpy as np
import scipy.stats

def find_group_modes_loop(data, ID):
    # Assume ids are given sequentially starting from 1
    m = ID.max()
    modes = np.empty(m, dtype=data.dtype)
    for id in range(m):
        # np.atleast_1d copes with both array and scalar results (SciPy version dependent)
        modes[id] = np.atleast_1d(scipy.stats.mode(data[ID == id + 1])[0])[0]
    return modes

def find_group_modes_vec(data, ID):
    # Assume ids are given sequentially starting from 1
    data_uniq, data_idx = np.unique(data, return_inverse=True)
    id_uniq = np.arange(ID.max(), dtype=data.dtype)
    n = len(data_uniq)
    m = len(id_uniq)
    grouped = data_idx + (n * np.arange(m))[ID.ravel() - 1]
    counts = np.bincount(grouped, minlength=n * m).reshape(m, n)
    return data_uniq[counts.argmax(1)]

# Make data
np.random.seed(0)
data = np.random.randint(0, 1_000, size=10_000_000)
ID = np.random.randint(1, 100, size=10_000_000)
print(np.all(find_group_modes_loop(data, ID) == find_group_modes_vec(data, ID)))
# True
%timeit find_group_modes_loop(data, ID)
# 212 ms ± 647 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit find_group_modes_vec(data, ID)
# 122 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So at least for some cases the vectorized solution can be significantly faster than looping.
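
If the number of groups times the number of distinct values is too large for the dense counts table, a sort-based variant may be an option. The following is only a sketch and was not part of the benchmark above: the function name find_group_modes_sparse is hypothetical, it has not been timed, and it assumes integer data. It sorts the (ID, value) pairs, counts runs of equal pairs, and keeps the most frequent run per ID.

```python
import numpy as np

def find_group_modes_sparse(data, ID):
    """Per-ID mode without a dense (groups x values) table (illustrative sketch)."""
    data = np.asarray(data).ravel()
    ID = np.asarray(ID).ravel()
    # Sort by ID first, then by value within each ID
    order = np.lexsort((data, ID))
    ids_s, vals_s = ID[order], data[order]
    # Mark the start of each run of identical (ID, value) pairs
    new_run = np.ones(len(ids_s), dtype=bool)
    new_run[1:] = (ids_s[1:] != ids_s[:-1]) | (vals_s[1:] != vals_s[:-1])
    starts = np.flatnonzero(new_run)
    run_ids, run_vals = ids_s[starts], vals_s[starts]
    run_counts = np.diff(np.append(starts, len(ids_s)))
    # Within each ID: highest count first, ties broken by smallest value
    order2 = np.lexsort((run_vals, -run_counts, run_ids))
    ids2, vals2 = run_ids[order2], run_vals[order2]
    # The first run of each ID is now its mode
    first = np.ones(len(ids2), dtype=bool)
    first[1:] = ids2[1:] != ids2[:-1]
    return ids2[first], vals2[first]

data = np.array([[ 0, 10, 50, 80, 80],
                 [10, 10, 50, 80, 90],
                 [10, 10, 50, 80, 90],
                 [50, 50, 80, 80, 80]])
ID = np.array([[1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 1, 2, 3, 3],
               [1, 2, 2, 2, 3]])
ids, modes = find_group_modes_sparse(data, ID)
print(ids, modes)
```

This trades the memory of the counts table for the cost of sorting all elements, so whether it pays off depends on the problem size.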

I like this vectorized approach best as it runs quicker and logically fits with my code. Thanks so much!

Kenan · Answer · 2020-01-24 15:55:16Z

One approach is to use scipy.stats.mode:

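import numpy as np
from scipy.stats import mode

uniq_ids = np.unique(ID)
modes = []

for id in uniq_ids:
    # np.atleast_1d copes with both array and scalar results (SciPy version dependent)
    modes.append(np.atleast_1d(mode(data[ID == id])[0])[0])

print(np.array(modes))

[10 50 80]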

Shubham Shaswat · Answer · 2020-01-24 16:22:10Z

Here is an approach that uses only NumPy; I hope it solves your issue.

n, f = np.unique(data[np.where(ID == 1)], return_counts=True)

Output: (array([ 0, 10, 50]), array([1, 5, 1]))

The output is a tuple of the unique values and their respective frequencies.

You can get the value with the highest frequency like this:

n[np.argmax(f)]

The complete solution is:

res = []
for id in np.unique(ID):
    n, f = np.unique(data[np.where(ID == id)], return_counts=True)
    res.append(n[np.argmax(f)])

Mad Physicist · Answer · 2020-01-24 16:28:06Z

If you want a pure numpy solution, you can reinvent the wheel in @Kenan's loop:

def mode(x):
    n, c = np.unique(x, return_counts=True)
    return n[np.argmax(c)]

modes = [mode(data[ID == id]) for id in np.unique(ID)]
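
One detail worth knowing about this helper: np.unique returns the values in sorted order and np.argmax returns the first maximum, so a tie resolves to the smallest of the tied values. A small check with made-up input (the array below is purely illustrative):

```python
import numpy as np

def mode(x):
    n, c = np.unique(x, return_counts=True)
    return n[np.argmax(c)]

# 3 and 7 both appear twice; the smaller value wins the tie
print(mode(np.array([7, 3, 7, 3, 1])))  # 3
```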
