
Is there a way to do a group-by aggregation over multiple columns in NumPy? I'm trying to do it with this module: https://github.com/ml31415/numpy-groupies The goal is a faster groupby than pandas. For example:

import numpy as np
from numpy_groupies import aggregate

group_idx = np.array([
    np.array([4, 3, 3, 4, 4, 1, 1, 1, 7, 8, 7, 4, 3, 3, 1, 1]),
    np.array([4, 3, 2, 4, 7, 1, 4, 1, 7, 8, 7, 2, 3, 1, 14, 1]),
    np.array([1, 2, 3, 4, 5, 1, 1, 2, 3, 4, 5, 4, 2, 3, 1, 1])
])
a = np.array([1, 2, 1, 2, 1, 2, 1, 2, 3, 4, 5, 4, 2, 3, 1, 1])

result = aggregate(group_idx, a, func='sum')

It should behave like pandas df.groupby(['column1','column2','column3']).sum().reset_index()
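For reference, the pandas version on the sample data might look like this (the column names 'column1'…'column3' and 'to_sum' are made up for illustration):

```python
import pandas as pd

# build a frame from the sample arrays above
df = pd.DataFrame({
    'column1': [4, 3, 3, 4, 4, 1, 1, 1, 7, 8, 7, 4, 3, 3, 1, 1],
    'column2': [4, 3, 2, 4, 7, 1, 4, 1, 7, 8, 7, 2, 3, 1, 14, 1],
    'column3': [1, 2, 3, 4, 5, 1, 1, 2, 3, 4, 5, 4, 2, 3, 1, 1],
    'to_sum':  [1, 2, 1, 2, 1, 2, 1, 2, 3, 4, 5, 4, 2, 3, 1, 1],
})
result = df.groupby(['column1', 'column2', 'column3'])['to_sum'].sum().reset_index()
```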

  • Are they all positive numbers in group_idx? Commented Oct 26, 2020 at 9:06
  • Yes, there are only positive values in group_idx. Commented Oct 26, 2020 at 9:23
  • Are you OK with adding a dependency on numba? Commented Oct 26, 2020 at 17:06
  • Sure, I think that's possible. Commented Oct 27, 2020 at 8:25
  • Can you show us the exact output format? Are you looking for a 2D array output with the index columns and the summations? Commented Oct 28, 2020 at 12:01

1 Answer


Given that group_idx has positive values, we can use a dimensionality-reduction based method. We assume the first three columns are the groupby keys and the last (fourth) one is the data column to be summed.
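The dimensionality reduction relies on np.ravel_multi_index, which maps each row of group keys to a single linear label, so identical key rows collapse to identical integers; a minimal illustration (toy data, not from the question):

```python
import numpy as np

keys = np.array([[1, 0, 2],
                 [0, 1, 2],
                 [1, 0, 2]])        # three rows of 3-column group keys
dims = keys.max(0) + 1              # per-column extents: [2, 2, 3]
lidx = np.ravel_multi_index(keys.T, dims)
# rows 0 and 2 are equal, so their linear labels match
```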

Approach #1

We will stick to NumPy tools and also bring pandas.factorize into the mix.

import numpy as np
import pandas as pd

group_idx = df.iloc[:,:3].values   # groupby key columns
a = df.iloc[:,-1].values           # data column to be summed

s = group_idx.max(0)+1             # per-column extents
lidx = np.ravel_multi_index(group_idx.T,s)   # collapse keys to 1D labels

sidx, unq_lidx = pd.factorize(lidx)          # dense group id per row
pp = np.empty(len(unq_lidx), dtype=int)
pp[sidx] = np.arange(len(sidx))              # a representative row per group
k1 = group_idx[pp]                           # key columns for each group

a_sums = np.bincount(sidx,a)                 # per-group sums
out = np.hstack((k1, a_sums.astype(int)[:,None]))
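Wrapped as a function and exercised on a small made-up frame (the function name groupby_sum_np and the column names are illustrative, not from the answer):

```python
import numpy as np
import pandas as pd

def groupby_sum_np(group_idx, a):
    # group_idx: (N, 3) int key array; a: (N,) values to sum per key row
    s = group_idx.max(0) + 1
    lidx = np.ravel_multi_index(group_idx.T, s)
    sidx, unq_lidx = pd.factorize(lidx)
    pp = np.empty(len(unq_lidx), dtype=int)
    pp[sidx] = np.arange(len(sidx))          # a representative row per group
    k1 = group_idx[pp]
    a_sums = np.bincount(sidx, a)
    return np.hstack((k1, a_sums.astype(int)[:, None]))

df = pd.DataFrame({'agg_a': [4, 3, 3, 4], 'agg_b': [4, 3, 2, 4],
                   'agg_c': [1, 2, 3, 1], 'to_sum': [1, 2, 1, 2]})
out = groupby_sum_np(df.iloc[:, :3].values, df.iloc[:, -1].values)
# rows 0 and 3 share the key (4, 4, 1), so their values 1 + 2 are summed
```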

Approach #2

Bringing in numba and sorting -

import numba as nb

@nb.njit
def step_sum(a_s, step_mask, out, group_idx_s):
    # a_s, group_idx_s: values and keys sorted by linear group label;
    # step_mask[i] is True when rows i and i+1 belong to the same group
    N = len(a_s)
    count_iter = 0
    for j in range(3):
        out[count_iter,j] = group_idx_s[0,j]
    out[count_iter,3] = a_s[0]
    for i in range(1,N):
        if step_mask[i-1]:
            out[count_iter,3] += a_s[i]      # same group: accumulate
        else:
            count_iter += 1                  # new group: copy keys, start sum
            for j in range(3):
                out[count_iter,j] = group_idx_s[i,j]
            out[count_iter,3] = a_s[i]
    return out

group_idx = df.iloc[:,:3].values
a = df.iloc[:,-1].values

s = group_idx.max(0)+1
lidx = np.ravel_multi_index(group_idx.T,s)

sidx = lidx.argsort()               # sort rows by linear group label
lsidx = lidx[sidx]
group_idx_s = group_idx[sidx]
a_s = a[sidx]

step_mask = lsidx[:-1] == lsidx[1:]   # True where consecutive rows share a group
N = len(lsidx)-step_mask.sum()        # number of distinct groups
out = np.zeros((N, 4), dtype=int)
out = step_sum(a_s, step_mask, out, group_idx_s)
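For intuition (and as a numba-free cross-check, not part of the original answer), the same sort-then-segment-sum idea can be sketched with np.add.reduceat on toy data:

```python
import numpy as np

group_idx = np.array([[4, 4, 1], [3, 3, 2], [3, 2, 3], [4, 4, 1]])
a = np.array([1, 2, 1, 2])

s = group_idx.max(0) + 1
lidx = np.ravel_multi_index(group_idx.T, s)
sidx = lidx.argsort()
lsidx = lidx[sidx]
starts = np.flatnonzero(np.r_[True, lsidx[1:] != lsidx[:-1]])  # segment starts
sums = np.add.reduceat(a[sidx], starts)                        # per-segment sums
out = np.hstack((group_idx[sidx][starts], sums[:, None]))
```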

Comparison checks

For the comparison check, we can use something like:

# get pandas o/p and lexsort
p = df.groupby(['agg_a','agg_b','agg_c'])['to_sum'].sum().reset_index().values
p = p[np.lexsort(p[:,:3].T)]

# Output from our approaches here, say `out`. Let's lexsort
out = out[np.lexsort(out[:,:3].T)]

print(np.allclose(out, p))

11 Comments

Thanks for the code. It looks good, but when we are handling bigger arrays, np.bincount runs into memory problems, because it wants to allocate too much memory. On a speed test it's much faster: 199 µs vs 5.85 ms for pandas.
@ChristianFrei Check out the just-added Tweak #1. So, skip the last three steps from Approach #1 and use the tweak steps instead.
Thanks, I have tested it. On small datasets it's working well, but on a bigger one I get a shape mismatch on np.vstack: ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 2357420 and the array at index 1 has size 2355491 The difference is only 2071...
@ChristianFrei Can you edit np.bincount(sidx,a) to np.bincount(sidx,a, minlength=unq_groups.shape[1]) and try?
There is a small issue on the last line of Approach #2: you have to remove "lsidx" from the arguments, because it's not a parameter of the step_sum func.
