
Is there a way to do a group-by aggregation over multiple columns in NumPy? I'm trying to do it with this module: https://github.com/ml31415/numpy-groupies The goal is a faster groupby than pandas. For example:

import numpy as np
from numpy_groupies import aggregate

group_idx = np.array([
    np.array([4, 3, 3, 4, 4, 1, 1, 1, 7, 8, 7, 4, 3, 3, 1, 1]),
    np.array([4, 3, 2, 4, 7, 1, 4, 1, 7, 8, 7, 2, 3, 1, 14, 1]),
    np.array([1, 2, 3, 4, 5, 1, 1, 2, 3, 4, 5, 4, 2, 3, 1, 1])
])
a = np.array([1, 2, 1, 2, 1, 2, 1, 2, 3, 4, 5, 4, 2, 3, 1, 1])

result = aggregate(group_idx, a, func='sum')

It should behave like pandas df.groupby(['column1','column2','column3']).sum().reset_index()
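For reference, the pandas version on the sample data might look like this (the column names 'column1'…'column3' and 'to_sum' are made up for illustration):

```python
import pandas as pd

# build a frame from the sample arrays above
df = pd.DataFrame({
    'column1': [4, 3, 3, 4, 4, 1, 1, 1, 7, 8, 7, 4, 3, 3, 1, 1],
    'column2': [4, 3, 2, 4, 7, 1, 4, 1, 7, 8, 7, 2, 3, 1, 14, 1],
    'column3': [1, 2, 3, 4, 5, 1, 1, 2, 3, 4, 5, 4, 2, 3, 1, 1],
    'to_sum':  [1, 2, 1, 2, 1, 2, 1, 2, 3, 4, 5, 4, 2, 3, 1, 1],
})
result = df.groupby(['column1', 'column2', 'column3'])['to_sum'].sum().reset_index()
```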

  • Are they all positive numbers in group_idx? Commented Oct 26, 2020 at 9:06
  • Yes, there are only positive values in group_idx. Commented Oct 26, 2020 at 9:23
  • Are you OK with adding a dependency on numba? Commented Oct 26, 2020 at 17:06
  • Sure, I think that's possible. Commented Oct 27, 2020 at 8:25
  • Can you show us the exact output format? Are you looking for a 2D array output with the index columns and the summations? Commented Oct 28, 2020 at 12:01

1 Answer


Given that group_idx has positive values, we can use a dimensionality-reduction based method. We assume the first three columns are the groupby keys and the last (fourth) one is the data column to be summed.
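The dimensionality reduction relies on np.ravel_multi_index, which maps each row of group keys to a single linear label, so identical key rows collapse to identical integers; a minimal illustration (toy data, not from the question):

```python
import numpy as np

keys = np.array([[1, 0, 2],
                 [0, 1, 2],
                 [1, 0, 2]])        # three rows of 3-column group keys
dims = keys.max(0) + 1              # per-column extents: [2, 2, 3]
lidx = np.ravel_multi_index(keys.T, dims)
# rows 0 and 2 are equal, so their linear labels match
```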

Approach #1

We will stick to NumPy tools and also bring pandas.factorize into the mix.

import numpy as np
import pandas as pd

group_idx = df.iloc[:,:3].values   # groupby key columns
a = df.iloc[:,-1].values           # data column to be summed

s = group_idx.max(0)+1             # per-column extents
lidx = np.ravel_multi_index(group_idx.T,s)   # collapse keys to 1D labels

sidx, unq_lidx = pd.factorize(lidx)          # dense group id per row
pp = np.empty(len(unq_lidx), dtype=int)
pp[sidx] = np.arange(len(sidx))              # a representative row per group
k1 = group_idx[pp]                           # key columns for each group

a_sums = np.bincount(sidx,a)                 # per-group sums
out = np.hstack((k1, a_sums.astype(int)[:,None]))
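Wrapped as a function and exercised on a small made-up frame (the function name groupby_sum_np and the column names are illustrative, not from the answer):

```python
import numpy as np
import pandas as pd

def groupby_sum_np(group_idx, a):
    # group_idx: (N, 3) int key array; a: (N,) values to sum per key row
    s = group_idx.max(0) + 1
    lidx = np.ravel_multi_index(group_idx.T, s)
    sidx, unq_lidx = pd.factorize(lidx)
    pp = np.empty(len(unq_lidx), dtype=int)
    pp[sidx] = np.arange(len(sidx))          # a representative row per group
    k1 = group_idx[pp]
    a_sums = np.bincount(sidx, a)
    return np.hstack((k1, a_sums.astype(int)[:, None]))

df = pd.DataFrame({'agg_a': [4, 3, 3, 4], 'agg_b': [4, 3, 2, 4],
                   'agg_c': [1, 2, 3, 1], 'to_sum': [1, 2, 1, 2]})
out = groupby_sum_np(df.iloc[:, :3].values, df.iloc[:, -1].values)
# rows 0 and 3 share the key (4, 4, 1), so their values 1 + 2 are summed
```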

Approach #2

Bringing in numba and sorting -

import numba as nb

@nb.njit
def step_sum(a_s, step_mask, out, group_idx_s):
    # a_s, group_idx_s: values and keys sorted by linear group label;
    # step_mask[i] is True when rows i and i+1 belong to the same group
    N = len(a_s)
    count_iter = 0
    for j in range(3):
        out[count_iter,j] = group_idx_s[0,j]
    out[count_iter,3] = a_s[0]
    for i in range(1,N):
        if step_mask[i-1]:
            out[count_iter,3] += a_s[i]      # same group: accumulate
        else:
            count_iter += 1                  # new group: copy keys, start sum
            for j in range(3):
                out[count_iter,j] = group_idx_s[i,j]
            out[count_iter,3] = a_s[i]
    return out

group_idx = df.iloc[:,:3].values
a = df.iloc[:,-1].values

s = group_idx.max(0)+1
lidx = np.ravel_multi_index(group_idx.T,s)

sidx = lidx.argsort()               # sort rows by linear group label
lsidx = lidx[sidx]
group_idx_s = group_idx[sidx]
a_s = a[sidx]

step_mask = lsidx[:-1] == lsidx[1:]   # True where consecutive rows share a group
N = len(lsidx)-step_mask.sum()        # number of distinct groups
out = np.zeros((N, 4), dtype=int)
out = step_sum(a_s, step_mask, out, group_idx_s)
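For intuition (and as a numba-free cross-check, not part of the original answer), the same sort-then-segment-sum idea can be sketched with np.add.reduceat on toy data:

```python
import numpy as np

group_idx = np.array([[4, 4, 1], [3, 3, 2], [3, 2, 3], [4, 4, 1]])
a = np.array([1, 2, 1, 2])

s = group_idx.max(0) + 1
lidx = np.ravel_multi_index(group_idx.T, s)
sidx = lidx.argsort()
lsidx = lidx[sidx]
starts = np.flatnonzero(np.r_[True, lsidx[1:] != lsidx[:-1]])  # segment starts
sums = np.add.reduceat(a[sidx], starts)                        # per-segment sums
out = np.hstack((group_idx[sidx][starts], sums[:, None]))
```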

Comparison checks

For the comparison check, we can use something like:

# get pandas o/p and lexsort
p = df.groupby(['agg_a','agg_b','agg_c'])['to_sum'].sum().reset_index().values
p = p[np.lexsort(p[:,:3].T)]

# Output from our approaches here, say `out`. Let's lexsort
out = out[np.lexsort(out[:,:3].T)]

print(np.allclose(out, p))

11 Comments

Thanks for the code. It looks good, but when we are handling bigger arrays, np.bincount runs into memory problems, because it wants to allocate too much memory. On a speed test it's much faster: 199 µs vs 5.85 ms for pandas.
@ChristianFrei Check out the just-added Tweak #1. So, skip the last three steps from Approach #1 and use the tweak steps instead.
Thanks, I have tested it. On small datasets it's working well, but on a bigger one I get a shape mismatch on np.vstack: ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 2357420 and the array at index 1 has size 2355491 The difference is only 2071...
@ChristianFrei Can you edit np.bincount(sidx,a) to np.bincount(sidx,a, minlength=unq_groups.shape[1]) and try?
There is a small issue on the last line of Approach #2: you have to remove "lsidx" from the arguments, because it's not a parameter of the step_sum func.
