5

I'm trying to build a function that returns the products of subsets of array elements. Basically I want to build a prod_by_group function that does this:

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])

Vprods = prod_by_group(values, groups)

And the resulting Vprods should be:

Vprods
array([6, 4, 30])

There's a great answer here for sums of elements, and I think the solution should be similar: https://stackoverflow.com/a/4387453/1085691

I tried taking the log first, then sum_by_group, then exp, but ran into numerical issues.
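For illustration, here is a minimal sketch of what such a log/sum/exp approach might look like. The question does not show the code that was actually tried or say which numerical issues occurred, so the use of np.unique and np.bincount here is an assumption. The point is only that the result comes back as floats that approximate the exact integer products, and that np.log returns nan for negative inputs.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])

# Map each group label to a dense index 0..n_groups-1 (assumed helper step).
_, inverse = np.unique(groups, return_inverse=True)

# Sum the logs within each group, then exponentiate to get back to products.
log_sums = np.bincount(inverse, weights=np.log(values))
approx_prods = np.exp(log_sums)

# The results are float64 values, only approximately equal to 6, 4 and 30.
print(approx_prods.dtype)
print(np.allclose(approx_prods, [6, 4, 30]))
```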

There are some other similar answers here for min and max of elements by group: https://stackoverflow.com/a/8623168/1085691

Edit: Thanks for the quick answers! I'm trying them out. I should add that I want it to be as fast as possible (that's the reason I'm trying to get it in numpy in some vectorized way, like the examples I gave).

Edit: I evaluated all the answers given so far, and the best one is given by @seberg below. Here's the full function that I ended up using:

def prod_by_group(values, groups):
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
    return np.multiply.reduceat(values, group_changes)
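
As a quick check, the function above can be run on unsorted group labels. The input arrays below are made up for illustration; the products come back in order of the sorted group labels.

```python
import numpy as np

def prod_by_group(values, groups):
    # Sort both arrays by group label so each group is contiguous.
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    # Start index of each run of equal labels.
    group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
    return np.multiply.reduceat(values, group_changes)

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([3, 1, 1, 2, 3, 1])  # deliberately unsorted

result = prod_by_group(values, groups)
print(result)  # products for groups 1, 2, 3, in that order
```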
4
  • 5
    You might want to look at pandas, which is built on Numpy and provides very useful functions for grouping data and computing aggregate functions over the groups. Commented Nov 16, 2012 at 19:49
  • 1
    @BrenBarn that's not particularly helpful, we could at least narrow it down to a function that might be similar to this case. Commented Nov 16, 2012 at 20:05
  • The functions are called groupby and aggregate, but my point is that if you want to do this a lot, it pays to read the pandas documentation and learn to use pandas as a whole, because it makes this kind of thing easy with its entire setup of data structures. Commented Nov 16, 2012 at 20:08
  • Thanks for the link. I'll probably try out pandas some day when it shows up in an ubuntu repository for whatever version of ubuntu I'm using. Commented Nov 16, 2012 at 20:56

5 Answers

3

If your groups are already sorted (if they are not, sort them first with np.argsort), you can do this efficiently with the reduceat method of ufuncs:

# you could do the group_changes somewhat faster if you care a lot
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
Vprods = np.multiply.reduceat(values, group_changes)

Or use mgilson's answer if you have few groups. With many groups, this approach is much more efficient, since you avoid building a boolean index over every element of the original array for every group. Plus, with reduceat you avoid slicing in a Python loop.

Of course pandas does these operations conveniently.

Edit: Sorry, I had prod in there; the ufunc is multiply. You can use this method for any binary ufunc, which means it works for basically all numpy functions that operate elementwise on two input arrays (e.g. multiply normally multiplies two arrays elementwise, add adds them, and likewise maximum/minimum, etc.).
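
To illustrate the point about other binary ufuncs, here is a small sketch that reuses the same group_changes indices with add, maximum and minimum. It assumes the groups array is already sorted, as in the question's example; the names sums, maxs and mins are just for illustration.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])  # assumed to be sorted already

# Start index of each run of equal labels: here [0, 3, 4].
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))

# The same indices work with any binary ufunc's reduceat.
sums = np.add.reduceat(values, group_changes)
maxs = np.maximum.reduceat(values, group_changes)
mins = np.minimum.reduceat(values, group_changes)

print(sums, maxs, mins)
```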


1 Comment

Tested this one out, it's giving me the right answer with about a 50X speedup over my for-loop version... Awesome!
1

First set up a mask for the groups, expanding the groups into another dimension:

mask = (groups == np.unique(groups).reshape(-1, 1))
mask
array([[ True,  True,  True, False, False, False],
       [False, False, False,  True, False, False],
       [False, False, False, False,  True,  True]], dtype=bool)

Now we multiply with values:

mask * values
array([[1, 2, 3, 0, 0, 0],
       [0, 0, 0, 4, 0, 0],
       [0, 0, 0, 0, 5, 6]])

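Now you could take the product along axis 1, except for those zeros. That is easy to fix by filling the masked-out positions with 1 instead (selecting on the mask itself rather than on the product also keeps any genuine zeros in values):

np.prod(np.where(mask, values, 1), axis=1)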
array([ 6,  4, 30])
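
A self-contained sketch of the mask approach, with a zero placed in the input to show why it may be safer to select on the mask than on the masked product. The input values here are made up for illustration. Note that the mask has one row per group and one column per element, so memory grows with the number of groups times the number of values.

```python
import numpy as np

values = np.array([1, 0, 3, 4, 5, 6])  # contains a genuine zero
groups = np.array([1, 1, 1, 2, 3, 3])

# One boolean row per unique group, one column per element.
mask = groups == np.unique(groups).reshape(-1, 1)

# Keep values where the mask is True, use the neutral element 1 elsewhere.
prods = np.prod(np.where(mask, values, 1), axis=1)

print(prods)  # the first group's product is 0 because of the zero
```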


1

As suggested in the comments, you can also use the pandas module. Using the groupby() function, this task becomes a one-liner:

import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])

df = pd.DataFrame({'values': values, 'groups': groups})

So df then looks as follows:

   groups  values
0       1       1
1       1       2
2       1       3
3       2       4
4       3       5
5       3       6

Now you can groupby() the groups column and apply numpy's prod() function to each of the groups like this:

df.groupby(groups)['values'].apply(np.prod)

which gives you the desired output:

1     6
2     4
3    30
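
As a possible variation, the same result can be sketched with the group-by object's built-in prod() method, grouping by column name rather than by the external array. This is an alternative spelling, not the code from the answer above.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])

df = pd.DataFrame({'values': values, 'groups': groups})

# Group by the 'groups' column and take the product of 'values' per group.
prods = df.groupby('groups')['values'].prod()

# The index holds the group labels; to_numpy() gives a plain array of products.
print(prods.to_numpy())
```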


0

Well, I doubt this is a great answer, but it's the best I can come up with:

np.array([np.prod(values[np.flatnonzero(groups == x)]) for x in np.unique(groups)])

2 Comments

This works, but takes about 150% the time as my non-vectorized for-loop version
@Nate -- Well that's a bummer :).
0

It's not a numpy solution, but it's fairly readable (I find that sometimes numpy solutions aren't!):

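from functools import reduce  # reduce is no longer a builtin in Python 3
from itertools import groupby
from operator import itemgetter, mul

# itertools.groupby only merges consecutive equal keys, so groups must be sorted
grouped = groupby(zip(groups, values), itemgetter(0))
prods = [reduce(mul, map(itemgetter(1), vals), 1) for key, vals in grouped]
print(prods)
# [6, 4, 30]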
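
Since itertools.groupby only merges runs of consecutive equal keys, unsorted input needs a sort first. Here is a minimal pure-Python sketch of that, assuming Python 3.8+ for math.prod; the function name prod_by_group_py and the sample input are made up for illustration.

```python
from itertools import groupby
from math import prod
from operator import itemgetter

def prod_by_group_py(values, groups):
    # Sort (group, value) pairs by group so equal labels become consecutive.
    pairs = sorted(zip(groups, values), key=itemgetter(0))
    # Multiply the values within each run of equal labels.
    return [prod(v for _, v in grp) for _, grp in groupby(pairs, key=itemgetter(0))]

result = prod_by_group_py([1, 2, 3, 4, 5, 6], [3, 1, 1, 2, 3, 1])
print(result)  # products for groups 1, 2, 3, in that order
```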

2 Comments

I had to do "from operator import itemgetter, mul" because I was getting this error: "NameError: global name 'mul' is not defined" After doing that, it works great but only gave me about a 7X speedup over my for-loop version.
@Nate thanks - mea culpa on forgetting mul (well, copying/paste muck up) - edited for future reference
