Product of array elements by group in numpy (Python)

Question

I'm trying to build a function that returns the products of subsets of array elements. Basically I want to build a prod_by_group function that does this:

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])

Vprods = prod_by_group(values, groups)

And the resulting Vprods should be:

Vprods
array([6, 4, 30])

There's a great answer here for sums of elements, and I think the solution should be similar: https://stackoverflow.com/a/4387453/1085691

I tried taking the log first, then sum_by_group, then exp, but ran into numerical issues.
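The question does not say which numerical issues came up, so the following is only one possible illustration, not a reconstruction of the original attempt. It sums logs per group with np.bincount (an assumption; the linked sum_by_group answer may work differently) and uses a made-up input with a negative entry. Negative values have no real logarithm, and even for positive inputs the result is a float that is only approximately equal to the exact product.

```python
import numpy as np

# Hypothetical input: same shape as the question's example, but with a negative entry.
values = np.array([-1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
groups = np.array([1, 1, 1, 2, 3, 3])

# Map each group label to a dense index 0..n_groups-1.
_, idx = np.unique(groups, return_inverse=True)

# log(-1) is undefined for reals, so NumPy yields nan (warning silenced here).
with np.errstate(invalid="ignore"):
    log_sums = np.bincount(idx, weights=np.log(values))

prods_via_log = np.exp(log_sums)
print(prods_via_log)  # first entry is nan; the others are floats close to 4 and 30
```

Zeros cause a similar problem, since log(0) is -inf, and the round trip through log and exp generally loses the exactness of integer products.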

There are some other similar answers here for min and max of elements by group: https://stackoverflow.com/a/8623168/1085691

Edit: Thanks for the quick answers! I'm trying them out. I should add that I want it to be as fast as possible (that's the reason I'm trying to get it in numpy in some vectorized way, like the examples I gave).

Edit: I evaluated all the answers given so far, and the best one is given by @seberg below. Here's the full function that I ended up using:

def prod_by_group(values, groups):
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
    return np.multiply.reduceat(values, group_changes)
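
As a usage sketch, the same approach can also return the group labels, which helps when the input is unsorted and the output order is not obvious. The name prod_by_group_with_labels, the stable sort, and the shuffled sample data are assumptions added for illustration; they are not part of the original function.

```python
import numpy as np

def prod_by_group_with_labels(values, groups):
    """Hypothetical variant of prod_by_group that also returns the labels."""
    # kind="stable" is an added assumption; any sort makes groups contiguous.
    order = np.argsort(groups, kind="stable")
    groups = groups[order]
    values = values[order]
    # Index of the first element of each contiguous group.
    group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
    return groups[group_changes], np.multiply.reduceat(values, group_changes)

# Same data as in the question, shuffled.
values = np.array([5, 1, 4, 2, 6, 3])
groups = np.array([3, 1, 2, 1, 3, 1])

labels, prods = prod_by_group_with_labels(values, groups)
print(labels)  # [1 2 3]
print(prods)   # [ 6  4 30]
```

The products come back in sorted label order, not in order of first appearance.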

You might want to look at pandas, which is built on NumPy and provides very useful functions for grouping data and computing aggregate functions over the groups.
– BrenBarn, Commented Nov 16, 2012 at 19:49
@BrenBarn that's not particularly helpful, we could at least narrow it down to a function that might be similar to this case.
– enjoylife, Commented Nov 16, 2012 at 20:05
The functions are called groupby and aggregate, but my point is that if you want to do this a lot, it pays to read the pandas documentation and learn to use pandas as a whole, because it makes this kind of thing easy with its entire setup of data structures.
– BrenBarn, Commented Nov 16, 2012 at 20:08
Thanks for the link. I'll probably try out pandas some day when it shows up in an Ubuntu repository for whatever version of Ubuntu I'm using.
– Nate, Commented Nov 16, 2012 at 20:56

seberg · Accepted Answer · 2012-11-16 20:14:02Z

3

If your groups are already sorted, you can do this with the reduceat method of ufuncs. If they are not sorted, sort them first (for example with np.argsort) so that each group is contiguous:

# you could do the group_changes somewhat faster if you care a lot
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
Vprods = np.multiply.reduceat(values, group_changes)

Alternatively, use mgilson's answer if you have only a few groups. With many groups this approach is much more efficient, because it avoids building a boolean index over the entire array for every group, and reduceat avoids slicing in a Python loop.

Of course pandas does these operations conveniently.

Edit: Sorry, I had prod in there; the ufunc is multiply. This method works with any binary ufunc, that is, with basically any NumPy function that operates elementwise on two input arrays (multiply normally multiplies two arrays elementwise, add adds them, and likewise maximum, minimum, etc.).
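
A minimal sketch of that point, reusing the question's sample data and the same group_changes indices with three other binary ufuncs (the variable names sums, maxs and mins are chosen here for illustration):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])  # must already be sorted

# Index of the first element of each contiguous group.
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))

sums = np.add.reduceat(values, group_changes)      # per-group sum
maxs = np.maximum.reduceat(values, group_changes)  # per-group maximum
mins = np.minimum.reduceat(values, group_changes)  # per-group minimum

print(sums)  # [ 6  4 11]
print(maxs)  # [3 4 6]
print(mins)  # [1 4 5]
```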

edited Nov 16, 2012 at 20:14

answered Nov 16, 2012 at 20:04

seberg


1 Comment

Nate Over a year ago

Tested this one out, it's giving me the right answer with about a 50X speedup over my for-loop version... Awesome!

cronos · Answer · 2012-11-16 20:03:52Z

1

First set up a mask for the groups by expanding the unique group labels along a second dimension:

mask = groups == np.unique(groups).reshape(-1, 1)
mask
array([[ True,  True,  True, False, False, False],
       [False, False, False,  True, False, False],
       [False, False, False, False,  True,  True]])

Now multiply by values:

mask * values
array([[1, 2, 3, 0, 0, 0],
       [0, 0, 0, 4, 0, 0],
       [0, 0, 0, 0, 5, 6]])

The masked-out entries are zeros, which would make every product along axis 1 zero, so replace them with 1 before taking the product. Selecting on mask itself, rather than on mask * values, also preserves genuine zeros in the data:

np.prod(np.where(mask, values, 1), axis=1)
array([ 6,  4, 30])

answered Nov 16, 2012 at 20:03

cronos


Cleb · Answer · 2016-06-30 23:11:50Z

1

As suggested in the comments, you can also use pandas. With the groupby() function, this task becomes a one-liner:

import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])

df = pd.DataFrame({'values': values, 'groups': groups})

So df then looks as follows:

   groups  values
0       1       1
1       1       2
2       1       3
3       2       4
4       3       5
5       3       6

Now you can groupby() the groups column and apply NumPy's prod() function to each of the groups like this:

df.groupby('groups')['values'].apply(np.prod)

which gives you the desired output:

1     6
2     4
3    30
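
As a variation on the above (not part of the original answer), pandas group objects also have a built-in prod() aggregation, so the apply(np.prod) call is not strictly needed. The sketch below rebuilds the same DataFrame and converts the result back to NumPy arrays; the variable name prods is an assumption for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
df = pd.DataFrame({'values': values, 'groups': groups})

# Built-in aggregation; the groups do not need to be sorted beforehand.
prods = df.groupby('groups')['values'].prod()

print(prods.index.to_numpy())  # group labels: [1 2 3]
print(prods.to_numpy())        # products:     [ 6  4 30]
```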

answered Jun 30, 2016 at 23:11

Cleb


mgilson · Answer · 2012-11-16 20:02:31Z

0

Well, I doubt this is a great answer, but it's the best I can come up with:

np.array([np.prod(values[np.flatnonzero(groups == x)]) for x in np.unique(groups)])

answered Nov 16, 2012 at 20:02

mgilson


2 Comments

Nate Over a year ago

This works, but takes about 150% the time as my non-vectorized for-loop version

mgilson Over a year ago

@Nate -- Well that's a bummer :).

Jon Clements · Answer · 2012-11-17 10:14:00Z

0

It's not a numpy solution, but it's fairly readable (I find that sometimes numpy solutions aren't!):

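from functools import reduce  # reduce is no longer a builtin in Python 3
from itertools import groupby
from operator import itemgetter, mul

# groupby only merges adjacent items, so groups must already be sorted
grouped = groupby(zip(groups, values), itemgetter(0))
prods = [reduce(mul, map(itemgetter(1), vals), 1) for key, vals in grouped]
print(prods)
# [6, 4, 30] (NumPy 2 shows the entries as np.int64(...) when the inputs are arrays)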

edited Nov 17, 2012 at 10:14

answered Nov 16, 2012 at 20:07

Jon Clements


2 Comments

Nate Over a year ago

I had to do "from operator import itemgetter, mul" because I was getting this error: "NameError: global name 'mul' is not defined" After doing that, it works great but only gave me about a 7X speedup over my for-loop version.

Jon Clements Over a year ago

@Nate thanks - mea culpa on forgetting mul (well, copying/paste muck up) - edited for future reference
