I'd like to take the average of one vector based on grouping information in another vector. The two vectors are the same length. I've created a minimal example below based on averaging predictions for each user. How do I do that in NumPy?

       >>> pred
           [ 0.99  0.23  0.11  0.64  0.45  0.55 0.76  0.72  0.97 ] 
       >>> users
           ['User2' 'User3' 'User2' 'User3' 'User0' 'User1' 'User4' 'User4' 'User4']
  • Your two arrays are different lengths... Also are you looking for a solution in NumPy or (a much easier solution) in Pandas? Commented Mar 24, 2015 at 22:24
  • Sorry about that, they're now the same length. I'd prefer to stay in NumPy as I'm just learning Python and have decided to postpone Pandas for a little while. Commented Mar 24, 2015 at 22:32

3 Answers

A 'pure numpy' solution might use a combination of np.unique and np.bincount:

import numpy as np

pred = [0.99,  0.23,  0.11,  0.64,  0.45,  0.55, 0.76,  0.72,  0.97]
users = ['User2', 'User3', 'User2', 'User3', 'User0', 'User1', 'User4',
         'User4', 'User4']

# assign integer indices to each unique user name, and get the total
# number of occurrences for each name
unames, idx, counts = np.unique(users, return_inverse=True, return_counts=True)

# now sum the values of pred corresponding to each index value
sum_pred = np.bincount(idx, weights=pred)

# finally, divide by the number of occurrences for each user name
mean_pred = sum_pred / counts

print(unames)
# ['User0' 'User1' 'User2' 'User3' 'User4']

print(mean_pred)
# [ 0.45        0.55        0.55        0.435       0.81666667]
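To make the role of each intermediate array clearer, here is a small sketch that prints what np.unique and np.bincount return for the question's data. It only re-runs the steps above with the inputs converted to arrays; the printed values in the comments are what I would expect, with the sums subject to the usual floating-point rounding.

```python
import numpy as np

pred = np.array([0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97])
users = np.array(['User2', 'User3', 'User2', 'User3', 'User0', 'User1',
                  'User4', 'User4', 'User4'])

unames, idx, counts = np.unique(users, return_inverse=True, return_counts=True)

# idx[i] is the position of users[i] within the sorted unames array,
# so unames[idx] reconstructs the original users array
print(idx)     # [2 3 2 3 0 1 4 4 4]

# counts[k] is how many times unames[k] occurs in users
print(counts)  # [1 1 2 2 3]

# bincount adds pred[i] into bin idx[i], giving one sum per unique user
sum_pred = np.bincount(idx, weights=pred)
print(sum_pred)  # approximately [0.45 0.55 1.1 0.87 2.45]
```

Dividing sum_pred by counts then gives the per-user means shown above.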

If you have pandas installed, DataFrames have some very nice methods for grouping and summarizing data:

import pandas as pd

df = pd.DataFrame({'name':users, 'pred':pred})

print(df.groupby('name').mean())
#            pred
# name           
# User0  0.450000
# User1  0.550000
# User2  0.550000
# User3  0.435000
# User4  0.816667

2 Comments

I don't really understand what you mean by "unique label for each user" - in your example it seems that User2 would have corresponding label values of both 0 and 1. Also, on SO you should post follow-up questions separately (you can include a link to the original question in order to provide context).
Okay, I'll do that. Thanks.
If you want to stick to numpy, the simplest approach is to use np.unique and np.bincount:

>>> pred = np.array([0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97])
>>> users = np.array(['User2', 'User3', 'User2', 'User3', 'User0', 'User1',
...                   'User4', 'User4', 'User4'])
>>> unq, idx, cnt = np.unique(users, return_inverse=True, return_counts=True)
>>> avg = np.bincount(idx, weights=pred) / cnt
>>> unq
array(['User0', 'User1', 'User2', 'User3', 'User4'],
      dtype='|S5')
>>> avg
array([ 0.45      ,  0.55      ,  0.55      ,  0.435     ,  0.81666667])
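If this is needed in more than one place, the same two calls can be wrapped in a small helper. This is only a sketch: the name group_mean and the choice to return a dict are my own assumptions, not part of NumPy, and it assumes 1-D inputs of equal length (the mismatch raised in the comments on the question is rejected explicitly).

```python
import numpy as np

def group_mean(keys, values):
    """Return a dict mapping each unique key to the mean of its values.

    Hypothetical helper, not a NumPy function; assumes 1-D inputs.
    """
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    if keys.ndim != 1 or keys.shape != values.shape:
        raise ValueError("keys and values must be 1-D and the same length")
    # unq: sorted unique keys, idx: group index per element, cnt: group sizes
    unq, idx, cnt = np.unique(keys, return_inverse=True, return_counts=True)
    # per-group sum divided by per-group count
    avg = np.bincount(idx, weights=values) / cnt
    return dict(zip(unq.tolist(), avg.tolist()))

means = group_mean(
    ['User2', 'User3', 'User2', 'User3', 'User0', 'User1',
     'User4', 'User4', 'User4'],
    [0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97],
)
print(means['User3'])  # approximately 0.435
```

Returning a dict is convenient for lookups by user name; keeping unq and avg as arrays, as in the answer above, is the better fit if further vectorized work follows.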

A compact solution is to use numpy_indexed (disclaimer: I am its author), which implements a solution similar to the vectorized one proposed by Jaime, but with a cleaner interface and more tests:

import numpy_indexed as npi
npi.group_by(users).mean(pred)
