Numpy Mean Structured Array

Question

Suppose that I have a structured array of students (strings) and test scores (ints), where each entry is the score that a specific student received on a specific test. Each student has multiple entries in this array, naturally.

Example

import numpy
grades = numpy.array([('Mary', 96), ('John', 94), ('Mary', 88), ('Edgar', 89), ('John', 84)],
                     dtype=[('student', 'U50'), ('score', 'i')])

print(grades)
#[('Mary', 96) ('John', 94) ('Mary', 88) ('Edgar', 89) ('John', 84)]

How do I easily compute the average score of each student? In other words, how do I take the mean of the array in the 'score' dimension? I'd like to do

grades.mean('score')

and have Numpy return

[('Mary', 92), ('John', 89), ('Edgar', 89)]

but Numpy complains

TypeError: an integer is required

Is there a Numpy-esque way to do this easily? I think it might involve taking a view of the structured array with a different dtype. Any help would be appreciated. Thanks.

Edit

>>> grades = numpy.zeros(5, dtype=[('student', 'U50'), ('score', 'i'), ('testid', 'i')])
>>> grades[0] = ('Mary', 96, 1)
>>> grades[1] = ('John', 94, 1)
>>> grades[2] = ('Mary', 88, 2)
>>> grades[3] = ('Edgar', 89, 1)
>>> grades[4] = ('John', 84, 2)
>>> numpy.mean(grades, 'testid')
TypeError: an integer is required
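
A note on the error itself: the second positional argument of mean() is the axis, so a field name passed there is rejected as a non-integer. The sketch below reproduces that and shows that selecting the field first yields a plain integer array that mean() accepts. It uses the sample data from the question with a 'U50' name field (an assumption, so that names are ordinary strings), and the exact wording of the error message may differ between NumPy versions.

```python
import numpy as np

# Sample data from the question ('U50' is an assumption so names are str).
grades = np.array(
    [('Mary', 96), ('John', 94), ('Mary', 88), ('Edgar', 89), ('John', 84)],
    dtype=[('student', 'U50'), ('score', 'i')],
)

# The second positional argument of mean() is `axis`, so a field name fails.
try:
    grades.mean('score')
    raised = False
except TypeError:
    raised = True

# Selecting the field first gives an ordinary integer array.
overall = grades['score'].mean()
print(raised, overall)
```

This only gives the mean over all rows; grouping by student still needs one of the approaches in the answers.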

ecatmur · Accepted Answer · 2012-08-16 14:38:54Z

4

NumPy isn't designed to be able to group rows together and apply aggregate functions to those groups. You could:

use itertools.groupby and reconstruct the array;
use Pandas, which is based on NumPy and is great at grouping; or
add another dimension to the array for the test id (so this case would be a 2x3 array, because it looks like there were two tests).

Here's the itertools solution, but as you can see it's quite complicated and inefficient. I'd recommend one of the other two methods.

from itertools import groupby
from operator import itemgetter
import numpy as np
np.array([(k, np.array(list(g), dtype=grades.dtype).view(np.recarray)['score'].mean())
          for k, g in groupby(np.sort(grades, order='student').view(np.recarray),
                              itemgetter('student'))], dtype=grades.dtype)
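
For comparison, the Pandas option from the list above can be sketched as follows. This is a minimal illustration rather than part of the original answer; it assumes pandas is installed and uses a 'U50' name field so that names are ordinary strings.

```python
import numpy as np
import pandas as pd

# Sample data from the question ('U50' is an assumption so names are str).
grades = np.array(
    [('Mary', 96), ('John', 94), ('Mary', 88), ('Edgar', 89), ('John', 84)],
    dtype=[('student', 'U50'), ('score', 'i')],
)

# A structured array converts to a DataFrame with one column per field.
df = pd.DataFrame(grades)

# Group rows by student and average the scores within each group.
# sort=False keeps students in order of first appearance.
means = df.groupby('student', sort=False)['score'].mean()
print(means)
```

The result is a Series indexed by student name, so means['Mary'] looks up a single student's average.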

edited Aug 16, 2012 at 14:38

answered Aug 16, 2012 at 14:30


6 Comments

Jeremy Over a year ago

I don't understand how adding another dimension would help.

ecatmur Over a year ago

@Jeremy the extra dimension is for the test id. So for 3 students and 2 tests you have a 2x3 array.

Jeremy Over a year ago

Right. As it happens, in my program I do have a testid dimension already. How does that help me?

ecatmur Over a year ago

@Jeremy then you can take the mean of the 'score' field over the test-id axis, e.g. grades['score'].mean(axis=1) when the array is laid out as students x tests.

ecatmur Over a year ago

@Jeremy that would be `numpy.zeros((3, 2), dtype=[('student', 'U50'), ('score', 'i')])` and then grades[0, 0] = ('Mary', 96) etc.
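
The layout described in these comments might look like the sketch below, with one row per student and one column per test. Two things here are assumptions rather than part of the comments: the score field is a float so that Edgar's missing second test can be stored as NaN, and np.nanmean is used so that the missing entry is skipped.

```python
import numpy as np

# Rows are students, columns are tests (students x tests).
# Float scores are an assumption so a missing test can be NaN.
grades = np.zeros((3, 2), dtype=[('student', 'U50'), ('score', 'f8')])
grades[0, 0] = ('Mary', 96)
grades[0, 1] = ('Mary', 88)
grades[1, 0] = ('John', 94)
grades[1, 1] = ('John', 84)
grades[2, 0] = ('Edgar', 89)
grades[2, 1] = ('Edgar', np.nan)  # Edgar took only one test

# Average over the test axis, ignoring missing entries.
means = np.nanmean(grades['score'], axis=1)
names = grades['student'][:, 0]
print(list(zip(names.tolist(), means.tolist())))
```

With this layout the grouping is implicit in the array shape, which is why a plain axis reduction is enough.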


Jeremy · Accepted Answer · 2012-08-20 21:05:21Z

1

matplotlib.mlab.rec_groupby was exactly what I was looking for.

answered Aug 20, 2012 at 21:05


gg349 · Accepted Answer · 2012-08-16 17:55:01Z

0

A slightly faster and simpler itertools-based solution that avoids view() is the following, where e is the structured array (grades in the question), groupby comes from itertools, and argsort comes from numpy:

[(k, e['score'][list(g)].mean()) for k, g in groupby(argsort(e), e['student'].__getitem__)]

This is the same idea as ecatmur's, but it works with indices, using argsort() instead of sort().
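
The index-based idea can also be pushed into NumPy itself. The sketch below is a different technique from the one in this answer: it uses np.unique with return_inverse to label each row with a group index, then np.bincount to sum and count per group. The 'U50' name field is an assumption so that names are ordinary strings.

```python
import numpy as np

# Sample data from the question ('U50' is an assumption so names are str).
grades = np.array(
    [('Mary', 96), ('John', 94), ('Mary', 88), ('Edgar', 89), ('John', 84)],
    dtype=[('student', 'U50'), ('score', 'i')],
)

# names: sorted unique student names; inverse: group index of each row.
names, inverse = np.unique(grades['student'], return_inverse=True)

# Per-group sum of scores divided by per-group row count gives the mean.
sums = np.bincount(inverse, weights=grades['score'])
counts = np.bincount(inverse)
means = sums / counts

result = list(zip(names.tolist(), means.tolist()))
print(result)
```

Note that np.unique returns the names in sorted order, not in order of first appearance.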

answered Aug 16, 2012 at 17:55


CPBL · Accepted Answer · 2012-10-20 19:20:32Z

0

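collapseByField(grades, 'student') gives what you want, after defining:

def collapseByField(e, collapsefield, keepFields=None, agg=None):
    import numpy as np
    # Needs a Matplotlib release that still provides mlab.rec_groupby
    # (it is not available in recent releases).
    from matplotlib import mlab
    assert isinstance(e, np.ndarray)  # structured array
    if agg is None:
        agg = np.mean
    if keepFields is None:
        keepFields = [n for n in e.dtype.names if n != collapsefield]
    # One (input field, aggregate function, output field) triple per kept field
    newf = [(n, agg, n) for n in keepFields]
    return mlab.rec_groupby(e, [collapsefield], newf)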

answered Oct 20, 2012 at 19:20
