pythonic way to aggregate arrays (numpy or not)

Question

I would like to write a clean function to aggregate data in an array (it is a numpy record array, but that does not change anything).

You have an array of data that you want to aggregate along one axis: for example, an array with dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)], and you want the mean income per job.

I wrote this function, which in the example would be called as aggregate(data, 'job', 'income', mean):

def aggregate(data, key, value, func):
    data_per_key = {}
    for k,v in zip(data[key], data[value]):
        if k not in data_per_key.keys():
            data_per_key[k]=[]
        data_per_key[k].append(v)
    return [(k,func(data_per_key[k])) for k in data_per_key.keys()]

The problem is that I do not find it very nice; I would like to have it in one line. Do you have any ideas?

Thanks for your answers, Louis

PS: I would like to keep func in the call so that you can also ask for the median, minimum...

I don't know numpy, but your dtype does seem to have a problem with the brackets.
– int3, Commented Dec 1, 2009 at 22:27
The parentheses don't match. Makes for some extra confusion.
– Skylar Saveland, Commented Dec 1, 2009 at 22:51
I don't understand your comment that you "would like to have it in one line". When you call the function, that will be one line. Does it matter how many lines the function itself has? Anyway, I think your best bet is to use defaultdict as the answers say.
– steveha, Commented Dec 1, 2009 at 23:51
Sorry for the mismatch; I changed the names and types to be explicit and forgot some brackets... I meant one line as in the matplotlib.mlab answer.
– Louis, Commented Dec 2, 2009 at 20:14
Michael and I have created a package called numpy-groupies, which includes a function for this. The package is on PyPI.
– dan-man, Commented Jul 6, 2015 at 13:38

Community · Accepted Answer · 2017-05-23 12:13:26Z

5

Perhaps the function you are seeking is matplotlib.mlab.rec_groupby:

import numpy as np
import matplotlib.mlab

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

result=matplotlib.mlab.rec_groupby(data, ('job',), (('income',np.mean,'avg_income'),))

yields

('Digger', 4.0)
('Planter', 2.5)
('Waterer', 3.0)

matplotlib.mlab.rec_groupby returns a recarray:

print(result.dtype)
# [('job', '|S7'), ('avg_income', '<f8')]

You may also be interested in checking out pandas, which has even more versatile facilities for handling group-by operations.

edited May 23, 2017 at 12:13

CommunityBot

answered Dec 2, 2009 at 0:09

unutbu

1 Comment

Louis Over a year ago

that's exactly what I was looking for: the job done in one line! Moreover it's returning directly an array! Perfect!

Hank Gay · Accepted Answer · 2009-12-01 22:51:37Z

5

Your if k not in data_per_key.keys() could be rewritten as if k not in data_per_key, but you can do even better with defaultdict. Here's a version that uses defaultdict to get rid of the existence check:

import collections

def aggregate(data, key, value, func):
    data_per_key = collections.defaultdict(list)
    for k,v in zip(data[key], data[value]):
        data_per_key[k].append(v)

    return [(k,func(data_per_key[k])) for k in data_per_key.keys()]
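A minimal sketch of how the defaultdict version might look when the final comprehension iterates over the dictionary's items directly, so each group is looked up only once. The sample data dict and the use of statistics.mean are assumptions for illustration; a record array indexed by field name should behave the same way.

```python
import collections
import statistics


def aggregate(data, key, value, func):
    """Group data[value] by data[key] and apply func to each group."""
    data_per_key = collections.defaultdict(list)
    for k, v in zip(data[key], data[value]):
        data_per_key[k].append(v)
    # Iterate over (key, values) pairs directly instead of re-indexing the dict.
    return [(k, func(vs)) for k, vs in data_per_key.items()]


# Hypothetical sample data: a plain dict of columns stands in for the record array.
data = {
    "job": ["Digger", "Planter", "Waterer", "Planter", "Digger"],
    "income": [1, 2, 3, 3, 7],
}
result = aggregate(data, "job", "income", statistics.mean)
```

Passing statistics.median or min in place of statistics.mean changes the aggregation without touching the function.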

answered Dec 1, 2009 at 22:51

Hank Gay

3 Comments

John La Rooy Over a year ago

I'd change the last line to return [(k,func(v)) for k,v in data_per_key.items()]

Hank Gay Over a year ago

That's a good call, but I was trying to highlight the defaultdict stuff by making that the only change. Your return is definitely better, though.

Louis Over a year ago

thanks for the defaultdict trick! and also for the final iteration

caiohamamura · Accepted Answer · 2015-06-17 13:14:53Z

2

The best flexibility and readability come from using pandas:

import numpy as np
import pandas

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

df = pandas.DataFrame(data)
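# numeric_only=True skips the non-numeric 'name' column, which recent pandas versions refuse to average
result = df.groupby('job').mean(numeric_only=True)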

This yields:

         income
job
Digger      4.0
Planter     2.5
Waterer     3.0

The pandas DataFrame is a convenient class to work with, and you can convert the result to whatever form you need:

result.to_records()
result.to_dict()
result.to_csv()

And so on...

edited Jun 17, 2015 at 13:14

answered Jul 24, 2014 at 14:55

caiohamamura

2 Comments

Michael Over a year ago

pandas is about an order of magnitude slower than my solution given above. See the speed comparison there.

caiohamamura Over a year ago

@Michael, sorry, I actually didn't mean performance. I'm aware that pandas is not a library that aims for top performance; I myself prefer approaches like bincount for performance. I've edited the original post.

caiohamamura · Accepted Answer · 2017-09-21 14:34:24Z

The best performance is achieved using ndimage.mean from scipy. It is about twice as fast as the accepted answer for this small dataset, and about 3.5 times faster for larger inputs:

import numpy as np
from scipy import ndimage

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

unique = np.unique(data['job'])
result=np.dstack([unique, ndimage.mean(data['income'], data['job'], unique)])

This yields:

array([[['Digger', '4.0'],
        ['Planter', '2.5'],
        ['Waterer', '3.0']]],
      dtype='|S32')

EDIT: with bincount (faster!)

This is about 5x faster than the accepted answer for the small example input; if you repeat the data 100000 times, it is around 8.5x faster:

unique, uniqueInd, uniqueCount = np.unique(data['job'], return_inverse=True, return_counts=True)
# bincount with weights sums the incomes per group; dividing by the counts gives the means
means = np.bincount(uniqueInd, data['income'])/uniqueCount
result = np.dstack([unique, means])
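The bincount approach can also be wrapped in a small function. The sketch below is one possible arrangement, not part of the original answer: the name group_mean is hypothetical, and it returns the keys and the means as two separate arrays so that the means stay numeric instead of being converted to strings by np.dstack.

```python
import numpy as np


def group_mean(keys, values):
    """Return (unique_keys, means): the mean of values for each distinct key."""
    unique, inverse, counts = np.unique(
        keys, return_inverse=True, return_counts=True
    )
    # bincount with weights sums the values that share the same group index.
    sums = np.bincount(inverse, weights=values)
    return unique, sums / counts


data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

jobs, means = group_mean(data['job'], data['income'])
```

This only covers the mean (or, by returning sums, the sum); aggregations such as the median do not reduce to a weighted count and would need a different approach.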

Michael · Accepted Answer · 2022-06-10 18:52:12Z

2

Update 2022:

There is a package that emulates the functionality of MATLAB's accumarray quite well. You can install it via pip install numpy_groupies or find it here:

https://github.com/ml31415/numpy-groupies

edited Jun 10, 2022 at 18:52

answered Jan 12, 2013 at 10:07

Michael

Skylar Saveland · Accepted Answer · 2009-12-01 22:51:01Z

0

http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#dictionary-get-method

should help make it a little prettier, more Pythonic, and possibly more efficient. I'll come back later to check on your progress. Maybe you can edit the function with this in mind? Also see the next couple of sections.
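As a rough sketch of what a dictionary-method idiom could look like in the question's function: the version below uses dict.setdefault to drop the explicit membership check. Choosing setdefault here is an assumption about what fits the function best, not something this answer spells out, and the sample data dict is hypothetical.

```python
def aggregate(data, key, value, func):
    """Group data[value] by data[key] and apply func to each group."""
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        # setdefault returns the list already stored under k,
        # or stores a new empty list and returns that.
        data_per_key.setdefault(k, []).append(v)
    return [(k, func(vs)) for k, vs in data_per_key.items()]


# Hypothetical sample data: a plain dict of columns stands in for the record array.
data = {
    "job": ["Digger", "Planter", "Waterer", "Planter", "Digger"],
    "income": [1, 2, 3, 3, 7],
}
result = aggregate(data, "job", "income", max)
```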

answered Dec 1, 2009 at 22:51

Skylar Saveland
