1

I'm trying to efficiently sum the rows of separate data arrays by their characteristics. Each array has three identifying characteristics (age, year, and cause), and for each (age, year, cause) combination I have 1000 values. I need to add those values to the values from another data array wherever the characteristics match. For now, I'm doing something like this, where each dataset is roughly (80000, 1000):

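import numpy as np
# np.vstack takes a single sequence of arrays, so pass a tuple.
datasets = np.vstack((dataset1, dataset2))
output = {}
for a in ages:
    for y in years:
        for c in causes:
            # Sum the rows matching this (age, year, cause) combination.
            mask = (age == a) & (year == y) & (cause == c)
            output[(a, y, c)] = np.sum(datasets[mask], axis=0)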

However, with 60,000 iterations, this is incredibly slow. The challenge is that the arrays don't necessarily all have the same shape. Any thoughts?

3
  • I thought of the groupby function in matplotlib.mlab, something like matplotlib.mlab.rec_groupby(datasets, groupby=('age', 'year', 'cause'), stats=(('values', np.sum, 'sums'),)), using a structured array with age, year, and cause as fields and values as a field holding an array of length 1000. The problem is that I have not figured out how to pass the axis=0 argument this way, so it currently sums all 1000 values of the matching rows into one total. Commented Sep 12, 2011 at 21:00
  • I found a great result here: stackoverflow.com/questions/7416901/… Commented Sep 17, 2011 at 4:47
  • Ten years after this question was asked, things have changed: there is now a package that does the job for you: github.com/ml31415/numpy-groupies Commented Jun 10, 2022 at 19:10

2 Answers

2

I'd recommend something like accumarray. Your output should be a 3-dimensional data cube where each dimension corresponds to a variable (age, year, cause). Each index in each dimension corresponds to a unique value in your input lists. You can then use something like this cookbook example to accumulate the datasets variable into the appropriate bins using age, year, and cause.
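As a rough sketch of the accumulation idea in plain NumPy (not the cookbook recipe itself), each key array can be converted to integer codes with np.unique, the codes combined into one bin index with np.ravel_multi_index, and the rows summed into bins with np.add.at. The function name group_sum and the toy data are assumptions made for illustration.

```python
import numpy as np

def group_sum(values, *keys):
    """Sum rows of `values` that share the same combination of key values.

    Hypothetical helper: returns the unique values per key and a data cube
    of shape (n_key1, n_key2, ..., values.shape[1]).
    """
    uniques, codes = [], []
    for k in keys:
        # Map each key to integer codes 0..n-1, one code per unique value.
        u, inv = np.unique(k, return_inverse=True)
        uniques.append(u)
        codes.append(inv.reshape(-1))
    shape = tuple(len(u) for u in uniques)
    # Collapse the per-key codes into a single flat bin index per row.
    flat = np.ravel_multi_index(tuple(codes), shape)
    out = np.zeros((int(np.prod(shape)), values.shape[1]), dtype=values.dtype)
    # Unbuffered add, so rows that land in the same bin are all accumulated.
    np.add.at(out, flat, values)
    return uniques, out.reshape(shape + (values.shape[1],))

# Toy data: 5 rows with 2 values each (the question has ~1000 per row).
age = np.array([1, 1, 2, 2, 1])
year = np.array([2000, 2000, 2000, 2001, 2000])
cause = np.array([0, 0, 1, 1, 1])
values = np.arange(10.0).reshape(5, 2)

(ages, years, causes), cube = group_sum(values, age, year, cause)
print(cube.shape)
```

Combinations that never occur stay at zero in the cube. If most combinations are empty, a dense cube may waste memory, and keeping the flat bin index with only the occupied bins could be a better fit.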

You might also consider using a proper relational database. They're quite fast at this sort of thing, and Python ships with sqlite3 in the standard library. Unfortunately, there's a rather steep learning curve if you've never worked with a relational database before. You'll want to use the grouping and aggregate functionality (GROUP BY with SUM).
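A minimal sketch of the database route, assuming the data is reshaped into a long table with one row per (age, year, cause, draw) value. The table name, column names, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical long-format rows: (age, year, cause, draw, value).
rows = [
    (30, 2000, "a", 0, 1.0),
    (30, 2000, "a", 0, 2.5),
    (30, 2000, "a", 1, 4.0),
    (40, 2001, "b", 0, 3.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (age INTEGER, year INTEGER, cause TEXT, "
    "draw INTEGER, value REAL)"
)
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", rows)

# Sum the values within each (age, year, cause, draw) group.
query = """
    SELECT age, year, cause, draw, SUM(value)
    FROM data
    GROUP BY age, year, cause, draw
    ORDER BY age, year, cause, draw
"""
totals = conn.execute(query).fetchall()
conn.close()

for row in totals:
    print(row)
```

With 1000 values per combination the long table gets large, so whether this beats an in-memory NumPy approach would need to be measured on the real data.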


0

Here is a link to a great answer to this question:

Summing Arrays by Characteristics in Python
