
I've got a 2-row array called C like this:

from numpy import *
A = array([1, 2, 3, 4, 5])
B = array([50, 40, 30, 20, 10])
C = vstack((A, B))

I want to take all the columns in C where the value in the first row falls between i and i+2, and average them. I can do this with just A no problem:

i = 0
A_avg = []

while i < 6:
    selection = A[logical_and(A >= i, A < i+2)]
    A_avg.append(mean(selection))
    i += 2

then A_avg is:

[1.0,2.5,4.5]

I want to carry out the same process with my two-row array C, but I want to take the average of each row separately, while doing it in a way that's dictated by the first row. For example, for C, I want to end up with a 2 x 3 array that looks like:

[[1.0,2.5,4.5],
 [50,35,15]]

Where the first row is A averaged in blocks between i and i+2 as before, and the second row is B averaged in the same blocks as A, regardless of the values it has. So the first entry is unchanged, the next two get averaged together, and the next two get averaged together, for each row separately. Anyone know of a clever way to do this? Many thanks!

  • Convert the averaging procedure to a matrix multiplication. Examine the first row and calculate the matrix. Then multiply this matrix by the entire data matrix. Done correctly, this can create the same kind of averages made here. Commented Apr 25, 2014 at 16:46
  • So I should get an average for the first row, and multiply it by the matrix containing both data rows? Sorry, I'm not sure if I follow you completely. Commented Apr 25, 2014 at 17:37
  • Depending on array size, averaging via matrix multiplication sounds like an elegant option. When the data is large-scale, sparse matrices can be used. As a matter of fact, the way the problem is posed at the moment, you either need to do binary masking or matrix multiplication; no optimization due to structure is possible. So thumbs up for matrix multiplication if the problem is as general as described. Question: what is the nature of A? Is it truly an array of consecutive integers, or something much more general, with different intervals, and unsorted? Commented Apr 25, 2014 at 18:42
  • @eickenberg, in my case, A is a big 1 x 96100 array of steadily increasing floats, that increases more slowly as you go down the array. B is a 1 x 96100 array of unsorted very small numbers (ie, 1.2367*10**(-22)). Would that be cause for the matrix multiplication method? Commented Apr 25, 2014 at 18:49
  • OK, I was thinking that there would be more rows than just B. In this case it seems silly to build a matrix with information you could have applied to the vectors directly. As a matter of fact, this observation is probably general. The result will always be the same, no matter how small or big your numbers are. My initial question was more about data dimensions. Commented Apr 25, 2014 at 19:31
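For reference, the matrix-multiplication idea from this comment thread can be sketched as follows. This is my own illustration rather than the commenter's code, assuming the same bins 0–2, 2–4, 4–6 as in the question:

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])
B = np.array([50, 40, 30, 20, 10])
C = np.vstack((A, B))

edges = np.arange(0, 7, 2)                        # bin edges 0, 2, 4, 6
# membership[k, j] is True when A[k] falls in bin j
membership = (A[:, None] >= edges[:-1]) & (A[:, None] < edges[1:])
W = membership / membership.sum(axis=0)           # each column sums to 1
D = C @ W                                         # (2, 5) @ (5, 3) -> per-bin means
```

For large data with many bins, W is mostly zeros, which is where a `scipy.sparse` matrix would pay off, as suggested in the comments.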

1 Answer


I hope this is not too clever. TIL boolean indexing does not broadcast, so I had to manually do the broadcasting. Let me know if anything is unclear.

import numpy as np
A = [1,2,3,4,5]
B = [50,40,30,20,10]
C = np.vstack((A,B))

i = np.arange(0, 6, 2)[:, None]
selections = np.logical_and(A >= i, A < i+2)[None]

D, selections = np.broadcast_arrays(C[:, None], selections)
D = D.astype(float)     # cast to float so we can use np.nan; also makes a copy, so the broadcast views no longer share memory
D[~selections] = np.nan # exclude these elements from mean

D = np.nanmean(D, axis=-1)

Then,

>>> D
array([[  1. ,   2.5,   4.5],
       [ 50. ,  35. ,  15. ]])

Another way, using np.histogram to bin your data. This may be faster for large arrays, but is only useful for a few rows, since a histogram must be computed with different weights for each row:

bins = np.arange(0, 7, 2)     # include the end
n = np.histogram(A, bins)[0]  # number of columns in each bin
a_mean = np.histogram(A, bins, weights=A)[0]/n
b_mean = np.histogram(A, bins, weights=B)[0]/n
D = np.vstack([a_mean, b_mean])
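If you ever do have many rows, the same pattern generalizes with one histogram call per row. The loop here is my addition, not part of the original answer:

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])
rows = np.vstack((A, [50, 40, 30, 20, 10]))      # stack as many data rows as needed
bins = np.arange(0, 7, 2)                        # include the end
n = np.histogram(A, bins)[0]                     # number of columns in each bin
# one weighted histogram per row, each normalized by the bin counts
D = np.vstack([np.histogram(A, bins, weights=r)[0] / n for r in rows])
```

This is exactly the per-row cost the answer warns about: the binning of A is reused, but the weighted sum is recomputed once per row.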

7 Comments

Awesome! Very clever, I like the idea of using nanmean. When I run it though, I get an error saying "output parameter for reduction operation add has too many dimensions", which is probably coming from the last line of the code, right?
Oh whoops, I made that change after I tested it. Just use D = np.nanmean(D, -1) without the out parameter; since the reduction changes the shape, you can't do it in-place.
Doesn't setting D = np.nanmean(D, -1) result in D=[[ nan nan nan] [ nan nan nan]]? I must be doing something wrong, but I think I see what you mean.
OK this is strange, apparently the output arrays from np.broadcast_arrays self-link, so if you change an entry in one row it changes in all rows. It's fixed now by making an explicit copy. Since it must be an explicit copy, you may be able to do this just as nicely with np.tile or np.repeat instead of np.broadcast_arrays. Sorry, didn't notice this because I did the astype after the broadcast when I was testing, so it worked with the copy :P
Woah, that is a little weird, I never would have caught that. np.tile works nicely, you're right. Thanks so much for the help! np.nanmean has been added to my list of favorite things. Thanks again!
