
What's the appropriate way to get the average value of a large masked array? Normally I would just call .mean(), but for very large arrays this fails for me.

Consider creating an array of a million elements, all with value 500, like:

a = np.ones(1000000, dtype=np.int16) * 500

Then create a random mask and combine both in a new masked array:

mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)

The array b inherits the dtype and is also int16.
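Putting the setup above together, a quick check confirms the dtype inheritance (same construction as the snippets above, sizes as stated):

```python
import numpy as np

# Build the example: a million int16 values of 500 with a random 0/1 mask.
a = np.ones(1000000, dtype=np.int16) * 500
mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)

# The masked array keeps the dtype of the underlying data.
print(b.dtype)  # int16
```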

Getting the average value of b can be done in different ways, and they all give the same result here. (The non-ma functions ignore the mask, though, and shouldn't be used.)

print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, 500.0, 500.0, 500.0)

But if I increase the original array size from one million to ten million, the result becomes:

print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, -359.19365132075774, -359.19365132075774, -359.19365132075774)

Now only np.average seems correct, but as said it ignores the mask and calculates the average over the entire array; this can be shown by changing some of the masked values, e.g. with b[b.mask] = 1000. I would expect np.mean to do the same, though.

Casting the masked array b to a float32 results in:

b = b.astype(np.float32)
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(511.18945, 510.37895680000003, 510.37895680000003, 510.37895680000003)
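The float32 result appears off for a different reason: precision rather than overflow. A minimal sketch of the effect, assuming the running sum is accumulated in float32 (my reading, not something the output above states directly): once the total is near 2.5e9, float32's spacing between representable values is 256, so each addend of 500 is rounded and small ones can vanish entirely.

```python
import numpy as np

# Near 2.5e9, consecutive float32 values are 256 apart.
s = np.float32(2_500_000_000)
print(s + np.float32(100) == s)  # True: 100 is absorbed entirely
print(s + np.float32(500) == s)  # False, but the result rounds up to s + 512
```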

And casting the masked array b to float64 (which, according to the documentation, should happen by default) results in:

b = b.astype(np.float64)
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, 500.0, 500.0, 500.0)

So casting to float64 seems to work, but I would rather avoid it, since it would greatly increase the memory footprint.

And while testing some of this, I also noticed that calling np.ma.average on the non-masked array (a) gives the correct result when the size is one million and the wrong result when it's ten million, while np.ma.mean is correct for both sizes.

Can someone explain the relation between the dtype and the size of the array in this example? It's a bit of a mystery to me when this occurs and how to handle it properly.

All of this is done with NumPy 1.8.1 on a 64-bit Windows 7 machine, installed via conda.

Here is a notebook replicating what I have done:

http://nbviewer.ipython.org/gist/RutgerK/69b60da73f464900310a

1 Answer


this can be shown when changing some of the masked values with b[b.mask] = 1000 for example. I would expect np.mean to do the same though.

This is not correct: b.mask is True where the values are masked. When you assign a new value to the masked elements you unmask them, effectively making all the values in the array valid. Use b[np.invert(b.mask)] instead.
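A minimal illustration of this unmasking behavior (small hypothetical array, not the one from the question):

```python
import numpy as np

a = np.ones(10, dtype=np.int16) * 500
b = np.ma.masked_array(a, mask=[0, 1] * 5)

# Assigning through the mask writes the values AND clears their mask bits.
b[b.mask] = 1000
print(np.any(b.mask))  # False -- nothing is masked any more
```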

So this should work:

import numpy as np

a = np.ones(10000000, dtype=np.int64) * 500

mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)

b[np.invert(b.mask)] = 1000
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))

Which will give you the correct value for everything except np.average.

Apart from that, the negative/incorrect values you are getting are caused by integer overflow. Using dtype=np.int64 instead should solve it.
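A quick check of the sums involved supports this (the int32 accumulator width is an assumption about where the overflow happens, spelled out in the comments below the answer):

```python
import numpy as np

# Roughly half of the ten million values are unmasked, so the sum of the
# valid 500s is about 2.5e9 -- more than an int32 can hold (~2.147e9).
print(5_000_000 * 500 > np.iinfo(np.int32).max)  # True

# With 64-bit storage the sum fits and the mean comes out right.
n = 10_000_000
b = np.ma.masked_array(np.ones(n, dtype=np.int64) * 500,
                       mask=np.random.randint(0, 2, n))
print(np.ma.mean(b))  # 500.0
```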

Edit: Another option would be to use Python integers with dtype=object instead of fixed-width integers, but that would be slower. This change makes np.average crash, but the rest of the methods work properly.

Edit 2: As discussed in the comments, in this case it isn't necessary to increase the size of the array's elements; we can just call np.mean(b, dtype=np.float64) so that np.mean uses a bigger accumulator and avoids overflowing.
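In full, that looks like the following (same construction as the question, int16 storage kept):

```python
import numpy as np

n = 10_000_000
a = np.ones(n, dtype=np.int16) * 500
b = np.ma.masked_array(a, mask=np.random.randint(0, 2, n))

# The array stays int16; only the accumulator used for the sum is float64.
print(np.mean(b, dtype=np.float64))  # 500.0
```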


4 Comments

You're right about unmasking assigned values, thanks for pointing that out. But what's overflowing? Is NumPy computing a sum somewhere in the background? My values are all 500 in this example, far below the int16 limit. If NumPy is summing somewhere, what dtype does it use? I don't get any message/error about integer overflow when running my script.
Yes, you usually won't get any notice of integer overflow; it's one of the dangers of using NumPy. Working with fixed-width integers is much faster, but there is no overflow checking, so you have to keep in mind what data you are using. In this case you are right that you can keep using int16 for your array; the problem is that np.mean by default uses an accumulator of the same size as the original data. You can fix it by calling np.mean(b, dtype=np.float64) to tell np.mean to use a bigger accumulator instead of one the size of the original data.
Thanks. Forcing the dtype certainly works, but I'm not convinced about what's going wrong. The normal mean function works fine anyway, without setting the accumulator to float64. The ma.mean function also works fine, unless I specify a per-element mask. This works: np.ma.array(np.ones(n, dtype=np.int16) * 500).mean(); this doesn't: np.ma.array(np.ones(n, dtype=np.int16) * 500, mask=np.random.randint(0, 2, n)).mean(), for n of ten million. Is a larger accumulator needed if you mask elements?
Yes, according to the documentation np.mean and np.ma.mean should be equivalent, but on my system they are defined in different places, and the default accumulator that np.ma.sum uses for the sum when calculating the mean is the system's standard integer size, which may be int32, while for np.mean the default accumulator dtype is float64.
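Whatever the default accumulator is on a given platform/version, it can always be pinned explicitly. A sketch that sidesteps the platform dependence entirely (a 64-bit accumulator is an assumption sufficient for this data, not the library default):

```python
import numpy as np

n = 10_000_000
b = np.ma.masked_array(np.full(n, 500, dtype=np.int16),
                       mask=np.random.randint(0, 2, n))

# With an explicit 64-bit accumulator the masked sum matches the exact value;
# it is only the default accumulator width that varies between systems.
exact = 500 * b.count()
print(b.sum(dtype=np.int64) == exact)  # True
```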
