What's the appropriate way to get the average value from a large masked array? Normally I would just call .mean(), but for very large arrays this fails for me.
Consider creating an array of a million elements, all with value 500, like:
a = np.ones(1000000, dtype=np.int16) * 500
Then create a random mask and combine both in a new masked array:
mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)
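Putting the setup together, here is a self-contained version of the snippet (a fixed seed is added here only to make runs repeatable):

```python
import numpy as np

np.random.seed(0)  # added for repeatability; not in the original setup

# one million int16 elements, all 500
a = np.ones(1000000, dtype=np.int16) * 500

# random 0/1 mask, combined into a masked array
mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)

print(b.dtype)    # the masked array inherits int16
print(b.count())  # number of unmasked elements
```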
The array b inherits the dtype and is also int16.
Getting the average value from b can be done in several ways, and all give the same result. But the non-ma functions ignore the mask and shouldn't be used here.
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, 500.0, 500.0, 500.0)
But if I increase the original array size from one million to ten million, the result becomes:
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, -359.19365132075774, -359.19365132075774, -359.19365132075774)
Now only np.average seems correct, but, as said, it ignores the mask and calculates the average over the entire array; this can be shown by changing some of the masked values, e.g. with b[b.mask] = 1000. I would expect np.mean to do the same, though.
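As a side note on that check: in the NumPy versions I know, assigning through b[b.mask] = 1000 also clears the mask at those positions, so writing through b.data (which leaves the mask intact) makes for a cleaner test of whether a function actually respects the mask:

```python
import numpy as np

a = np.ones(1000000, dtype=np.int16) * 500
mask = np.zeros(a.size, dtype=bool)
mask[::2] = True  # deterministic mask: every other element masked
b = np.ma.masked_array(a, mask=mask)

# modify the underlying data only, leaving the mask untouched
b.data[np.asarray(b.mask)] = 1000

# a mask-respecting mean must still be exactly 500,
# since only masked positions were changed
print(b.mean(dtype=np.float64))
# the raw-data mean, by contrast, is pulled toward 1000
print(b.data.mean())
```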
Casting the masked array b to a float32 results in:
b = b.astype(np.float32)
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(511.18945, 510.37895680000003, 510.37895680000003, 510.37895680000003)
And casting the masked array b to a float64 (which, according to the documentation, should happen by default) results in:
b = b.astype(np.float64)
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, 500.0, 500.0, 500.0)
So casting to float64 seems to work, but I would rather avoid this since it would greatly increase the memory footprint.
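If the problem is indeed the dtype used to accumulate the sum, one way to avoid the big copy might be to keep the data as int16 and only request a float64 accumulator via the dtype argument of MaskedArray.mean (assuming your NumPy version honors it, as its signature suggests):

```python
import numpy as np

a = np.ones(10000000, dtype=np.int16) * 500
mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)

# keep the int16 data in place; only the running sum is float64
m = b.mean(dtype=np.float64)
print(m)
```

This keeps the ten-million-element array at 2 bytes per element instead of 8, while the reduction itself is done in float64.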
And while testing some of this, I also noticed that calling np.ma.average on the non-masked array (a) gives the correct result when the size is one million and the wrong result when it's ten million, while np.ma.mean is correct for both sizes.
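A minimal check of that asymmetry (the np.ma.average result may well differ between NumPy versions, so only the mean path is asserted here):

```python
import numpy as np

a = np.ones(10000000, dtype=np.int16) * 500

# on a plain ndarray, np.ma.mean falls back to ndarray.mean,
# which uses a float64 accumulator for integer input
print(np.ma.mean(a))

# np.ma.average may take a different code path internally
print(np.ma.average(a))
```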
Can someone explain the relation between the dtype and the size of the array in this example? It's a bit of a mystery to me when this occurs and how to handle it properly.
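My own suspicion (just a guess) is that the sum is accumulated in a 32-bit integer somewhere along the way: with roughly five million unmasked elements of value 500, the true sum is about 2.5e9, which exceeds the int32 maximum of about 2.147e9 and wraps around to a negative number. Forcing a 32-bit accumulator on the plain array reproduces that kind of wrap-around:

```python
import numpy as np

a = np.ones(10000000, dtype=np.int16) * 500

# the true sum, 5e9, does not fit in int32 (max ~2.147e9)
s = a.sum(dtype=np.int32)  # force a 32-bit accumulator
print(s)                   # wraps around modulo 2**32
print(s / a.size)          # a nonsense "mean" of the same flavor
```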
All of this is done in NumPy 1.8.1 on a 64-bit Windows 7 machine, installed via conda.
Here is a notebook replicating what I have done:
http://nbviewer.ipython.org/gist/RutgerK/69b60da73f464900310a