6

I am trying to calculate the moving average in a large numpy array that contains NaNs. Currently I am using:

import numpy as np

def moving_average(a, n=5):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n-1:] / n

When calculating with a masked array:

x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx).filled(np.nan)

print(y)

>>> array([3.8,3.8,3.6,nan,nan,nan,2,2.4,nan,nan,nan,2.8,2.6])

The result I am looking for (below) should ideally have NaNs only where the original array, x, had NaNs, and each average should be taken over the number of non-NaN elements in its window (so I need some way to vary the divisor n inside the function).

y = array([4.75,4.75,nan,4.4,3.75,2.33,3.33,4,nan,nan,3,3.5,nan,3.25,4,4.5,3])

I could loop over the entire array and check index by index but the array I am using is very large and that would take a long time. Is there a numpythonic way to do this?

7
  • So, is that [4.75,4.75,nan,4.4,3.75,2.33,3.33,4,nan,nan,3,3.5,nan,3.25] the expected output? If so, why is there a NaN as the third element? Commented Oct 7, 2016 at 14:12
  • @Divakar It is the expected output. In the original array (x), there is a nan as the third entry. Commented Oct 7, 2016 at 14:15
  • So why do we have a NaN as the second last entry in the expected output? Commented Oct 7, 2016 at 14:16
  • Edited it to show the remaining averages; forgot to add them sorry. Commented Oct 7, 2016 at 14:19
  • @Divakar the answer with the np.cumsum approach gave the fastest result with my actual data (changed the accepted answer). All of the answers gave the result I wanted. Commented Oct 7, 2016 at 17:54

6 Answers

2

pandas has built-in rolling-window functionality that handles this well. For example:

import numpy as np
import pandas as pd

x = np.array([np.nan, np.nan, 3, 3, 3, np.nan, 5, 7, 7])

# requires three valid values in a row or the resulting value is null

print(pd.Series(x).rolling(3).mean())

#output
nan,nan,nan, nan, 3, nan, nan, nan, 6.333

# only requires 2 valid values out of three for size=3 window

print(pd.Series(x).rolling(3, min_periods=2).mean())

#output
nan, nan, nan, 3, 3, 3, 4, 6, 6.3333

You can experiment with the window size and min_periods, and fill in nulls as well, all in one chained line of code.
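
As a rough sketch of such a chained expression applied to the array from the question: rolling() averages over the window that ends at each index, while the expected output in the question averages over the window that starts at each index. One way around this, assuming that forward-looking reading of the question is the intended one, is to reverse the series, roll, and reverse back. The names s, forward_mean and y are only illustrative.

```python
import numpy as np
import pandas as pd

x = np.array([1., 3, np.nan, 7, 8, 1, 2, 4, np.nan, np.nan, 4, 4, np.nan, 1, 3, 6, 3])
s = pd.Series(x)

# rolling() looks backward, so reverse, roll, and reverse again to get
# windows that start at each index and look forward.
# min_periods=1 lets a window with a single valid value still produce a mean.
forward_mean = s[::-1].rolling(5, min_periods=1).mean()[::-1]

# Keep NaN wherever the input itself was NaN.
y = forward_mean.where(s.notna()).to_numpy()
```

With min_periods=1 the shorter windows at the end of the array are averaged over whatever values remain, which is what the expected output in the question does.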

1

In addition to the other answers, you can still use cumsum to achieve this:

import numpy as np

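def moving_average(a, n=5):
    # Windowed sum, with masked entries counted as 0
    ret = np.cumsum(a.filled(0))
    ret[n:] = ret[n:] - ret[:-n]
    # Number of unmasked entries in the same window
    counts = np.cumsum(~a.mask)
    counts[n:] = counts[n:] - counts[:-n]
    # Divide by the valid count; restore NaN where the input was masked
    ret[~a.mask] /= counts[~a.mask]
    ret[a.mask] = np.nan

    return ret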

x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
mx = np.ma.masked_array(x,np.isnan(x))
y = moving_average(mx)
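
Note that the function above averages over the window that ends at each index, whereas the expected output in the question averages over the window that starts at each index. Below is a minimal sketch of the same cumsum idea with forward-looking windows, assuming that alignment is the one wanted. The name forward_moving_average is hypothetical, and it takes the plain array with NaNs rather than a masked array.

```python
import numpy as np

def forward_moving_average(x, n=5):
    """Mean of the non-NaN values in x[i:i+n], NaN where x[i] is NaN."""
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)

    # Prefix sums with a leading zero, so sum(x[i:j]) == csum[j] - csum[i].
    csum = np.concatenate(([0.0], np.cumsum(np.where(valid, x, 0.0))))
    ccnt = np.concatenate(([0], np.cumsum(valid)))

    start = np.arange(x.size)
    stop = np.minimum(start + n, x.size)  # windows shrink at the end of the array

    sums = csum[stop] - csum[start]
    counts = ccnt[stop] - ccnt[start]

    # Divide only where x[i] is valid; there the count is at least 1.
    out = np.full(x.shape, np.nan)
    np.divide(sums, counts, out=out, where=valid)
    return out

x = np.array([1., 3, np.nan, 7, 8, 1, 2, 4, np.nan, np.nan, 4, 4, np.nan, 1, 3, 6, 3])
y = forward_moving_average(x)
```

Like the original, this does a fixed number of passes over the array and never loops in Python, so it should scale similarly on large inputs.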

0

You could create a temporary array and use np.nanmean() (available since NumPy 1.8):

import numpy as np
temp = np.vstack([x[i:x.size-4+i] for i in range(5)])  # stack the five shifted slices vertically
means = np.nanmean(temp, axis=0)  # one mean per full 5-element window

and put the original NaNs back in place with means[np.isnan(x[:-4])] = np.nan

However, this looks redundant both in terms of memory (stacking the same array, shifted, 5 times) and computation.

2 Comments

np.nanmean() does not return nan anywhere in the output array.
@krakenwagon, yes, you add them back with the line I edited right before your comment.
0

If I understand correctly, you want to create a moving average and then populate the resulting elements as nan if their index in the original array was nan.

>>> import numpy as np

>>> inc = 5  # the moving-average window size

>>> x = np.array([1.,3,np.nan,7,8,1,2,4,np.nan,np.nan,4,4,np.nan,1,3,6,3])
>>> mov_avg = np.array([np.nanmean(x[idx:idx+inc]) for idx in range(len(x))])

>>> # Determine indices in x that are NaN
>>> nan_idxs = np.where(np.isnan(x))[0]

>>> # Populate output array with NaNs
>>> mov_avg[nan_idxs] = np.nan
>>> mov_avg
array([ 4.75, 4.75, nan, 4.4, 3.75, 2.33333333, 3.33333333, 4., nan, nan, 3., 3.5, nan, 3.25, 4., 4.5, 3.])

0

Here's an approach using strides -

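import numpy as np

w = 5  # Window size
n = x.strides[0]
# Full windows: one row per start index, each row holding w consecutive elements
avgs = np.nanmean(np.lib.stride_tricks.as_strided(
    x, shape=(x.size-w+1, w), strides=(n, n)), 1)

# Remaining windows at the end: pad the last w-1 elements with NaNs
x_rem = np.append(x[-w+1:], np.full(w-1, np.nan))
avgs_rem = np.nanmean(np.lib.stride_tricks.as_strided(
    x_rem, shape=(w-1, w), strides=(n, n)), 1)
avgs = np.append(avgs, avgs_rem)
avgs[np.isnan(x)] = np.nan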
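
On NumPy 1.20 or later, the same windowed view can be built with numpy.lib.stride_tricks.sliding_window_view, which checks shapes for you instead of relying on hand-computed strides. The following is only a sketch under that version assumption. The name forward_nanmean is hypothetical, and it sums and counts the valid entries directly instead of calling np.nanmean, so an all-NaN window does not trigger a warning.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def forward_nanmean(x, w=5):
    """Mean of the non-NaN values in x[i:i+w], NaN where x[i] is NaN."""
    x = np.asarray(x, dtype=float)

    # Pad the tail with NaNs so every index gets a full-width window.
    padded = np.concatenate((x, np.full(w - 1, np.nan)))
    windows = sliding_window_view(padded, w)  # read-only view, shape (x.size, w)

    valid = ~np.isnan(windows)
    sums = np.where(valid, windows, 0.0).sum(axis=1)
    counts = valid.sum(axis=1)

    # Divide only where x[i] itself is valid; there the count is at least 1.
    out = np.full(x.shape, np.nan)
    np.divide(sums, counts, out=out, where=~np.isnan(x))
    return out

x = np.array([1., 3, np.nan, 7, 8, 1, 2, 4, np.nan, np.nan, 4, 4, np.nan, 1, 3, 6, 3])
avgs = forward_nanmean(x)
```

The view itself takes no extra memory, but the valid mask and the np.where result are full (x.size, w) arrays, so for very large inputs a cumsum-based approach is likely lighter.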

0

The bottleneck package should do the trick reliably and quickly. Here is a slightly adjusted example from https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.move_mean:

>>> import numpy as np
>>> import bottleneck as bn
>>> a = np.array([1.0, 2.0, 3.0, np.nan, 5.0])
>>> bn.move_mean(a, window=2)
array([ nan,  1.5,  2.5,  nan,  nan])
>>> bn.move_mean(a, window=2, min_count=1)
array([ 1. ,  1.5,  2.5,  3. ,  5. ])

Note that the resulting means correspond to the last index of the window.

The package is available from the Ubuntu repositories, pip, etc. It can operate along an arbitrary axis of a NumPy array, and it is claimed to be faster than a plain NumPy implementation in many cases.
