Randomly remove 30% of values in numpy array

Question

I have a 2D numpy array that contains my values (some of them can be NaN). I want to remove 30% of the non-NaN values and replace them with the mean of the array. How can I do so? What I tried so far:

def spar_removal(array, mean_value, sparseness):
    array1 = deepcopy(array)
    array2 = array1
    spar_size = int(round(array2.shape[0]*array2.shape[1]*sparseness))
    for i in range (0, spar_size):
        index = np.random.choice(np.where(array2 != mean_value)[1])
        array2[0, index] = mean_value
    return array2

But this only ever modifies the first row of my array. How can I pick elements from all over the array? It seems that choice works only on one-dimensional input. I guess what I want is to compute the (x, y) index pairs whose values I will replace with mean_value.

Does it need to be exactly 30% of the non-NaN values, or does each non-NaN value need a 30% chance of being replaced? E.g. if we had 100 non-NaN values, do you need exactly 30 of them to be replaced, or would you be okay with each value getting a 30% chance of being replaced so that sometimes you'd get 27 replacements and very rarely 45?
– DSM, Commented Jun 9, 2018 at 13:46
There's a difference between remove and replace. Remove implies, at least to me, reducing the shape of the array, e.g. from (100,100) to (90,90) or some such value. While it is easy to remove a whole row or column, removing individual elements is hard without making the array ragged.
– hpaulj, Commented Jun 9, 2018 at 18:11
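
The distinction DSM raises can be made concrete with a small sketch of the second reading, where every non-NaN value independently gets a 30% chance of being replaced. This is only an illustration under assumptions: the sample array, the seed, and the names `valid`, `chosen`, and `result` are invented here, not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

x = np.array([[1, 2, 3, 4],
              [1, 2, 3, 4],
              [np.nan, np.nan, np.nan, np.nan],
              [1, 2, 3, 4]], dtype=float)

valid = ~np.isnan(x)  # True where a real value is present

# Each valid cell is selected independently with probability 0.3,
# so the number of replacements varies from run to run.
chosen = valid & (rng.random(x.shape) < 0.3)

result = x.copy()
result[chosen] = np.nanmean(x)  # mean of the original non-NaN values

print(chosen.sum(), "of", valid.sum(), "values replaced")
```

With this approach the count of replaced values is random; if exactly 30% must be replaced, indices have to be sampled without replacement instead.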

jedwards · Accepted Answer · 2018-06-09 13:45:10Z

5

There's likely a better way, but consider:

import numpy as np

x = np.array([[1,2,3,4],
              [1,2,3,4],
              [np.nan, np.nan, np.nan, np.nan],  # np.NaN alias was removed in NumPy 2.0
              [1,2,3,4]])

# Get a vector of flat (1-D) indices of the non-NaN elements
indices = np.where(np.isfinite(x).ravel())[0]

# Shuffle the indices, select the first 30% (rounded down with int())
to_replace = np.random.permutation(indices)[:int(indices.size * 0.3)]

# Replace those indices with the mean (ignoring NaNs)
x[np.unravel_index(to_replace, x.shape)] = np.nanmean(x)

print(x)

Example Output

[[ 2.5  2.   2.5  4. ]
 [ 1.   2.   3.   4. ]
 [ nan  nan  nan  nan]
 [ 2.5  2.   3.   4. ]]

NaNs never change, and floor(0.3 * number of non-NaN elements) elements are set to the mean (computed ignoring NaNs).
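
The same steps could be wrapped in a reusable function, roughly as sketched below. The function name `replace_fraction_with_mean`, its parameters, and the use of `np.random.default_rng` are assumptions made for this illustration and are not part of the answer above. Note that it tests for NaN with `np.isnan` rather than `np.isfinite`, so infinities would count as ordinary values here.

```python
import numpy as np

def replace_fraction_with_mean(arr, fraction=0.3, seed=None):
    """Return a copy of arr with floor(fraction * n_valid) non-NaN cells
    set to the NaN-ignoring mean (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    out = np.array(arr, dtype=float)             # copy; the input is left untouched
    flat_valid = np.flatnonzero(~np.isnan(out))  # flat indices of non-NaN cells
    n_replace = int(flat_valid.size * fraction)  # rounded down, as in the answer
    picked = rng.choice(flat_valid, size=n_replace, replace=False)
    out.flat[picked] = np.nanmean(out)           # mean is evaluated before assignment
    return out

x = np.array([[1, 2, 3, 4],
              [1, 2, 3, 4],
              [np.nan, np.nan, np.nan, np.nan],
              [1, 2, 3, 4]], dtype=float)

y = replace_fraction_with_mean(x, fraction=0.3, seed=1)
print(y)
```

Passing a `seed` makes the selection repeatable, which can help when comparing runs; leaving it as `None` gives a different selection each time.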

answered Jun 9, 2018 at 13:45

Ernie Yang · Accepted Answer · 2018-06-09 13:59:56Z

1

Since np.where returns a tuple of two arrays holding the row and column indices, this is what you want:

import copy
import numpy as np

def spar_removal(array, mean_value, sparseness):
    array2 = copy.deepcopy(array)
    # NaN != NaN, so this comparison keeps only the non-NaN positions
    indexs = np.where(array2 == array2)
    indexsL = len(indexs[0])
    # Number of values to replace, as a fraction of the non-NaN count
    spar_size = int(round(indexsL * sparseness))

    # Draw distinct positions, then map each back to its (row, column) pair
    for i in np.random.choice(indexsL, spar_size, replace=False):
        indexX = indexs[0][i]
        indexY = indexs[1][i]
        array2[indexX, indexY] = mean_value

    return array2
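
The per-element loop above can probably be replaced by a single indexed assignment. The following is a sketch only: `spar_removal_vectorized` and its `seed` parameter are hypothetical names introduced here, and it assumes the fraction is meant to apply to the non-NaN count.

```python
import numpy as np

def spar_removal_vectorized(array, mean_value, sparseness, seed=None):
    """Loop-free variant of spar_removal (hypothetical, for illustration)."""
    rng = np.random.default_rng(seed)
    result = np.array(array, dtype=float)        # work on a copy
    rows, cols = np.nonzero(~np.isnan(result))   # matching (row, col) pairs of non-NaN cells
    n_replace = int(round(rows.size * sparseness))
    pick = rng.choice(rows.size, size=n_replace, replace=False)
    result[rows[pick], cols[pick]] = mean_value  # one fancy-indexed assignment
    return result

data = np.array([[1, 2, 3, 4],
                 [1, 2, 3, 4],
                 [np.nan, np.nan, np.nan, np.nan],
                 [1, 2, 3, 4]], dtype=float)

out = spar_removal_vectorized(data, np.nanmean(data), 0.3, seed=0)
print(out)
```

Indexing with the two arrays `rows[pick]` and `cols[pick]` selects exactly the sampled (row, column) pairs, which is the 2D selection the question was asking for.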

edited Jun 9, 2018 at 13:59

answered Jun 9, 2018 at 13:53
