Remove NaN from an array of arrays

Question

I would like to remove NaNs from a set of arrays within an array. I have seen questions where people have asked how to remove rows/columns, but here I specifically would like to remove those elements.

Here is the data, where I normalize each array independently:

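import numpy as np
import pandas as pd
from sklearn import preprocessing

# Ragged input, so NumPy needs dtype=object to build an array of lists
sequence = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                     [0.1, 0.2, 0.3, 0.4],
                     [0.5, 0.6, 0.7, 0.8, 0.9],
                     [9, 8, 7, 0.6, 0.5, 0.4]], dtype=object)

# Pad to a 2d array (NaN where the shorter lists end), one list per column
x = pd.DataFrame(sequence.tolist()).T.values

# Despite the variable name, StandardScaler standardizes (z-score) each column
min_max_scaler = preprocessing.StandardScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
sequence_normalized = df.T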

The result is a DataFrame in which each shorter row is padded with NaN up to the length of the longest array.

What I expect is an output similar to:

([[-1.54, -1.16, -0.77, -0.38, 0.0, 0.38, 0.77, 1.16, 1.54],
  [-1.34, -0.44, 0.44, 1.34],
  [-1.41, -0.71, 0.0, 0.71, 1.41],
  [1.25, 0.98, 0.72, -0.96, -0.98, -1.01]])

hpaulj · Accepted Answer · 2020-01-02 20:09:55Z

In [341]: import numpy as np
     ...: import pandas as pd
     ...: from sklearn import preprocessing

In [342]: sequence = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
     ...:                   [0.1, 0.2, 0.3, 0.4],
     ...:                   [0.5, 0.6, 0.7, 0.8, 0.9],
     ...:                   [9, 8, 7, 0.6, 0.5, 0.4]], dtype=object)
     ...:
     ...: x = pd.DataFrame(sequence.tolist()).T.values
     ...:
     ...: min_max_scaler = preprocessing.StandardScaler()
     ...: x_scaled = min_max_scaler.fit_transform(x)

sequence is an array of lists:

In [343]: sequence                                                              
Out[343]: 
array([list([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
       list([0.1, 0.2, 0.3, 0.4]), list([0.5, 0.6, 0.7, 0.8, 0.9]),
       list([9, 8, 7, 0.6, 0.5, 0.4])], dtype=object)

Putting this in the dataframe (and then back out) makes a 2d array with nan padding. Running this through the scaling:

In [344]: x_scaled                                                              
Out[344]: 
array([[-1.54919334, -1.34164079, -1.41421356,  1.25177113],
       [-1.161895  , -0.4472136 , -0.70710678,  0.98824036],
       [-0.77459667,  0.4472136 ,  0.        ,  0.7247096 ],
       [-0.38729833,  1.34164079,  0.70710678, -0.96188729],
       [ 0.        ,         nan,  1.41421356, -0.98824036],
       [ 0.38729833,         nan,         nan, -1.01459344],
       [ 0.77459667,         nan,         nan,         nan],
       [ 1.161895  ,         nan,         nan,         nan],
       [ 1.54919334,         nan,         nan,         nan]])
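To make the padding step concrete, here is a minimal sketch on a shortened input. It assumes that a NaN-aware column z-score built from np.nanmean and np.nanstd is an acceptable stand-in for the scaler (StandardScaler also disregards NaN when fitting and keeps it when transforming); the names ragged, padded and scaled are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Ragged input: lists of different lengths (shortened, illustrative data).
ragged = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]]

# pandas pads the short row with NaN; .T puts each original list in a column.
padded = pd.DataFrame(ragged).T.values  # shape (4, 2)

# Column-wise z-score that skips NaN when computing mean and std
# (population std, ddof=0). The NaN padding itself stays NaN.
scaled = (padded - np.nanmean(padded, axis=0)) / np.nanstd(padded, axis=0)

print(np.isnan(scaled))
```

The NaN entries mark only where a shorter list ended, which is why they can be dropped afterwards without losing data.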

An alternative is to pass each list through the scaling by itself:

In [345]: [min_max_scaler.fit_transform(np.reshape(alist, (-1, 1))).ravel()
     ...:  for alist in sequence]
Out[345]: 
[array([-1.54919334, -1.161895  , -0.77459667, -0.38729833,  0.        ,
         0.38729833,  0.77459667,  1.161895  ,  1.54919334]),
 array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]),
 array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356]),
 array([ 1.25177113,  0.98824036,  0.7247096 , -0.96188729, -0.98824036,
        -1.01459344])]
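The same per-list scaling can be sketched without scikit-learn, assuming a plain z-score with the population standard deviation (ddof=0) is what is wanted. The helper name zscore is hypothetical and not part of any library.

```python
import numpy as np

def zscore(values):
    """Standardize one 1d sequence to zero mean and unit variance (ddof=0)."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.mean()) / arr.std()

sequence = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
            [0.1, 0.2, 0.3, 0.4],
            [0.5, 0.6, 0.7, 0.8, 0.9],
            [9, 8, 7, 0.6, 0.5, 0.4]]

# Each list is scaled on its own, so no padding (and no NaN) is ever created.
normalized = [zscore(row) for row in sequence]

for row in normalized:
    print(np.round(row, 2))
```

This sketch does not guard against a constant list, whose standard deviation is zero.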

===

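There's a collection of numpy.nan... functions (np.nansum, np.nanmean, and so on) that operate on arrays while omitting the nan. Using a utility function from that module, we can remove the nan from each column of x_scaled. Note that _remove_nan_1d is a private helper, so its location and signature may differ between NumPy versions: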

In [349]: [np.lib.nanfunctions._remove_nan_1d(col)[0] for col in  x_scaled.T]   
Out[349]: 
[array([-1.54919334, -1.161895  , -0.77459667, -0.38729833,  0.        ,
         0.38729833,  0.77459667,  1.161895  ,  1.54919334]),
 array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]),
 array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356]),
 array([ 1.25177113,  0.98824036,  0.7247096 , -0.96188729, -0.98824036,
        -1.01459344])]

Or we could do the same thing by applying np.isnan directly:

In [351]: [col[~np.isnan(col)] for col in  x_scaled.T]                          
Out[351]: 
[array([-1.54919334, -1.161895  , -0.77459667, -0.38729833,  0.        ,
         0.38729833,  0.77459667,  1.161895  ,  1.54919334]),
 array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]),
 array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356]),
 array([ 1.25177113,  0.98824036,  0.7247096 , -0.96188729, -0.98824036,
        -1.01459344])]
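The np.isnan approach can be wrapped in a small helper. This is a sketch only: the function name strip_nan_columns and the tiny x_scaled array below are made up for illustration.

```python
import numpy as np

def strip_nan_columns(arr2d):
    """Return one 1d array per column of arr2d, with NaN entries removed."""
    arr = np.asarray(arr2d, dtype=float)
    return [col[~np.isnan(col)] for col in arr.T]

# Small stand-in for a NaN-padded, already scaled array (illustrative values).
x_scaled = np.array([[-1.0, -1.0],
                     [ 0.0,  1.0],
                     [ 1.0, np.nan]])

rows = strip_nan_columns(x_scaled)
print([r.tolist() for r in rows])  # [[-1.0, 0.0, 1.0], [-1.0, 1.0]]
```

Calling .tolist() on each result gives a list of lists if that is preferred over a list of arrays. This removes every NaN in a column, so it is only safe when the NaN values come from padding and not from the data itself.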

Nicolas Gervais · Accepted Answer · 2020-01-02 18:57:20Z

The rows of a pandas DataFrame need to be equally sized, so your only choice is to convert to string and replace the nan values with an empty string. There needs to be something at these locations; if not nan, that something can be an empty string.

sequence_normalized.astype(str).replace('nan', '')

1 Comment

Areza Over a year ago

Well, I am asking for an array of arrays or an array of lists, not a DataFrame.
