I would like to remove NaNs from a set of arrays within an array. I have seen questions where people ask how to remove whole rows or columns, but here I specifically want to remove only the NaN elements.

Here is my data, where I normalize each array independently:

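import numpy as np
import pandas as pd
from sklearn import preprocessing

# Rows have different lengths, so the array must be built with dtype=object
sequence = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                     [0.1, 0.2, 0.3, 0.4],
                     [0.5, 0.6, 0.7, 0.8, 0.9],
                     [9, 8, 7, 0.6, 0.5, 0.4]], dtype=object)

# One sequence per column; shorter sequences are padded with NaN
x = pd.DataFrame(sequence.tolist()).T.values

min_max_scaler = preprocessing.StandardScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
sequence_normalized = df.T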

The result is a DataFrame in which every row shorter than the longest one is padded with NaN.

What I expect is an output similar to:

([[-1.54, -1.16, -0.77, -0.38, 0.0, 0.38, 0.77, 1.16, 1.54],
  [-1.34, -0.44, 0.44, 1.34],
  [-1.41, -0.71, 0.0, 0.71, 1.41],
  [1.25, 0.98, 0.72, -0.96, -0.98, -1.01]])
2 Answers

In [342]: sequence = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],  
     ...:                   [0.1, 0.2, 0.3, 0.4], 
     ...:                   [0.5, 0.6, 0.7, 0.8, 0.9], 
     ...:                   [9, 8, 7, 0.6, 0.5, 0.4]]) 
     ...:  
     ...: x = pd.DataFrame(sequence.tolist()).T.values 
     ...:  
     ...: min_max_scaler = preprocessing.StandardScaler() 
     ...: x_scaled = min_max_scaler.fit_transform(x)                            

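sequence is a 1-D object-dtype array whose elements are lists (recent NumPy versions only build it from ragged input when dtype=object is passed explicitly):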

In [343]: sequence                                                              
Out[343]: 
array([list([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
       list([0.1, 0.2, 0.3, 0.4]), list([0.5, 0.6, 0.7, 0.8, 0.9]),
       list([9, 8, 7, 0.6, 0.5, 0.4])], dtype=object)

Putting this into a DataFrame (and taking the values back out) produces a 2-D array padded with nan. Running that through the scaler gives:

In [344]: x_scaled                                                              
Out[344]: 
array([[-1.54919334, -1.34164079, -1.41421356,  1.25177113],
       [-1.161895  , -0.4472136 , -0.70710678,  0.98824036],
       [-0.77459667,  0.4472136 ,  0.        ,  0.7247096 ],
       [-0.38729833,  1.34164079,  0.70710678, -0.96188729],
       [ 0.        ,         nan,  1.41421356, -0.98824036],
       [ 0.38729833,         nan,         nan, -1.01459344],
       [ 0.77459667,         nan,         nan,         nan],
       [ 1.161895  ,         nan,         nan,         nan],
       [ 1.54919334,         nan,         nan,         nan]])
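The nan padding survives because, as far as I know, StandardScaler disregards NaNs when fitting and leaves them in place when transforming. A minimal NumPy-only sketch of the same computation, assuming the population standard deviation (ddof=0) convention; `pad_with_nan` is a hypothetical helper name, not part of any library:

```python
import numpy as np

def pad_with_nan(rows):
    """Stack ragged rows into a 2-D float array, padding short rows with NaN.

    Hypothetical helper for illustration only.
    """
    width = max(len(r) for r in rows)
    out = np.full((len(rows), width), np.nan)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

rows = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        [0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8, 0.9],
        [9, 8, 7, 0.6, 0.5, 0.4]]

padded = pad_with_nan(rows).T        # shape (9, 4): one sequence per column
mean = np.nanmean(padded, axis=0)    # per-column mean, ignoring NaN
std = np.nanstd(padded, axis=0)      # per-column population std, ignoring NaN
scaled = (padded - mean) / std       # NaN positions stay NaN
```

The padded positions are still nan in `scaled`, which is why they have to be stripped in a separate step.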

An alternative is to pass each list through the scaling by itself:

In [345]: [min_max_scaler.fit_transform(np.reshape(alist,(-1,1))).ravel() for alist in sequence]
Out[345]: 
[array([-1.54919334, -1.161895  , -0.77459667, -0.38729833,  0.        ,
         0.38729833,  0.77459667,  1.161895  ,  1.54919334]),
 array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]),
 array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356]),
 array([ 1.25177113,  0.98824036,  0.7247096 , -0.96188729, -0.98824036,
        -1.01459344])]
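If scikit-learn is not needed for anything else, the per-list scaling can be sketched with plain NumPy. This assumes the same zero-mean, unit-variance scaling with the population standard deviation (ddof=0); `zscore` is a hypothetical helper name, and the sketch does not guard against a constant list, where the standard deviation would be zero:

```python
import numpy as np

def zscore(values):
    """Scale one list to zero mean and unit variance (population std, ddof=0).

    Hypothetical helper; a constant input would divide by zero here.
    """
    a = np.asarray(values, dtype=float)
    return (a - a.mean()) / a.std()

sequence = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
            [0.1, 0.2, 0.3, 0.4],
            [0.5, 0.6, 0.7, 0.8, 0.9],
            [9, 8, 7, 0.6, 0.5, 0.4]]

# Each list is scaled on its own, so no padding (and no nan) is ever created
normalized = [zscore(row) for row in sequence]
```

Because every list keeps its own length, the result is already a list of arrays and nothing has to be removed afterwards.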

===

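NumPy has a collection of np.nan... functions that operate on arrays while omitting the nan values. Using a utility function from that module, we can remove the nan from each column of x_scaled. Note that _remove_nan_1d is a private helper, so its location and behavior may change between NumPy versions: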

In [349]: [np.lib.nanfunctions._remove_nan_1d(col)[0] for col in  x_scaled.T]   
Out[349]: 
[array([-1.54919334, -1.161895  , -0.77459667, -0.38729833,  0.        ,
         0.38729833,  0.77459667,  1.161895  ,  1.54919334]),
 array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]),
 array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356]),
 array([ 1.25177113,  0.98824036,  0.7247096 , -0.96188729, -0.98824036,
        -1.01459344])]

Or we could do the same thing by applying np.isnan directly:

In [351]: [col[~np.isnan(col)] for col in  x_scaled.T]                          
Out[351]: 
[array([-1.54919334, -1.161895  , -0.77459667, -0.38729833,  0.        ,
         0.38729833,  0.77459667,  1.161895  ,  1.54919334]),
 array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]),
 array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356]),
 array([ 1.25177113,  0.98824036,  0.7247096 , -0.96188729, -0.98824036,
        -1.01459344])]
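The masking approach can be wrapped in a small reusable function. This is only a sketch; `drop_nan_per_column` is a hypothetical name, and the toy input below is made up for illustration:

```python
import numpy as np

def drop_nan_per_column(arr):
    """Return a list of 1-D arrays, one per column, with NaN entries removed.

    Hypothetical helper for illustration only.
    """
    arr = np.asarray(arr, dtype=float)
    # ~np.isnan(col) is a boolean mask selecting the non-NaN entries
    return [col[~np.isnan(col)] for col in arr.T]

# Toy padded input: the second column has only one real value
padded = np.array([[1.0, 4.0],
                   [2.0, np.nan],
                   [3.0, np.nan]])

result = drop_nan_per_column(padded)
```

Keep in mind that the mask removes every nan, so a genuine missing value inside a sequence would be dropped along with the padding.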

pandas DataFrame rows must all have the same length, so something has to occupy the missing positions. If you do not want nan there, your only choice within a DataFrame is to convert the values to strings and replace nan with an empty string:

sequence_normalized.astype(str).replace('nan', '')

1 Comment

Well, I am asking for an array of arrays or an array of lists, not a DataFrame.
