In [342]: sequence = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
...: [0.1, 0.2, 0.3, 0.4],
...: [0.5, 0.6, 0.7, 0.8, 0.9],
...: [9, 8, 7, 0.6, 0.5, 0.4]], dtype=object)
...:
...: x = pd.DataFrame(sequence.tolist()).T.values
...:
...: min_max_scaler = preprocessing.StandardScaler()
...: x_scaled = min_max_scaler.fit_transform(x)
sequence is a 1d object-dtype array whose elements are lists of different lengths:
In [343]: sequence
Out[343]:
array([list([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
list([0.1, 0.2, 0.3, 0.4]), list([0.5, 0.6, 0.7, 0.8, 0.9]),
list([9, 8, 7, 0.6, 0.5, 0.4])], dtype=object)
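Putting this in a dataframe (and then taking it back out) makes a 2d float array in which each list becomes one column, with the shorter columns padded with nan. The scaler (named min_max_scaler, although it is a StandardScaler) scales each column separately and leaves the nan in place: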
In [344]: x_scaled
Out[344]:
array([[-1.54919334, -1.34164079, -1.41421356, 1.25177113],
[-1.161895 , -0.4472136 , -0.70710678, 0.98824036],
[-0.77459667, 0.4472136 , 0. , 0.7247096 ],
[-0.38729833, 1.34164079, 0.70710678, -0.96188729],
[ 0. , nan, 1.41421356, -0.98824036],
[ 0.38729833, nan, nan, -1.01459344],
[ 0.77459667, nan, nan, nan],
[ 1.161895 , nan, nan, nan],
[ 1.54919334, nan, nan, nan]])
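The padding and column-wise scaling can also be sketched with NumPy alone, which makes it clearer why the nan padding does not disturb the result. This is a minimal sketch, not what the scaler does internally: pad_to_columns is a hypothetical helper, and the z-score is computed by hand with np.nanmean and np.nanstd (population standard deviation, ddof=0), which should agree with the x_scaled values shown above.

```python
import numpy as np

# Ragged input from the session above.
sequence = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8, 0.9],
    [9, 8, 7, 0.6, 0.5, 0.4],
]

def pad_to_columns(lists):
    """Hypothetical helper: one column per list, nan below the shorter ones."""
    n_rows = max(len(lst) for lst in lists)
    padded = np.full((n_rows, len(lists)), np.nan)
    for j, lst in enumerate(lists):
        padded[:len(lst), j] = lst
    return padded

x = pad_to_columns(sequence)

# Column-wise z-score; the nan-aware reductions skip the padding,
# and nan entries stay nan after the arithmetic.
x_scaled = (x - np.nanmean(x, axis=0)) / np.nanstd(x, axis=0)
```

Each column's mean and standard deviation are computed from its real values only, so every list is effectively scaled on its own.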
An alternative is to pass each list through the scaling by itself:
In [345]: [min_max_scaler.fit_transform(np.reshape(alist,(-1,1))).ravel() for alist in sequence]
Out[345]:
[array([-1.54919334, -1.161895 , -0.77459667, -0.38729833, 0. ,
0.38729833, 0.77459667, 1.161895 , 1.54919334]),
array([-1.34164079, -0.4472136 , 0.4472136 , 1.34164079]),
array([-1.41421356, -0.70710678, 0. , 0.70710678, 1.41421356]),
array([ 1.25177113, 0.98824036, 0.7247096 , -0.96188729, -0.98824036,
-1.01459344])]
===
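There's a collection of numpy.nan... functions (np.nansum, np.nanmean and so on) that operate on arrays while omitting the nan. Their module includes a private utility function, _remove_nan_1d, that we can use to remove the nan from each column of x_scaled. It returns a tuple, hence the [0], and because it is private it may move or change between numpy versions: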
In [349]: [np.lib.nanfunctions._remove_nan_1d(col)[0] for col in x_scaled.T]
Out[349]:
[array([-1.54919334, -1.161895 , -0.77459667, -0.38729833, 0. ,
0.38729833, 0.77459667, 1.161895 , 1.54919334]),
array([-1.34164079, -0.4472136 , 0.4472136 , 1.34164079]),
array([-1.41421356, -0.70710678, 0. , 0.70710678, 1.41421356]),
array([ 1.25177113, 0.98824036, 0.7247096 , -0.96188729, -0.98824036,
-1.01459344])]
Or we could do the same thing by applying np.isnan directly:
In [351]: [col[~np.isnan(col)] for col in x_scaled.T]
Out[351]:
[array([-1.54919334, -1.161895 , -0.77459667, -0.38729833, 0. ,
0.38729833, 0.77459667, 1.161895 , 1.54919334]),
array([-1.34164079, -0.4472136 , 0.4472136 , 1.34164079]),
array([-1.41421356, -0.70710678, 0. , 0.70710678, 1.41421356]),
array([ 1.25177113, 0.98824036, 0.7247096 , -0.96188729, -0.98824036,
-1.01459344])]
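The isnan mask can be wrapped in a small function if the un-padding is needed in more than one place. A minimal sketch, assuming nan occurs only as padding (any genuine missing value in the data would be dropped as well); unpad_columns is a hypothetical name, and the small padded array is made up for illustration:

```python
import numpy as np

def unpad_columns(padded):
    """Hypothetical helper: split a nan-padded 2d array into 1d arrays,
    one per column, keeping only the non-nan entries."""
    return [col[~np.isnan(col)] for col in padded.T]

# Two columns of unequal length, padded with nan.
padded = np.array([[1.0, 4.0],
                   [2.0, np.nan],
                   [3.0, np.nan]])
columns = unpad_columns(padded)
```

This uses only public numpy calls, so it does not depend on where the private helper lives in a given numpy version.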