Given a 1D array of indices:
a = array([1, 0, 3])
I want to one-hot encode this as a 2D array:
b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
Create a zeroed array b with enough columns, i.e. a.max() + 1.
Then, for each row i, set the a[i]th column to 1.
>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max() + 1))
>>> b[np.arange(a.size), a] = 1
>>> b
array([[ 0., 1., 0., 0.],
[ 1., 0., 0., 0.],
[ 0., 0., 0., 1.]])
Simply project indexes on the corresponding identity matrix:
>>> indexes = [1, 0, 3]
>>> n_values = np.max(indexes) + 1
>>> np.eye(n_values)[indexes]
array([[ 0., 1., 0., 0.],
[ 1., 0., 0., 0.],
[ 0., 0., 0., 1.]])
values should be a Numpy array rather than a Python list, then it works in all dimensions, not only in 1D.np.max(values) + 1 as number of buckets might not be desirable if your data set is say randomly sampled and just by chance it may not contain max value. Number of buckets should be rather a parameter and assertion/check can be in place to check that each value is within 0 (incl) and buckets count (excl).numpy docs): at each location in the original matrix (values), we have an integer k, and we "put" the 1-hot vector eye(n)[k] in that location. This adds a dimension because we're "putting" a vector in the location of a scalar in the original matrix.In case you are using keras, there is a built in utility for that:
from keras.utils.np_utils import to_categorical
categorical_labels = to_categorical(int_labels, num_classes=3)
And it does pretty much the same as @YXD's answer (see source-code).
Here is what I find useful:
def one_hot(a, num_classes):
return np.squeeze(np.eye(num_classes)[a.reshape(-1)])
Here num_classes stands for number of classes you have. So if you have a vector with shape of (10000,) this function transforms it to (10000,C). Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].
Exactly what you wanted to have I believe.
PS: the source is Sequence models - deeplearning.ai
np.eye(num_classes)[a.reshape(-1)]. What you are simply doing is using np.eye` you are creating a diagonal matrix with each class index as 1 rest zero and later using the indexes provided by a.reshape(-1) producing the output corresponding to the index in np.eye(). I didn't understand the need of np.sqeeze since we use it to simply remove single dimensions which we will never have as in the output's dimension will always be (a_flattened_size, num_classes)You can also use eye function of numpy:
numpy.eye(number of classes)[vector containing the labels]
np.identity(num_classes)[indices] might be better. Nice answer!numpy.eye(num_class)[labels.reshape(-1)]. So for example the labels dimension is (x,1) then it will not produce (num_class, x, 1) dimension.You can use sklearn.preprocessing.LabelBinarizer:
Example:
import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))
output:
[[0 1 0 0]
[1 0 0 0]
[0 0 0 1]]
Amongst other things, you may initialize sklearn.preprocessing.LabelBinarizer() so that the output of transform is sparse.
>>> import numpy as np >>> import pandas >>> a = np.array([1,0,3]) >>> one_hot_encode=pandas.get_dummies(a) >>> print(one_hot_encode) 0 1 3 0 0 1 0 1 1 0 0 2 0 0 1 >>> print(one_hot_encode[1]) 0 1 1 0 2 0 Name: 1, dtype: uint8 >>> print(one_hot_encode[0]) 0 0 1 1 2 0 Name: 0, dtype: uint8 >>> print(one_hot_encode[3]) 0 0 1 0 2 1 Name: 3, dtype: uint8Here is a function that converts a 1-D vector to a 2-D one-hot array.
#!/usr/bin/env python
import numpy as np
def convertToOneHot(vector, num_classes=None):
"""
Converts an input 1-D vector of integers into an output
2-D array of one-hot vectors, where an i'th input value
of j will set a '1' in the i'th row, j'th column of the
output array.
Example:
v = np.array((1, 0, 4))
one_hot_v = convertToOneHot(v)
print one_hot_v
[[0 1 0 0 0]
[1 0 0 0 0]
[0 0 0 0 1]]
"""
assert isinstance(vector, np.ndarray)
assert len(vector) > 0
if num_classes is None:
num_classes = np.max(vector)+1
else:
assert num_classes > 0
assert num_classes >= np.max(vector)
result = np.zeros(shape=(len(vector), num_classes))
result[np.arange(len(vector)), vector] = 1
return result.astype(int)
Below is some example usage:
>>> a = np.array([1, 0, 3])
>>> convertToOneHot(a)
array([[0, 1, 0, 0],
[1, 0, 0, 0],
[0, 0, 0, 1]])
>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])
assert to check vector shape ;) ).assert ___ into if not ___ raise Exception(<Reason>).You can use the following code for converting into a one-hot vector:
let x is the normal class vector having a single column with classes 0 to some number:
import numpy as np
np.eye(x.max()+1)[x]
if 0 is not a class; then remove +1.
I think the short answer is no. For a more generic case in n dimensions, I came up with this:
# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
z[list(np.indices(z.shape[:-1])) + [a]] = 1
I am wondering if there is a better solution -- I don't like that I have to create those lists in the last two lines. Anyway, I did some measurements with timeit and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.
Just to elaborate on the excellent answer from K3---rnc, here is a more generic version:
def onehottify(x, n=None, dtype=float):
"""1-hot encode x with the max value n (computed from data if n is None)."""
x = np.asarray(x)
n = np.max(x) + 1 if n is None else n
return np.eye(n, dtype=dtype)[x]
Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API except that the latter works only with 1D ndarrays):
def onehottify_only_1d(x, n=None, dtype=float):
x = np.asarray(x)
n = np.max(x) + 1 if n is None else n
b = np.zeros((len(x), n), dtype=dtype)
b[np.arange(len(x)), x] = 1
return b
The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:
>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If using tensorflow, there is one_hot():
import tensorflow as tf
import numpy as np
a = np.array([1, 0, 3])
depth = 4
b = tf.one_hot(a, depth)
# <tf.Tensor: shape=(3, 3), dtype=float32, numpy=
# array([[0., 1., 0.],
# [1., 0., 0.],
# [0., 0., 0.]], dtype=float32)>
def one_hot(n, class_num, col_wise=True):
a = np.eye(class_num)[n.reshape(-1)]
return a.T if col_wise else a
# Column for different hot
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10))
# Row for different hot
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10, col_wise=False))
I recently ran into a problem of same kind and found said solution which turned out to be only satisfying if you have numbers that go within a certain formation. For example if you want to one-hot encode following list:
all_good_list = [0,1,2,3,4]
go ahead, the posted solutions are already mentioned above. But what if considering this data:
problematic_list = [0,23,12,89,10]
If you do it with methods mentioned above, you will likely end up with 90 one-hot columns. This is because all answers include something like n = np.max(a)+1. I found a more generic solution that worked out for me and wanted to share with you:
import numpy as np
import sklearn
sklb = sklearn.preprocessing.LabelBinarizer()
a = np.asarray([1,2,44,3,2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)
I hope someone encountered same restrictions on above solutions and this might come in handy
Here's a dimensionality-independent standalone solution.
This will convert any N-dimensional array arr of nonnegative integers to a one-hot N+1-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1)
def expand_integer_grid(arr, n_classes):
"""
:param arr: N dim array of size i_1, ..., i_N
:param n_classes: C
:returns: one-hot N+1 dim array of size i_1, ..., i_N, C
:rtype: ndarray
"""
one_hot = np.zeros(arr.shape + (n_classes,))
axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
one_hot[flat_grids + [arr.ravel()]] = 1
assert((one_hot.sum(-1) == 1).all())
assert(np.allclose(np.argmax(one_hot, -1), arr))
return one_hot
Here is an example function that I wrote to do this based upon the answers above and my own use case:
def label_vector_to_one_hot_vector(vector, one_hot_size=10):
"""
Use to convert a column vector to a 'one-hot' matrix
Example:
vector: [[2], [0], [1]]
one_hot_size: 3
returns:
[[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.]]
Parameters:
vector (np.array): of size (n, 1) to be converted
one_hot_size (int) optional: size of 'one-hot' row vector
Returns:
np.array size (vector.size, one_hot_size): converted to a 'one-hot' matrix
"""
squeezed_vector = np.squeeze(vector, axis=-1)
one_hot = np.zeros((squeezed_vector.size, one_hot_size))
one_hot[np.arange(squeezed_vector.size), squeezed_vector] = 1
return one_hot
label_vector_to_one_hot_vector(vector=[[2], [0], [1]], one_hot_size=3)
I am adding for completion a simple function, using only numpy operators:
def probs_to_onehot(output_probabilities):
argmax_indices_array = np.argmax(output_probabilities, axis=1)
onehot_output_array = np.eye(np.unique(argmax_indices_array).shape[0])[argmax_indices_array.reshape(-1)]
return onehot_output_array
It takes as input a probability matrix: e.g.:
[[0.03038822 0.65810204 0.16549407 0.3797123 ] ... [0.02771272 0.2760752 0.3280924 0.33458805]]
And it will return
[[0 1 0 0] ... [0 0 0 1]]
For a more general cases where you have a ndarray, you could use numpy broadcasting:
a = array([[[[1, 0, 3]]]]) # (1, 1, 1, 3)
b = (a[..., np.newaxis] == np.arange(np.max(a) + 1)).astype(np.int32)
which would give you:
array([[[[[0, 1, 0, 0],
[1, 0, 0, 0],
[0, 0, 0, 1]]]]], dtype=int32)
The result shape is (1, 1, 1, 3, 4).
Simple example using np.put_along_axis:
x = rng.integers(0, 10, 20)
t = np.zeros([20, 10])
np.put_along_axis(t, indices=np.expand_dims(x, 1), values=1, axis=1)
print(x)
print(t)
[8 6 5 2 3 0 0 0 1 8 6 9 5 6 9 7 6 5 5 9]
[[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
Use the following code. It works best.
def one_hot_encode(x):
"""
argument
- x: a list of labels
return
- one hot encoding matrix (number of labels, number of class)
"""
encoded = np.zeros((len(x), 10))
for idx, val in enumerate(x):
encoded[idx][val] = 1
return encoded
Found it here P.S You don't need to go into the link.
import numpy as np
a = np.array([1,0,3])
b = np.array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
from neuraxle.steps.numpy import OneHotEncoder
encoder = OneHotEncoder(nb_columns=4)
b_pred = encoder.transform(a)
assert b_pred == b
Link to documentation: neuraxle.steps.numpy.OneHotEncoder