31

I'm trying to convert a string array of categorical variables to an integer array of categorical variables.

Ex.

import numpy as np
a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
print a.dtype
>>> |S1

b = np.unique(a)
print b
>>>  ['a' 'b' 'c']

c = a.desired_function(b)
print c, c.dtype
>>> [1,2,3,1,2,3] int32

I realize this can be done with a loop but I imagine there is an easier way. Thanks.

9 Answers

71

np.unique has some optional return values.

return_inverse gives the integer encoding, which I use very often:

>>> b, c = np.unique(a, return_inverse=True)
>>> b
array(['a', 'b', 'c'], 
      dtype='|S1')
>>> c
array([0, 1, 2, 0, 1, 2])
>>> c+1
array([1, 2, 3, 1, 2, 3])

It can also be used to recreate the original array from the unique values:

>>> b[c]
array(['a', 'b', 'c', 'a', 'b', 'c'], 
      dtype='|S1')
>>> (b[c] == a).all()
True
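
Putting the pieces together for the exact output asked for in the question, here is a minimal sketch. The names `labels` and `codes` are mine, and the cast to int32 is only there to match the dtype shown in the question:

```python
import numpy as np

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# labels: sorted unique values; codes: index of each element of a in labels
labels, codes = np.unique(a, return_inverse=True)

# Shift to 1-based codes and cast to int32 as requested in the question
c = (codes + 1).astype(np.int32)

print(c, c.dtype)
```

Because the codes index into the sorted unique values, `labels[codes]` reproduces the original array.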

37

... years later....

For completeness (because this isn't mentioned in the answers) and personal reasons (I always have pandas imported in my modules but not necessarily sklearn), this is also quite straightforward with pandas.get_dummies()

import numpy as np
import pandas

In [1]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

In [2]: b = pandas.get_dummies(a)

In [3]: b
Out[3]: 
      a  b  c
   0  1  0  0
   1  0  1  0
   2  0  0  1
   3  1  0  0
   4  0  1  0
   5  0  0  1

In [4]: b.values.argmax(1)
Out[4]: array([0, 1, 2, 0, 1, 2])

1 Comment

Thanks. Finally found the answer which I'm looking for.
18

One way is to use the categorical function from scikits.statsmodels. For example:

In [60]: from scikits.statsmodels.tools import categorical

In [61]: a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])

In [62]: b = categorical(a, drop=True)

In [63]: b.argmax(1)
Out[63]: array([0, 1, 2, 0, 1, 2])

The return value from categorical (b) is actually a design matrix, hence the call to argmax above to get it close to your desired format.

In [64]: b
Out[64]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])


5

Another option is to use a categorical pandas Series:

>>> import pandas as pd
>>> pd.Series(['a', 'b', 'c', 'a', 'b', 'c'], dtype="category").cat.codes.values

array([0, 1, 2, 0, 1, 2], dtype=int8)


5

Another way is to use sklearn.preprocessing.LabelEncoder

It can convert hashable labels like strings to numerical values ranging between 0 and n_classes-1.

It is done like this:

# Repeating setup from the question to make example copy/paste-able
import numpy as np
a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
b = np.unique(a)

# Answer to the question
from sklearn import preprocessing
pre = preprocessing.LabelEncoder()
pre.fit(b)
c = pre.transform(a)

print(c)    # Prints [0 1 2 0 1 2]

If you insist on having the values start from 1 in the resulting array you could simply do c + 1 afterwards.

It might not be worth it to bring in sklearn as a dependency for a project only to do this, but it is a good option if you have sklearn already imported.

2 Comments

How can we know that 'a' is mapped to 0, and so on? Is there any code that returns this mapping?
@bib: LabelEncoder sorts the unique labels, and each label's code is its index in that sorted array, which the fitted encoder exposes as pre.classes_. So 'a' is 0 because it sorts first, not because it was seen first.
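
To see which label corresponds to which integer, one option is to build the lookup table explicitly. The sketch below uses only np.unique and assumes the codes are positions in the sorted array of unique labels; `label_to_code` is a name made up for this example. With a fitted LabelEncoder, the analogous array of labels is available as its `classes_` attribute.

```python
import numpy as np

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# np.unique returns the distinct labels in sorted order
labels = np.unique(a)

# Each label's code is its position in that sorted array
label_to_code = {str(label): code for code, label in enumerate(labels)}

print(label_to_code)
```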
3

Another approach is to use Pandas factorize to map items to a number:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
In [4]: a_enc = pd.factorize(a)
In [5]: a_enc[0]
Out[5]: array([0, 1, 2, 0, 1, 2])
In [6]: a_enc[1]
Out[6]: array(['a', 'b', 'c'], dtype=object)


1

Well, this is a hack... but does it help?

In [72]: c=(a.view(np.ubyte)-96).astype('int32')

In [73]: print(c,c.dtype)
(array([1, 2, 3, 1, 2, 3]), dtype('int32'))

1 Comment

You seriously want to add the caveat that this only works for length-1 strings.
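
To make that caveat concrete: the trick relies on every element occupying exactly one byte. The sketch below pins the dtype to 'S1' explicitly, which is my addition; on Python 3, NumPy stores str input in a multi-byte unicode dtype by default, so the byte view would no longer line up one-to-one with the elements.

```python
import numpy as np

# Byte strings of length 1: one byte per element, so the view lines up
a = np.array(['a', 'b', 'c', 'a', 'b', 'c'], dtype='S1')

# ord('a') == 97, so subtracting 96 maps 'a' -> 1, 'b' -> 2, ...
c = (a.view(np.ubyte) - 96).astype('int32')

print(c, c.dtype)
```

This only yields meaningful codes for single ASCII characters; for longer or arbitrary labels, an approach based on unique values is safer.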
1

...some more years pass...

Thought I would provide a pure python solution for completeness:

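def count_unique(a):
    # counter is redefined on every call to count_unique, so its mutable
    # default arguments (c and items) start out fresh each time.
    def counter(item, c=[0], items={}):
        if item not in items:
            items[item] = c[0]  # assign the next unused integer
            c[0] += 1
        return items[item]
    # list() is needed on Python 3, where map returns an iterator
    return list(map(counter, a))

a = [0, 2, 6, 0, 2]
print(count_unique(a))
>> [0, 1, 2, 0, 1]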
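
The same first-seen-order numbering can also be written with a plain dictionary and setdefault. A minimal sketch (the function name `encode_first_seen` is hypothetical):

```python
def encode_first_seen(items):
    """Return integer codes, numbering distinct items in order of first appearance."""
    codes = {}
    # setdefault stores len(codes) only when item is new; the argument is
    # evaluated before insertion, so a new item gets the next unused integer.
    return [codes.setdefault(item, len(codes)) for item in items]


print(encode_first_seen([0, 2, 6, 0, 2]))  # [0, 1, 2, 0, 1]
print(encode_first_seen(['a', 'b', 'c', 'a', 'b', 'c']))
```

Unlike the sorted encoding from np.unique, the codes here depend on the order in which items appear in the input.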


0

You can also try something like this:

a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
a[a == 'a'] = 1
a[a == 'b'] = 2
a[a == 'c'] = 3
a = a.astype(np.float32)

This works best if you already know which values are present and want to assign a specific index to each of them.

If there are only two categories, the following code is enough:

a = np.array( ['a', 'b', 'a', 'b'])
a = np.float32(a == 'a')
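
When the label-to-integer assignment is fixed by hand, an explicit dictionary avoids overwriting the string array in place and yields an integer array directly. A minimal sketch (the `mapping` dict and the name `codes` are illustrative, not part of the answer above):

```python
import numpy as np

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# Hypothetical mapping chosen by hand; any integers could be used here
mapping = {'a': 1, 'b': 2, 'c': 3}

# Look up each label; a KeyError signals a label missing from the mapping
codes = np.array([mapping[label] for label in a.tolist()], dtype=np.int32)

print(codes)
```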

