numpy convert categorical string arrays to an integer array

Question

I'm trying to convert a string array of categorical variables to an integer array of categorical variables.

Ex.

import numpy as np
a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
print a.dtype
>>> |S1

b = np.unique(a)
print b
>>>  ['a' 'b' 'c']

c = a.desired_function(b)
print c, c.dtype
>>> [1,2,3,1,2,3] int32

I realize this can be done with a loop but I imagine there is an easier way. Thanks.

Josef · Accepted Answer · 2010-07-14 20:24:54Z

71

np.unique has some optional return values.

return_inverse gives the integer encoding, which I use very often:

>>> b, c = np.unique(a, return_inverse=True)
>>> b
array(['a', 'b', 'c'], 
      dtype='|S1')
>>> c
array([0, 1, 2, 0, 1, 2])
>>> c+1
array([1, 2, 3, 1, 2, 3])

It can also be used to recreate the original array from the unique values:

>>> b[c]
array(['a', 'b', 'c', 'a', 'b', 'c'], 
      dtype='|S1')
>>> (b[c] == a).all()
True
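
As a minimal sketch of how the return_inverse output maps onto the format asked for in the question (1-based codes with an int32 dtype), something like the following should work. The names categories, codes, and codes_1based are only illustrative.

```python
import numpy as np

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# categories: sorted unique values; codes: index of each element of `a` in categories
categories, codes = np.unique(a, return_inverse=True)

# Shift to 1-based labels and force the dtype asked for in the question
codes_1based = (codes + 1).astype(np.int32)

print(categories)      # ['a' 'b' 'c']
print(codes_1based)    # [1 2 3 1 2 3]

# Indexing the categories with the 0-based codes rebuilds the original array
assert (categories[codes] == a).all()
```

Note that the codes follow the sorted order of the unique values, not the order in which they first appear in the array.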

answered Jul 14, 2010 at 20:24

Josef

benjaminmgross · Answer · 2015-09-01 17:27:10Z

37

... years later....

For completeness (because this isn't mentioned in the answers) and personal reasons (I always have pandas imported in my modules but not necessarily sklearn), this is also quite straightforward with pandas.get_dummies()

import numpy as np
import pandas

In [1]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

In [2]: b = pandas.get_dummies(a)

In [3]: b
Out[3]: 
      a  b  c
   0  1  0  0
   1  0  1  0
   2  0  0  1
   3  1  0  0
   4  0  1  0
   5  0  0  1

In [4]: b.values.argmax(1)
Out[4]: array([0, 1, 2, 0, 1, 2])

answered Sep 1, 2015 at 17:27

benjaminmgross

1 Comment

SeeTheC Over a year ago

Thanks. Finally found the answer which I'm looking for.

unutbu · Answer · 2012-12-04 02:02:53Z

18

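One way is to use the categorical function from scikits.statsmodels (the package has since been renamed to statsmodels, and the function may not be available in recent releases). For example: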

In [60]: from scikits.statsmodels.tools import categorical

In [61]: a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])

In [62]: b = categorical(a, drop=True)

In [63]: b.argmax(1)
Out[63]: array([0, 1, 2, 0, 1, 2])

The return value from categorical (b) is actually a design matrix, hence the call to argmax above to get it close to your desired format.

In [64]: b
Out[64]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

edited Dec 4, 2012 at 2:02

unutbu

answered Jul 10, 2010 at 5:12

ars

Gregor Sturm · Answer · 2019-01-20 12:24:41Z

5

Another option is to use a categorical pandas Series:

>>> import pandas as pd
>>> pd.Series(['a', 'b', 'c', 'a', 'b', 'c'], dtype="category").cat.codes.values

array([0, 1, 2, 0, 1, 2], dtype=int8)

answered Jan 20, 2019 at 12:24

Gregor Sturm

Tim Skov Jacobsen · Answer · 2020-05-13 21:27:41Z

5

Another way is to use sklearn.preprocessing.LabelEncoder

It can convert hashable labels like strings to numerical values ranging between 0 and n_classes-1.

It is done like this:

# Repeating setup from the question to make example copy/paste-able
import numpy as np
a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
b = np.unique(a)

# Answer to the question
from sklearn import preprocessing
pre = preprocessing.LabelEncoder()
pre.fit(b)
c = pre.transform(a)

print(c)    # Prints [0 1 2 0 1 2]

If you insist on having the values start from 1 in the resulting array you could simply do c + 1 afterwards.

It might not be worth it to bring in sklearn as a dependency for a project only to do this, but it is a good option if you have sklearn already imported.

answered May 13, 2020 at 21:27

Tim Skov Jacobsen

2 Comments

bib Over a year ago

How can we know that 'a' is 0 and so on? Is there any code that returns that mapping?

Tim Skov Jacobsen Over a year ago

@bib: The encoder stores the sorted unique labels in pre.classes_, and each integer code is the index of the label in that array. So 'a' is 0 because it sorts first, not because it was the first string seen while traversing the array.
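
To my knowledge, LabelEncoder assigns codes by the sorted order of the labels rather than by order of first appearance. The sketch below mimics that ordering with plain NumPy (it is not scikit-learn's implementation) and builds the label-to-code mapping asked about; classes, codes, and mapping are illustrative names.

```python
import numpy as np

# 'c' appears first on purpose, to show that first-seen order does not matter
a = np.array(['c', 'a', 'b', 'c'])

classes = np.unique(a)                 # sorted unique labels
codes = np.searchsorted(classes, a)    # position of each label in the sorted array

# Explicit label -> code lookup table
mapping = {label: code for code, label in enumerate(classes.tolist())}

print(codes)      # [2 0 1 2]
print(mapping)    # {'a': 0, 'b': 1, 'c': 2}
```

With a fitted encoder, the same information should be available from its classes_ attribute, where the code of a label is its index in that array.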

tomp · Answer · 2016-05-09 18:08:41Z

3

Another approach is to use Pandas factorize to map items to a number:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
In [4]: a_enc = pd.factorize(a)
In [5]: a_enc[0]
Out[5]: array([0, 1, 2, 0, 1, 2])
In [6]: a_enc[1]
Out[6]: array(['a', 'b', 'c'], dtype=object)
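
One detail worth knowing: pd.factorize numbers the categories in order of first appearance unless sort=True is passed, which differs from np.unique when the input is not already sorted. If pandas is not available, a rough NumPy-only sketch of the same first-appearance ordering could look like this (all variable names are illustrative):

```python
import numpy as np

# Unsorted input, so first-appearance order differs from sorted order
a = np.array(['c', 'a', 'b', 'c', 'a'])

# uniques is sorted; first_idx holds where each unique value first occurs in `a`
uniques, first_idx, inverse = np.unique(a, return_index=True, return_inverse=True)

# order[k] = position in `uniques` of the k-th category to appear
order = np.argsort(first_idx)

# rank[j] = first-appearance code of uniques[j] (inverse permutation of order)
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

codes = rank[inverse]          # codes numbered by first appearance
categories = uniques[order]    # categories in first-appearance order

print(codes)         # [0 1 2 0 1]
print(categories)    # ['c' 'a' 'b']
```

As with the factorize output, categories[codes] reproduces the original array.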

answered May 9, 2016 at 18:08

tomp

unutbu · Answer · 2010-07-03 19:15:51Z

1

Well, this is a hack... but does it help?

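In [72]: # astype('S1') forces one byte per character (needed on Python 3, where NumPy strings are unicode)

In [73]: c = (a.astype('S1').view(np.ubyte) - 96).astype('int32')

In [74]: print(c, c.dtype)
[1 2 3 1 2 3] int32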

answered Jul 3, 2010 at 19:15

unutbu

1 Comment

smci Over a year ago

You seriously want to add the caveat that this only works for length-1 strings.

kezzos · Answer · 2017-09-26 09:54:10Z

1

...some more years pass...

Thought I would provide a pure python solution for completeness:

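def count_unique(a):
    # The mutable defaults hold the running counter and the item -> code table.
    # They are created anew each time count_unique is called.
    def counter(item, c=[0], items={}):
        if item not in items:
            items[item] = c[0]
            c[0] += 1
        return items[item]
    # list() is needed on Python 3, where map returns a lazy iterator
    return list(map(counter, a))

a = [0, 2, 6, 0, 2]
print(count_unique(a))
>> [0, 1, 2, 0, 1]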

edited Sep 26, 2017 at 9:54

answered Sep 21, 2017 at 11:27

kezzos

myquertykeyboard · Answer · 2021-07-13 07:19:40Z

0

You can also try something like this:

a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
a[a == 'a'] = 1
a[a == 'b'] = 2
a[a == 'c'] = 3
a = a.astype(np.float32)

This works best if you know which values are in the array and want to assign a specific index to each of them.

If there are only two categories, the following code works like a charm:

a = np.array(['a', 'b', 'a', 'b'])
a = np.float32(a == 'a')
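
When specific codes are wanted for each category, an alternative sketch is to keep the mapping in a dict and build a new integer array, rather than overwriting the string array in place (where a replacement longer than the string width, such as 10 in a width-1 array, may get truncated). The mapping dict and its values here are an assumed example, not something from the question.

```python
import numpy as np

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# Hypothetical user-chosen codes for each category
mapping = {'a': 1, 'b': 2, 'c': 3}

# Look up each element; a KeyError flags any value missing from the mapping
codes = np.array([mapping[x] for x in a.tolist()], dtype=np.int32)

print(codes, codes.dtype)    # [1 2 3 1 2 3] int32
```

This leaves the original array untouched and produces an integer dtype directly, at the cost of a Python-level loop over the elements.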

edited Jul 13, 2021 at 7:19

answered Jul 13, 2021 at 6:08

myquertykeyboard
