Create boolean mask for every unique element in array

Question

I have a list with numbers in it. I want to create a bool mask of this list (or array, doesn't matter) for every unique element of this list.

In the example below, I want to create four masks of length len(labels). The first mask has True at position i, if labels[i]==0, the second one has True at position i, if labels[i]==1 etc.

I tried to do this with pandas and the .isin method in a loop. However, this is too slow for my purpose: it is called many times in my algorithm, and the list of labels can be very long, so the loop is inefficient. How can I make this faster?

import pandas as pd

labels = [0,0,1,1,3,3,3,1,2,1,0,0]
d = dict()
y = pd.Series(labels)
for i in set(labels):
    d[i] = y.isin([i])

Zero · Accepted Answer · 2017-08-08 08:30:28Z

Method 1

Using list and set

In [989]: {x: [x==l for l in labels] for x in set(labels)}
Out[989]:
{0: [True, True, False, False, False, False, False, False, False, False, True, True],
 1: [False, False, True, True, False, False, False, True, False, True, False, False],
 2: [False, False, False, False, False, False, False, False, True, False, False, False],
 3: [False, False, False, False, True, True, True, False, False, False, False, False]}

If you want it as dataframe

In [994]: pd.DataFrame({x: [x==l for l in labels] for x in set(labels)})
Out[994]:
        0      1      2      3
0    True  False  False  False
1    True  False  False  False
2   False   True  False  False
3   False   True  False  False
4   False  False  False   True
5   False  False  False   True
6   False  False  False   True
7   False   True  False  False
8   False  False   True  False
9   False   True  False  False
10   True  False  False  False
11   True  False  False  False

Method 2

Using pd.get_dummies, if you already have a Series anyway:

In [997]: pd.get_dummies(y).astype(bool)
Out[997]:
        0      1      2      3
0    True  False  False  False
1    True  False  False  False
2   False   True  False  False
3   False   True  False  False
4   False  False  False   True
5   False  False  False   True
6   False  False  False   True
7   False   True  False  False
8   False  False   True  False
9   False   True  False  False
10   True  False  False  False
11   True  False  False  False
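
If you need one mask per label (as in the question's dict) rather than a DataFrame, the columns can be pulled out afterwards. A minimal sketch, assuming the same labels as in the question; the name masks is only illustrative:

```python
import pandas as pd

labels = [0, 0, 1, 1, 3, 3, 3, 1, 2, 1, 0, 0]
y = pd.Series(labels)

# One boolean column per unique label; the column names are the labels themselves.
dummies = pd.get_dummies(y).astype(bool)

# Turn the columns into a dict of 1-D NumPy masks keyed by label
# ("masks" is a hypothetical name, not part of the original answer).
masks = {label: dummies[label].values for label in dummies.columns}

print(masks[3].tolist())
```

Each value is a plain boolean array of length len(labels), so it can be used directly for indexing.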

Benchmarks

Small

In [1002]: len(labels)
Out[1002]: 12

In [1003]: %timeit pd.get_dummies(y).astype(bool)
1000 loops, best of 3: 476 µs per loop

In [1004]: %timeit pd.DataFrame({x: [x==l for l in labels] for x in set(labels)})
1000 loops, best of 3: 580 µs per loop

In [1005]: %timeit pd.DataFrame({x : (y == x) for x in y.unique()})
1000 loops, best of 3: 1.15 ms per loop

Large

In [1011]: len(labels)
Out[1011]: 12000

In [1012]: %timeit pd.get_dummies(y).astype(bool)
1000 loops, best of 3: 875 µs per loop

In [1013]: %timeit pd.DataFrame({x: [x==l for l in labels] for x in set(labels)})
100 loops, best of 3: 4.97 ms per loop

In [1014]: %timeit pd.DataFrame({x : (y == x) for x in y.unique()})
1000 loops, best of 3: 1.32 ms per loop
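
To rerun the comparison outside IPython, something like the following should work. It is only a sketch: the random labels, the seed, and the repeat count are assumptions, not the data used for the timings above, so the absolute numbers will differ.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: random labels in 0..3, sized like the "Large" case.
rng = np.random.RandomState(0)
labels = rng.randint(0, 4, 12000).tolist()
y = pd.Series(labels)

candidates = {
    "get_dummies": lambda: pd.get_dummies(y).astype(bool),
    "list comprehension": lambda: pd.DataFrame(
        {x: [x == l for l in labels] for x in set(labels)}
    ),
    "Series comparison": lambda: pd.DataFrame({x: (y == x) for x in y.unique()}),
}

reference = candidates["get_dummies"]()
for name, fn in candidates.items():
    result = fn()
    # Column order may differ between approaches, so align on the reference columns.
    assert (result[reference.columns].values == reference.values).all()
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

The assertion checks that all three approaches produce the same masks before timing them.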

miradulo · Accepted Answer · 2017-08-08 11:25:34Z

You could use statsmodels.tools.tools.categorical, which ought to be rather fast, especially if you already have a NumPy array to work with.

categorical(np.array(labels), drop=True).astype(bool)

If you want an explicit mapping between each column in the resulting array and its respective label, pass dictnames=True to categorical.

Demo

>>> import numpy as np
>>> from statsmodels.tools.tools import categorical
>>> labels = np.array([0,0,1,1,3,3,3,1,2,1,0,0])
>>> categorical(labels, drop=True).astype(bool)
array([[ True, False, False, False],
       [ True, False, False, False],
       [False,  True, False, False],
       [False,  True, False, False],
       [False, False, False,  True],
       [False, False, False,  True],
       [False, False, False,  True],
       [False,  True, False, False],
       [False, False,  True, False],
       [False,  True, False, False],
       [ True, False, False, False],
       [ True, False, False, False]], dtype=bool)

>>> res, d = categorical(np.array(labels), drop=True, dictnames=True)
>>> d
{0: 0, 1: 1, 2: 2, 3: 3}
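
If you would rather not depend on statsmodels, a similar result with an explicit label mapping can be sketched in plain NumPy. This is a different technique (np.unique plus broadcasting), not what categorical does internally, and the names uniques, mask_matrix and masks are only illustrative:

```python
import numpy as np

# Non-consecutive labels, to show that the mapping stays explicit.
labels = np.array([0, 0, 2, 3, 2, 12])

# Sorted unique labels; column j of the mask matrix belongs to uniques[j].
uniques = np.unique(labels)

# Broadcasting compares every element with every unique label,
# giving a boolean array of shape (len(labels), len(uniques)).
mask_matrix = labels[:, None] == uniques[None, :]

# Explicit label -> mask mapping.
masks = {int(u): mask_matrix[:, j] for j, u in enumerate(uniques)}

print(masks[2].tolist())  # [False, False, True, False, True, False]
```

Note that the intermediate matrix needs len(labels) * len(uniques) booleans, so this is best suited to a modest number of distinct labels.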

Rough benchmark (presuming labels is already a NumPy array)

Your dataset:

>>> %timeit categorical(labels, drop=True).astype(bool)
14.1 µs ± 519 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Larger dataset: labels = np.random.randint(0, 4, 10000)

>>> %timeit categorical(labels, drop=True).astype(bool)
360 µs ± 9.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

edited Aug 8, 2017 at 11:25

answered Aug 8, 2017 at 8:52


4 Comments

Merlin1896 Over a year ago

This does not seem to yield the desired output format, especially if the labels are not consecutive integers, e.g. labels = [1,1,0,1,5,6,12,12,4,5].

miradulo Over a year ago

@Merlin1896 Could you please outline what is wrong with the output format in that case? To me it looks as I would expect.

Merlin1896 Over a year ago

With labels = np.array([0,0,2,3,2,12]) the output of a=categorical(labels, drop=True).astype(bool) does not give me a reference to the original label. a[:,0] is the desired output for label 0, but a[:,1] is the output for the label 2.

miradulo Over a year ago

@Merlin1896 Please see my edit, I realized there is a dictnames param to categorical that can help you get the mapping I think you want. If you want the inverse mapping, just use {v: k for k, v in d.items()}.

Alexander · Accepted Answer · 2017-08-08 08:42:01Z

Create an array of False values. Iterate through a groupby to get the index locations of the labels and set these to True.

import numpy as np
import pandas as pd

d = {}
empty_labels = np.zeros(len(labels), dtype=bool)
for label, group in pd.DataFrame(labels, columns=['labels']).groupby('labels'):
    d[label] = empty_labels.copy()
    # group.index holds the row positions of this label (the default RangeIndex)
    d[label][group.index.values] = True
>>> d
{0: array([ True,  True, False, False, False, False, False, False, False,
        False,  True,  True], dtype=bool),
 1: array([False, False,  True,  True, False, False, False,  True, False,
         True, False, False], dtype=bool),
 2: array([False, False, False, False, False, False, False, False,  True,
        False, False, False], dtype=bool),
 3: array([False, False, False, False,  True,  True,  True, False, False,
        False, False, False], dtype=bool)}

Speed should be on par with pd.get_dummies.
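
The groupby idea can also be sketched with positional indices, which does not rely on the DataFrame having a default index. This is only an illustration using the GroupBy.indices attribute, with the question's labels; masks is a hypothetical name:

```python
import numpy as np
import pandas as pd

labels = [0, 0, 1, 1, 3, 3, 3, 1, 2, 1, 0, 0]
df = pd.DataFrame(labels, columns=['labels'])

masks = {}
# GroupBy.indices maps each label to the integer positions of its rows.
for label, positions in df.groupby('labels').indices.items():
    mask = np.zeros(len(df), dtype=bool)
    mask[positions] = True
    masks[label] = mask

print(masks[1].tolist())
```

Every position belongs to exactly one label, so the masks are disjoint and together cover the whole list.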
