I have a list of numbers. I want to create a boolean mask of this list (or array, it doesn't matter) for every unique element of the list.

In the example below, I want to create four masks of length len(labels). The first mask has True at position i if labels[i]==0, the second has True at position i if labels[i]==1, and so on.

I tried to do this with pandas and the .isin method in a loop. However, this is too slow for my purpose: it is called many times in my algorithm, and the list of labels can be very long, so the loop is inefficient. How can I make this faster?

import pandas as pd

labels = [0,0,1,1,3,3,3,1,2,1,0,0]
d = dict()
y = pd.Series(labels)
# Build one boolean mask per unique label
for i in set(labels):
    d[i] = y.isin([i])

3 Answers

4

Method 1

Using list and set

In [989]: {x: [x==l for l in labels] for x in set(labels)}
Out[989]:
{0: [True, True, False, False, False, False, False, False, False, False, True, True],
 1: [False, False, True, True, False, False, False, True, False, True, False, False],
 2: [False, False, False, False, False, False, False, False, True, False, False, False],
 3: [False, False, False, False, True, True, True, False, False, False, False, False]}

If you want it as dataframe

In [994]: pd.DataFrame({x: [x==l for l in labels] for x in set(labels)})
Out[994]:
        0      1      2      3
0    True  False  False  False
1    True  False  False  False
2   False   True  False  False
3   False   True  False  False
4   False  False  False   True
5   False  False  False   True
6   False  False  False   True
7   False   True  False  False
8   False  False   True  False
9   False   True  False  False
10   True  False  False  False
11   True  False  False  False
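
If NumPy is available, the same dictionary can be built without looping over the elements in Python. This is a minimal sketch and not part of the original answer: it swaps the list comprehension for np.unique plus a broadcast comparison, and the name mask_by_label is made up for illustration.

```python
import numpy as np

labels = [0, 0, 1, 1, 3, 3, 3, 1, 2, 1, 0, 0]
arr = np.asarray(labels)

# Sorted unique labels, shape (n_unique,)
uniques = np.unique(arr)

# Broadcasting a column of unique labels against the row of labels
# gives one boolean row per unique label, shape (n_unique, n)
masks = uniques[:, None] == arr

# Hypothetical name: dict keyed by label, like the Method 1 result
mask_by_label = {int(u): m for u, m in zip(uniques, masks)}
```

Each value is a view into one row of masks, so the memory cost is a single (n_unique, n) boolean array.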

Method 2

Using pd.get_dummies, if you already have a Series:

In [997]: pd.get_dummies(y).astype(bool)
Out[997]:
        0      1      2      3
0    True  False  False  False
1    True  False  False  False
2   False   True  False  False
3   False   True  False  False
4   False  False  False   True
5   False  False  False   True
6   False  False  False   True
7   False   True  False  False
8   False  False   True  False
9   False   True  False  False
10   True  False  False  False
11   True  False  False  False
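
If a dict of masks keyed by label is needed, as in the question, the columns of the get_dummies result can be pulled out one by one. A small sketch under that assumption; the dict comprehension is an illustration, not something the answer prescribes.

```python
import pandas as pd

labels = [0, 0, 1, 1, 3, 3, 3, 1, 2, 1, 0, 0]
y = pd.Series(labels)

# One boolean column per unique label
dummies = pd.get_dummies(y).astype(bool)

# Column names are the labels themselves, so they can serve as dict keys
d = {label: dummies[label].to_numpy() for label in dummies.columns}
```

Because the column names are the original labels, this keeps the label-to-mask mapping even when the labels are not consecutive integers.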

Benchmarks

Small

In [1002]: len(labels)
Out[1002]: 12

In [1003]: %timeit pd.get_dummies(y).astype(bool)
1000 loops, best of 3: 476 µs per loop

In [1004]: %timeit pd.DataFrame({x: [x==l for l in labels] for x in set(labels)})
1000 loops, best of 3: 580 µs per loop

In [1005]: %timeit pd.DataFrame({x : (y == x) for x in y.unique()})
1000 loops, best of 3: 1.15 ms per loop

Large

In [1011]: len(labels)
Out[1011]: 12000

In [1012]: %timeit pd.get_dummies(y).astype(bool)
1000 loops, best of 3: 875 µs per loop

In [1013]: %timeit pd.DataFrame({x: [x==l for l in labels] for x in set(labels)})
100 loops, best of 3: 4.97 ms per loop

In [1014]: %timeit pd.DataFrame({x : (y == x) for x in y.unique()})
1000 loops, best of 3: 1.32 ms per loop
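
To repeat the comparison outside IPython, something like the following could be used. It is only a rough sketch: the random input (12000 labels drawn from 0 to 3) and the repeat counts are assumptions, and the absolute numbers will differ from those above depending on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 12000 random labels in {0, 1, 2, 3}
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=12_000).tolist()
y = pd.Series(labels)

candidates = {
    "get_dummies": lambda: pd.get_dummies(y).astype(bool),
    "list comprehension": lambda: pd.DataFrame(
        {x: [x == l for l in labels] for x in set(labels)}
    ),
    "Series comparison": lambda: pd.DataFrame({x: (y == x) for x in y.unique()}),
}

number = 20  # calls per timing run
timings = {}
for name, func in candidates.items():
    # Best of 3 runs, converted to seconds per call
    timings[name] = min(timeit.repeat(func, number=number, repeat=3)) / number
    print(f"{name}: {timings[name] * 1e6:.0f} µs per call")
```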

1 Comment

Is this way of writing a for loop faster than my way?
1

You could use statsmodels.tools.tools.categorical, which ought to be rather fast, especially if you already have a NumPy array to work with.

categorical(np.array(labels), drop=True).astype(bool)

If you want an explicit mapping between each column in the resulting array and its respective label, pass dictnames=True to categorical.

Demo

>>> import numpy as np
>>> from statsmodels.tools.tools import categorical
>>> labels = np.array([0,0,1,1,3,3,3,1,2,1,0,0])
>>> categorical(labels, drop=True).astype(bool)
array([[ True, False, False, False],
       [ True, False, False, False],
       [False,  True, False, False],
       [False,  True, False, False],
       [False, False, False,  True],
       [False, False, False,  True],
       [False, False, False,  True],
       [False,  True, False, False],
       [False, False,  True, False],
       [False,  True, False, False],
       [ True, False, False, False],
       [ True, False, False, False]], dtype=bool)

>>> res, d = categorical(np.array(labels), drop=True, dictnames=True)
>>> d
{0: 0, 1: 1, 2: 2, 3: 3}
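
For comparison, a similar one-hot array with an explicit column-to-label mapping can be sketched in plain NumPy. This is an alternative technique, not what categorical does internally; the input with non-consecutive labels and the name column_of are assumptions chosen for illustration.

```python
import numpy as np

# Assumed input with gaps between the label values
labels = np.array([0, 0, 2, 3, 2, 12])

# uniques: sorted unique labels; inverse: column index of each element
uniques, inverse = np.unique(labels, return_inverse=True)

# Row i is True only in the column belonging to labels[i]
masks = inverse[:, None] == np.arange(len(uniques))

# Hypothetical helper mapping: label -> column index in masks
column_of = {int(label): col for col, label in enumerate(uniques)}

mask_for_2 = masks[:, column_of[2]]
```

Here column_of plays the role of the inverse of the dictnames mapping: it looks up the column for a given label.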

Rough benchmark (presuming already NumPy array)

Your dataset:

>>> %timeit categorical(labels, drop=True).astype(bool)
14.1 µs ± 519 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Larger dataset: labels = np.random.randint(0, 4, 10000)

>>> %timeit categorical(labels, drop=True).astype(bool)
360 µs ± 9.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

4 Comments

This does not seem to yield the desired output format, especially if the labels are not consecutive integers, e.g. labels = [1,1,0,1,5,6,12,12,4,5].
@Merlin1896 Could you please outline what is wrong with the output format in that case? To me it looks as I would expect.
With labels = np.array([0,0,2,3,2,12]) the output of a=categorical(labels, drop=True).astype(bool) does not give me a reference to the original label. a[:,0] is the desired output for label 0, but a[:,1] is the output for the label 2.
@Merlin1896 Please see my edit, I realized there is a dictnames param to categorical that can help you get the mapping I think you want. If you want the inverse mapping, just use {v: k for k, v in d.items()}.
0

Create an array of False values. Iterate through a groupby to get the index locations of each label and set these to True.

import numpy as np
import pandas as pd

d = {}
empty_labels = np.zeros(len(labels), dtype=bool)
for label, group in pd.DataFrame(labels, columns=['labels']).groupby('labels'):
    d[label] = empty_labels.copy()
    # group.index holds the row positions of this label (default RangeIndex)
    d[label][group.index.values] = True
>>> d
{0: array([ True,  True, False, False, False, False, False, False, False,
        False,  True,  True], dtype=bool),
 1: array([False, False,  True,  True, False, False, False,  True, False,
         True, False, False], dtype=bool),
 2: array([False, False, False, False, False, False, False, False,  True,
        False, False, False], dtype=bool),
 3: array([False, False, False, False,  True,  True,  True, False, False,
        False, False, False], dtype=bool)}

Speed should be on par with pd.get_dummies.
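
A variant of the same idea, sketched here as an assumption rather than as the answer's own code, reads the positions from the groupby's indices attribute. That attribute maps each group key to positional row indices, so the masks come out right even when the Series does not have a default integer index.

```python
import numpy as np
import pandas as pd

labels = [0, 0, 1, 1, 3, 3, 3, 1, 2, 1, 0, 0]
y = pd.Series(labels)

d = {}
# .indices maps each label to the integer positions where it occurs
for label, positions in y.groupby(y).indices.items():
    mask = np.zeros(len(y), dtype=bool)
    mask[positions] = True
    d[label] = mask
```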
