I'm having some problems slicing a NumPy array with a boolean mask.

I can do the following masking successfully, where I select integers that are less than 10.

import numpy as np

L1 = [1, 2, 3, 10, 20, 4]
arr = np.array(L1)
mask = arr[:] < 10
print(mask) # [ True  True  True False False  True]
print(arr[mask]) # [1 2 3 4] <-- CORRECT

The same strategy also works for slicing an array of strings to match a specific string:

L2 = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']
arr = np.array(L2)
mask = arr[:] == 'foo'
print(mask) # [False False  True False False False]
print(arr[mask]) # ['foo'] <-- CORRECT

However, the slicing strategy does not work when checking a character of each string in the array. Here, I want to select strings in the array that start with the character 'a'.

L2 = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']
arr = np.array(L2)
mask = arr[:][0] == 'a'
print(mask) # False
print(arr[mask]) # [] <-- WRONG

How can I create that mask correctly?

4 Answers

1
In [192]: alist = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']
In [193]: arr = np.array(alist)

The straightforward list comprehension:

In [194]: [a[0]=='a' for a in alist]
Out[194]: [True, False, False, True, False, True]

It also works with the array, but slower (iteration on arrays is slower than on lists):

In [195]: timeit [a[0]=='a' for a in alist]
707 ns ± 5.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [196]: timeit [a[0]=='a' for a in arr]
4.88 µs ± 9.44 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

String startswith is also an option:

In [197]: [a.startswith('a') for a in alist]
Out[197]: [True, False, False, True, False, True]
In [198]: timeit [a.startswith('a') for a in alist]
1.14 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

NumPy doesn't have its own string processing tools. It has the np.char functions, but they just apply Python string methods, without any speed improvement:

In [200]: np.char.startswith(arr, 'a')
Out[200]: array([ True, False, False,  True, False,  True])
In [201]: timeit np.char.startswith(arr, 'a')
12.5 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You get the best speed if you start and end with a list. Iterating over an array, or converting the boolean list back to an array, takes time.

Rereading your code I see you want to select the items, not just create the mask. Then the list comprehension should be:

In [215]: [a for a in alist if a[0]=='a']
Out[215]: ['abc', 'az', 'ac']
In [216]: timeit [a for a in alist if a[0]=='a']
645 ns ± 3.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
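If you do want to keep the result as an array, here is a minimal sketch of how a list-comprehension mask can be fed back into boolean indexing. The names mask and selected are only illustrative, and startswith is used instead of a[0] on the assumption that empty strings might occur (a[0] would raise an IndexError on them):

```python
import numpy as np

alist = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']
arr = np.array(alist)

# Build the boolean mask in plain Python, one test per string.
mask = np.array([s.startswith('a') for s in alist])

# Boolean indexing keeps the elements where the mask is True.
selected = arr[mask]

print(mask)
print(selected)
```

The mask has to have the same length as arr, which is guaranteed here because both are built from alist.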

====

As for your failed code

[:] does nothing for you, in any of the expressions:

In [213]: arr[:]
Out[213]: array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'], dtype='<U3')

So you are just checking whether the first element of arr is 'a'. You aren't testing the first character of each element.

In [214]: arr[0]=='a'
Out[214]: False
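
To make the difference concrete, this short sketch (using the same array as above; the variable names are illustrative) contrasts what arr[:][0] actually selects with an elementwise comparison:

```python
import numpy as np

arr = np.array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'])

# arr[:] is just a view of the whole array, so [0] picks its first element.
first = arr[:][0]
print(first)           # abc
print(first == 'a')    # False: one string compared with 'a'

# By contrast, comparing the array itself is elementwise and gives a mask.
elementwise = arr == 'abc'
print(elementwise.shape)   # (6,)
```

A single False is not a per-element mask, which is why indexing with it returns nothing.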

2 Comments

I think building a mask with a list comprehension is the most generalizable approach. Using a specialized function like np.char.startswith() would be difficult to remember.
Thanks for also explaining why arr[:][0] == 'a' didn't work in my code.
1

Use numpy.char.startswith:

arr[np.char.startswith(arr, "a")]

Output:

array(['abc', 'az', 'ac'], dtype='<U3')

Note that, by default, this starts matching at the first index (i.e. 0). You can use the start parameter to begin matching at a different index:

arr[np.char.startswith(arr, "a", 1)]

Output:

array(['bac', 'bar'], dtype='<U3')
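
As a rough check of what start does, the sketch below compares np.char.startswith with Python's own str.startswith(prefix, start) on the same strings. The names np_mask and py_mask are only for illustration:

```python
import numpy as np

arr = np.array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'])

# Match 'a' beginning at index 1 of every string.
np_mask = np.char.startswith(arr, 'a', 1)

# The same test with the plain Python string method.
py_mask = [s.startswith('a', 1) for s in arr.tolist()]

print(np_mask.tolist() == py_mask)
print(arr[np_mask])
```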

1

You can use a simple list comprehension:

In [3403]: L2 = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']

In [3404]: arr = np.array(L2)  

In [3411]: res = np.array([i for i in arr if i.startswith('a')])

In [3412]: res
Out[3412]: array(['abc', 'az', 'ac'], dtype='<U3')

Or, if you want to use a mask, use np.char.startswith:

In [3415]: mask = np.char.startswith(arr, 'a')
In [3417]: print(arr[mask])
['abc' 'az' 'ac']

1 Comment

Thanks. The list comprehension approach is the more intuitive one.
0

Starting in NumPy 1.23.0, you can slice strings in an array. The following works without making any copies:

L2 = np.array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'])
mask = L2[:, None].view('U1')[:, 0] == 'a'
print(mask)     # [ True False False  True False  True]
print(L2[mask]) # ['abc' 'az' 'ac']

I'm writing an even simpler method, built on this change, that would allow the following:

mask = np.char.slice_(L2, stop=1) == 'a'

Again, fully vectorized and without copying.
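
For readers wondering what the view trick produces, here is a small sketch under the assumption that the array has a fixed-width unicode dtype (here '<U3'). The name chars is only illustrative:

```python
import numpy as np

arr = np.array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'])  # dtype '<U3'

# Reinterpret each 3-character string as a row of 3 one-character strings.
chars = arr[:, None].view('U1')

print(chars.shape)                   # (6, 3)
print(np.shares_memory(arr, chars))  # True: a view, not a copy

# Column 0 holds the first character of every string.
mask = chars[:, 0] == 'a'
print(arr[mask])
```

Strings shorter than the dtype width are padded, so the row for 'az' ends with an empty string.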
