Slicing a Numpy array based on characters of string elements

Question

I'm having some problems with Numpy array slicing based on a boolean mask.

I can do the following masking successfully, where I select integers that are less than 10.

import numpy as np
L1 = [1, 2, 3, 10, 20, 4]
arr = np.array(L1)
mask = arr[:] < 10
print(mask) # [ True  True  True False False  True]
print(arr[mask]) # [1 2 3 4] <-- CORRECT

The same strategy also works for slicing an array of strings to match a specific string:

L2 = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']
arr = np.array(L2)
mask = arr[:] == 'foo'
print(mask) # [False False  True False False False]
print(arr[mask]) # ['foo'] <-- CORRECT

However, the slicing strategy does not work when checking a character of each string in the array. Here, I want to select strings in the array that start with the character 'a'.

L2 = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']
arr = np.array(L2)
mask = arr[:][0] == 'a'
print(mask) # False
print(arr[mask]) # [] <-- WRONG

How can I create that mask correctly?

hpaulj · Accepted Answer · 2020-11-24 17:29:56Z

1

In [192]: alist = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']
In [193]: arr = np.array(alist)

The straightforward list comprehension:

In [194]: [a[0]=='a' for a in alist]
Out[194]: [True, False, False, True, False, True]

It also works with the array, but slower (iteration on arrays is slower than on lists):

In [195]: timeit [a[0]=='a' for a in alist]
707 ns ± 5.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [196]: timeit [a[0]=='a' for a in arr]
4.88 µs ± 9.44 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

String startswith is also an option:

In [197]: [a.startswith('a') for a in alist]
Out[197]: [True, False, False, True, False, True]
In [198]: timeit [a.startswith('a') for a in alist]
1.14 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

NumPy doesn't have its own string processing tools. It has the np.char functions, but they just apply Python string methods to each element, without any speed improvement:

In [200]: np.char.startswith(arr, 'a')
Out[200]: array([ True, False, False,  True, False,  True])
In [201]: timeit np.char.startswith(arr, 'a')
12.5 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You get the best speed if you start and end with a list. Iterating over an array, or converting the boolean list back to an array, takes time.
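If you want to reproduce this comparison outside IPython, here is a rough sketch using the standard timeit module. The variable names (t_list, t_arr, t_char) and the repeat count are arbitrary choices of mine, and the absolute numbers will differ by machine and NumPy version, so treat it only as a way to check the relative ordering yourself.

```python
import timeit

import numpy as np

alist = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']
arr = np.array(alist)

repeats = 10_000  # arbitrary; raise it for steadier numbers

# List comprehension over the plain Python list.
t_list = timeit.timeit(lambda: [a[0] == 'a' for a in alist], number=repeats)

# Same comprehension, but iterating over the NumPy array.
t_arr = timeit.timeit(lambda: [a[0] == 'a' for a in arr], number=repeats)

# np.char.startswith applied to the whole array.
t_char = timeit.timeit(lambda: np.char.startswith(arr, 'a'), number=repeats)

for label, total in [('list', t_list), ('array', t_arr), ('np.char', t_char)]:
    # Report the average time per call in microseconds.
    print(f'{label:8s} {total / repeats * 1e6:.2f} µs per call')
```

With only six strings, fixed per-call overhead dominates, so the ranking may not carry over to much larger arrays.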

Rereading your code I see you want to select the items, not just create the mask. Then the list comprehension should be:

In [215]: [a for a in alist if a[0]=='a']
Out[215]: ['abc', 'az', 'ac']
In [216]: timeit [a for a in alist if a[0]=='a']
645 ns ± 3.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
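If you still need an array result at the end, the list comprehension can feed a boolean mask directly. This is only a sketch: it uses startswith rather than a[0] because indexing an empty string raises IndexError, and it passes dtype=bool so that an empty input still produces a boolean mask. The names mask and selected are illustrative.

```python
import numpy as np

alist = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']
arr = np.array(alist)

# Build the mask from the list (faster to iterate than the array),
# then convert it once so it can index the array.
mask = np.array([s.startswith('a') for s in alist], dtype=bool)

# Boolean-mask indexing keeps the elements where mask is True.
selected = arr[mask]
print(mask)
print(selected)

# startswith is safe on empty strings, whereas ''[0] would raise IndexError.
with_empty = ['', 'a', 'b']
safe_mask = np.array([s.startswith('a') for s in with_empty], dtype=bool)
```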

====

As for your failed code

[:] does nothing for you, in any of the expressions:

In [213]: arr[:]
Out[213]: array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'], dtype='<U3')

So you are just checking whether the first element of arr is 'a'. You aren't testing the first character of each element.

In [214]: arr[0]=='a'
Out[214]: False
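To make that concrete, here is a small sketch that walks through what the failed expression evaluates to, next to a per-element test. The intermediate names (first_element, scalar_mask, elementwise_mask) are mine, chosen only for illustration.

```python
import numpy as np

arr = np.array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'])

# arr[:] is just a view of the whole array, so arr[:][0] is arr[0]:
# a single string, not the first character of every string.
first_element = arr[:][0]

# Comparing that one string with 'a' gives one boolean, not a mask.
scalar_mask = first_element == 'a'

# Indexing with a single False selects nothing, hence the empty result.
empty_result = arr[scalar_mask]

# A per-element test yields one boolean per string instead.
elementwise_mask = np.array([s[0] == 'a' for s in arr.tolist()])
selected = arr[elementwise_mask]
print(selected)
```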

edited Nov 24, 2020 at 17:29

answered Nov 24, 2020 at 17:22

hpaulj


2 Comments

stackoverflowuser2010 Over a year ago

I think building a mask with a list comprehension is the most generalizable approach. Using a specialized function like np.char.startswith() would be difficult to remember.

stackoverflowuser2010 Over a year ago

Thanks for also explaining why arr[:][0] == 'a' didn't work in my code.

Chris · Accepted Answer · 2020-11-24 07:13:10Z

1

Use numpy.char.startswith:

arr[np.char.startswith(arr, "a")]

Output:

array(['abc', 'az', 'ac'], dtype='<U3')

Note that, by default, the match starts at the first index (i.e. 0). You can use the start parameter to test from a different position, which acts like indexing:

arr[np.char.startswith(arr, "a", 1)]

Output:

array(['bac', 'bar'], dtype='<U3')
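As a rough sketch of what the start argument does: np.char.startswith(arr, "a", 1) behaves like calling s.startswith("a", 1) on each element, i.e. it tests the substring that begins at index 1. The comparison against a plain Python loop below is my own addition, and so are the names second_is_a, per_element and starts_with_ba.

```python
import numpy as np

arr = np.array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'])

# start=1: does the part of each string from index 1 onward begin with 'a'?
second_is_a = np.char.startswith(arr, 'a', 1)

# The same test written with the plain str method, element by element.
per_element = [s.startswith('a', 1) for s in arr.tolist()]

# The prefix may be longer than one character.
starts_with_ba = np.char.startswith(arr, 'ba')

print(arr[second_is_a])
print(arr[starts_with_ba])
```

A string that is too short for the given start simply yields False rather than raising an error.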

answered Nov 24, 2020 at 7:13

Chris


Mayank Porwal · Accepted Answer · 2020-11-24 07:29:12Z

1

You can use a simple list comprehension:

In [3403]: L2 = ['abc', 'bac', 'foo', 'az', 'bar', 'ac']

In [3404]: arr = np.array(L2)  

In [3411]: res = np.array([i for i in arr if i.startswith('a')])

In [3412]: res
Out[3412]: array(['abc', 'az', 'ac'], dtype='<U3')

Or, if you want to use a mask, use np.char.startswith:

In [3415]: mask = np.char.startswith(arr, 'a')
In [3417]: print(arr[mask])
['abc' 'az' 'ac']

edited Nov 24, 2020 at 7:29

answered Nov 24, 2020 at 7:23

Mayank Porwal


1 Comment

stackoverflowuser2010 Over a year ago

Thanks. The list comprehension approach is the more intuitive one.

Mad Physicist · Accepted Answer · 2022-01-07 07:16:47Z

0

Starting in numpy 1.23.0, you will be able to slice strings in an array. The following will work, without making any copies:

L2 = np.array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'])
mask = L2[:, None].view('U1')[:, 0] == 'a'
print(mask)      # True, False, False, True, False, True
print(L2[mask])  # 'abc', 'az', 'ac'

I'm writing an even simpler method, built on this change, that will allow the following:

mask = np.char.slice_(L2, stop=1) == 'a'

Again, fully vectorized and without copying.
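For comparison, here is a sketch of the same idea written with a flat view plus a reshape, which should not depend on the 1.23 behaviour as long as the array is C-contiguous. It assumes a fixed-width unicode dtype such as '<U3' (not an object array), and the names width and chars are mine.

```python
import numpy as np

arr = np.array(['abc', 'bac', 'foo', 'az', 'bar', 'ac'])

# Number of characters stored per element ('<U3' holds 3 characters).
width = arr.dtype.itemsize // np.dtype('U1').itemsize

# Reinterpret the buffer as single characters, one row per original string.
# ascontiguousarray is a no-op here; it only copies if arr is not contiguous.
chars = np.ascontiguousarray(arr).view('U1').reshape(arr.size, width)

# Column 0 holds the first character of every string.
mask = chars[:, 0] == 'a'

print(mask)
print(arr[mask])
```

Strings shorter than the dtype width are padded with empty characters, so 'az' shows up as 'a', 'z', '' in its row.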

answered Jan 7, 2022 at 7:16

Mad Physicist
