How can I slice each element of a numpy array of strings?

Question

Numpy has some very useful string operations, which vectorize the usual Python string operations.

Compared to these operations and to pandas.str, the numpy strings module seems to be missing a very important one: the ability to slice each string in the array. For example,

a = numpy.array(['hello', 'how', 'are', 'you'])
numpy.char.sliceStr(a, slice(1, 3))
>>> numpy.array(['el', 'ow', 're', 'ou'])

Am I missing some obvious method in the module with this functionality? Otherwise, is there a fast vectorized way to achieve this?

I am not sure what the question is. Who is missing this feature?
– Ma0, Commented Aug 19, 2016 at 15:03
NumPy's routines.char seems to be missing it. I edited the question to make this clearer.
– Martín Fixman, Commented Aug 19, 2016 at 15:09
I've looked for this functionality as well. I think I've always ended up using some kind of loop.
– farenorth, Commented Aug 19, 2016 at 15:32
And how about variable slicing? E.g., the vectorized equivalent of [a[s:e] for a, s, e in zip(a_array, s_array, e_array)]? I was hoping that the numpy.strings operations would provide something like that. I am trying to remove prefixes from strings, and numpy.strings.startswith(a, b) is so tantalizingly close: I'd just need to clip those prefixes in b when the strings in a start with them...
– Pierre D, Commented May 22 at 2:45

John Zwinck · Accepted Answer · 2018-06-30 01:01:09Z

19

Here's a vectorized approach -

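import numpy as np

def slicer_vectorized(a, start, end):
    b = a.view((str, 1)).reshape(len(a), -1)[:, start:end]
    # tobytes()/frombuffer replace the deprecated tostring()/fromstring(); the result is read-only
    return np.frombuffer(b.tobytes(), dtype=(str, end - start))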

Sample run -

In [68]: a = np.array(['hello', 'how', 'are', 'you'])

In [69]: slicer_vectorized(a,1,3)
Out[69]: 
array(['el', 'ow', 're', 'ou'], 
      dtype='|S2')

In [70]: slicer_vectorized(a,0,3)
Out[70]: 
array(['hel', 'how', 'are', 'you'], 
      dtype='|S3')
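
One caveat with the function above: it assumes start and end are non-negative and fall inside the array's fixed string width, so a call such as slicer_vectorized(a, 1, 8) on a five-character-wide array is likely to fail. A minimal sketch of a more defensive variant follows. The name slice_fixed_width is hypothetical, and it swaps the byte round trip for a second .view() on a contiguous copy, as suggested in the comments. It assumes a one-dimensional Unicode ('U') array and clamps indices against the padded width, not each string's own length, so negative indices do not behave like Python's per-string negatives.

```python
import numpy as np

def slice_fixed_width(a, start, stop):
    """Slice every string in a 1-D Unicode array to [start:stop] (hypothetical helper)."""
    a = np.ascontiguousarray(a)
    if a.dtype.kind != "U" or a.ndim != 1:
        raise TypeError("expected a one-dimensional array of dtype 'U'")
    width = a.dtype.itemsize // 4          # NumPy stores 'U' data as 4 bytes per character
    start, stop, _ = slice(start, stop).indices(width)   # clamp to the padded width
    n = stop - start
    if n <= 0:
        return np.zeros(a.shape, dtype="U1")   # every slice is empty
    chars = a.view("U1").reshape(len(a), width)[:, start:stop]
    # The column slice is not contiguous, so copy before reinterpreting as 'U<n>'
    return np.ascontiguousarray(chars).view(f"U{n}").ravel()

a = np.array(['hello', 'how', 'are', 'you'])
print(slice_fixed_width(a, 1, 3))   # ['el' 'ow' 're' 'ou']
print(slice_fixed_width(a, 1, 8))   # ['ello' 'ow' 're' 'ou']
```

Unlike the frombuffer version, the returned array is writable because it is built from a fresh copy.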

Runtime test -

I tested all the approaches posted by other authors that I could run on my machine, along with the vectorized approach from earlier in this post.

Here are the timings -

In [53]: # Setup input array
    ...: a = np.array(['hello', 'how', 'are', 'you'])
    ...: a = np.repeat(a,10000)
    ...: 

# @Alberto Garcia-Raboso's answer
In [54]: %timeit slicer(1, 3)(a)
10 loops, best of 3: 23.5 ms per loop

# @hpaulj's answer
In [55]: %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
100 loops, best of 3: 11.6 ms per loop

# Using loop-comprehension
In [56]: %timeit np.array([i[1:3] for i in a])
100 loops, best of 3: 12.1 ms per loop

# From this post
In [57]: %timeit slicer_vectorized(a,1,3)
1000 loops, best of 3: 787 µs per loop
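
The %timeit lines above need IPython. To reproduce the comparison from a plain script, something like the following sketch with the standard-library timeit module should work. The function names are made up for illustration, the character-view variant uses a contiguous copy plus .view() instead of the byte round trip, and absolute timings will differ by machine and NumPy version.

```python
import timeit
import numpy as np

# Same setup as the IPython session: each word repeated 10000 times
a = np.repeat(np.array(['hello', 'how', 'are', 'you']), 10000)

def with_comprehension(arr):
    return np.array([s[1:3] for s in arr])

def with_frompyfunc(arr):
    return np.frompyfunc(lambda s: s[1:3], 1, 1)(arr).astype('U2')

def with_char_view(arr):
    chars = arr.view('U1').reshape(len(arr), -1)[:, 1:3]
    return np.ascontiguousarray(chars).view('U2').ravel()

candidates = [with_comprehension, with_frompyfunc, with_char_view]
results = {}
for fn in candidates:
    results[fn.__name__] = fn(a)   # keep one result per approach to compare them
    # Best of 3 runs, 5 calls per run
    best = min(timeit.repeat(lambda: fn(a), number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Checking that all entries of results are equal before reading the timings guards against benchmarking a variant that silently returns something different.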

edited Jun 30, 2018 at 1:01

John Zwinck

answered Aug 19, 2016 at 18:09

Divakar

5 Comments

Bart Over a year ago

Was this tested with Python 2.x or 3.x? With 3.5.2 I get array([b'', b'', b'', b''], dtype='|S2') as output?

Divakar Over a year ago

@Bart This was with Python 2.7. Will see if the issue is there with Python 3.x. Thanks for notifying.

John Zwinck Over a year ago

@Bart: I edited the code in this answer slightly to make it compatible with Python 3. The problem was the use of 'S' as the dtype: Python 3 strings are Unicode, which NumPy stores as 'U'. The code now uses str as the dtype, so it should work on all versions.

Emile Over a year ago

Nice trick. You can get almost 2x faster by using .view() again instead of .tostring()/fromstring. The second line can be replaced by: return numpy.array(b).view((str,end-start)).flatten()

Hamna Rashid Jun 17 at 12:07

For NumPy 2.3.0 or later, use numpy.strings.slice for faster and easier vectorized slicing. Reference: github.com/numpy/numpy/pull/27789. Usage: numpy.strings.slice(array, start, stop=None, step=None)

hpaulj · Accepted Answer · 2016-08-19 19:59:45Z

4

Most, if not all, of the functions in np.char apply existing str methods to each element of the array. That is a little faster than direct iteration (or vectorize), but not drastically so.

There isn't a string slicer; at least not by that sort of name. Closest is indexing with a slice:

In [274]: 'astring'[1:3]
Out[274]: 'st'
In [275]: 'astring'.__getitem__
Out[275]: <method-wrapper '__getitem__' of str object at 0xb3866c20>
In [276]: 'astring'.__getitem__(slice(1,4))
Out[276]: 'str'

An iterative approach can be with frompyfunc (which is also used by vectorize):

In [277]: a = np.array(['hello', 'how', 'are', 'you'])
In [278]: np.frompyfunc(lambda x:x[1:3],1,1)(a)
Out[278]: array(['el', 'ow', 're', 'ou'], dtype=object)
In [279]: np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
Out[279]: 
array(['el', 'ow', 're', 'ou'], 
      dtype='<U2')

I could view it as a single-character array and slice that:

In [289]: a.view('U1').reshape(4,-1)[:,1:3]
Out[289]: 
array([['e', 'l'],
       ['o', 'w'],
       ['r', 'e'],
       ['o', 'u']], 
      dtype='<U1')

I still need to figure out how to convert it back to 'U2'.

In [290]: a.view('U1').reshape(4,-1)[:,1:3].copy().view('U2')
Out[290]: 
array([['el'],
       ['ow'],
       ['re'],
       ['ou']], 
      dtype='<U2')

The initial view step shows the databuffer as Py3 characters (these would be bytes in a S or Py2 string case):

In [284]: a.view('U1')
Out[284]: 
array(['h', 'e', 'l', 'l', 'o', 'h', 'o', 'w', '', '', 'a', 'r', 'e', '',
       '', 'y', 'o', 'u', '', ''], 
      dtype='<U1')

Picking the 1:3 columns amounts to picking a.view('U1')[[1,2,6,7,11,12,16,17]] and then reshaping and view. Without getting into details, I'm not surprised that it requires a copy.
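
A small sketch of that equivalence, assuming the same four-element 'U5' array (variable names are illustrative): the column slice is a strided, non-contiguous view over the padded character buffer, and here it is copied before pairs of characters are reinterpreted as 'U2'.

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])   # dtype '<U5', padded to 5 characters
flat = a.view('U1')                             # 20 single characters, '' where padded
cols = flat.reshape(len(a), -1)[:, 1:3]         # strided view of columns 1 and 2

print(cols.flags['C_CONTIGUOUS'])               # False
picked = flat[[1, 2, 6, 7, 11, 12, 16, 17]]     # the same characters via fancy indexing
print(np.array_equal(cols.ravel(), picked))     # True

pairs = cols.copy().view('U2').ravel()          # copy -> contiguous, then reinterpret
print(pairs)                                    # ['el' 'ow' 're' 'ou']
```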

edited Aug 19, 2016 at 19:59

answered Aug 19, 2016 at 15:55

hpaulj

3 Comments

Alicia Garcia-Raboso Over a year ago

a.view('U1').reshape(len(a),-1)[:,1:3].astype(object).sum(axis=1) works and is the clear winner in terms of performance --- see my answer.

user2357112 Over a year ago

It looks like you can convert your array from U1 to U2 with view if you make a copy first, but we shouldn't need the copy. In principle, this should just be a simple manipulation of strides, dimensions, and offsets, but I don't know if there's a way to do it like that without directly applying C routines.

hpaulj Over a year ago

Tricky use of .sum(). In this case it performs string + string concatenation.

Martín Fixman · Accepted Answer · 2016-08-19 18:15:38Z

3

To solve this, so far I've been transforming the numpy array to a pandas Series and back. It is not a pretty solution, but it works and it works relatively fast.

a = numpy.array(['hello', 'how', 'are', 'you'])
pandas.Series(a).str[1:3].values
array(['el', 'ow', 're', 'ou'], dtype=object)

answered Aug 19, 2016 at 18:15

Martín Fixman

2 Comments

user2285236 Over a year ago

Actually when I timed it on a large array pandas was not faster for regular slicing (something like .str[1:5]). I even excluded the time for converting the array to series. It was faster for things like .str[::-1] though.

hpaulj Over a year ago

Pandas appears to use a lot of object arrays.

Alicia Garcia-Raboso · Accepted Answer · 2016-08-19 18:20:34Z

3

Interesting omission... I guess you can always write your own:

import numpy as np

def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

a = np.array(['hello', 'how', 'are', 'you'])
print(slicer(1, 3)(a))    # => ['el' 'ow' 're' 'ou']

EDIT: Here are some benchmarks using the text of Ulysses by James Joyce. ~~It seems the clear winner is @hpaulj's last strategy.~~ @Divakar gets into the race improving on @hpaulj's last strategy.

import numpy as np
import requests

ulysses = requests.get('http://www.gutenberg.org/files/4300/4300-0.txt').text
a = np.array(ulysses.split())

# Ufunc
def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

%timeit slicer(1, 3)(a)
# => 1 loop, best of 3: 221 ms per loop

# Non-mutating loop
def loop1(a):
    out = np.empty(len(a), dtype=object)
    for i, word in enumerate(a):
        out[i] = word[1:3]
    return out

%timeit loop1(a)
# => 1 loop, best of 3: 262 ms per loop

# Mutating loop
def loop2(a):
    for i in range(len(a)):
        a[i] = a[i][1:3]

b = a.copy()
%timeit -n 1 -r 1 loop2(b)
# => 1 loop, best of 1: 285 ms per loop

# From @hpaulj's answer
%timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
# => 10 loops, best of 3: 141 ms per loop

%timeit np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
# => 1 loop, best of 3: 170 ms per loop

%timeit a.view('U1').reshape(len(a),-1)[:,1:3].astype(object).sum(axis=1)
# => 10 loops, best of 3: 60.7 ms per loop

def slicer_vectorized(a,start,end):
    b = a.view('S1').reshape(len(a),-1)[:,start:end]
    return np.fromstring(b.tostring(),dtype='S'+str(end-start))

%timeit slicer_vectorized(a,1,3)
# => The slowest run took 5.34 times longer than the fastest.
#    This could mean that an intermediate result is being cached.
#    10 loops, best of 3: 16.8 ms per loop

edited Aug 19, 2016 at 18:20

answered Aug 19, 2016 at 15:35

Alicia Garcia-Raboso

2 Comments

user2285236 Over a year ago

This is slower than a regular loop though.

Alicia Garcia-Raboso Over a year ago

It is slightly faster than a loop on my machine (with a large array).

Mad Physicist · Accepted Answer · 2022-01-07 07:09:54Z

2

I completely agree that this is an omission, which is why I opened up PR #20694. If that gets accepted, you will be able to do exactly what you propose, but under the slightly more conventional name of np.char.slice_:

>>> a = np.array(['hello', 'how', 'are', 'you'])
>>> np.char.slice_(a, 1, 3)
array(['el', 'ow', 're', 'ou'])

The code in the PR is fully functional, so you can make a working copy of it, but it uses a couple of hacks to get around some limitations.

For this simple case, you can use simple slicing. Starting with numpy 1.23.0, you can view non-contiguous arrays under a dtype of different size (PR #20722). That means you can do

>>> a[:, None].view('U1')[:, 1:3].view('U2').squeeze()
array(['el', 'ow', 're', 'ou'])
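
One practical consequence, sketched below under the assumption that NumPy is 1.23.0 or newer: the result is a view, so it shares memory with a, and later writes to a show up in the sliced array. The try/except fallback for older versions is an addition for illustration and makes a copy instead.

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])
chars = a[:, None].view('U1')[:, 1:3]        # shape (4, 2), a strided view into a

try:
    sliced = chars.view('U2').squeeze(axis=1)          # no copy on NumPy >= 1.23
except ValueError:
    sliced = chars.copy().view('U2').squeeze(axis=1)   # older NumPy: copy first

print(sliced)                                # ['el' 'ow' 're' 'ou']

is_view = np.shares_memory(sliced, a)        # True when the zero-copy path was taken
a[0] = 'world'                               # overwrite the first source string
print(is_view, sliced[0])                    # a view now reads 'or'; a copy still reads 'el'
```

Call .copy() on the result if it needs to stay independent of the original array.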

answered Jan 7, 2022 at 7:09

Mad Physicist

2 Comments

Martín Fixman Over a year ago

Thank you! This is great.

Mad Physicist Over a year ago

@MartínFixman. Thanks. It's been bothering me too. I'll let you know when it gets accepted :)

Riccardo Bucco · Accepted Answer · 2025-06-20 16:36:31Z

Starting with NumPy 2.3.0, you can use numpy.strings.slice to slice each string in an array, just like regular Python string slices, but fully vectorized and supporting broadcasting.

Example usage:

import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])
result = np.strings.slice(a, 1, 3)
print(result)
# Output: ['el' 'ow' 're' 'ou']

You may also specify different slices per element:

a = np.array(['hello', 'world'])
start = np.array([1, 2])
stop =  np.array([4, 5])
result = np.strings.slice(a, start, stop)
print(result)
# Output: ['ell' 'rld']

Like standard Python slicing, negative values and steps are supported:

a = np.array(['hello world', 'foo bar', 'python rulez', 'slice me'])
# Reverse each string in the array
result = np.strings.slice(a, None, None, -1)
print(result)
# Output: ['dlrow olleh' 'rab oof' 'zelur nohtyp' 'em ecils']

This method is now the preferred way to slice strings in numpy arrays (requires NumPy ≥ 2.3.0, Python ≥ 3.11).
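
For the prefix-stripping case raised in the comments on the question, here is a hedged sketch: compute a per-element start with np.char.startswith and np.char.str_len, then slice. The helper names slice_each and strip_prefix are hypothetical. slice_each calls numpy.strings.slice when the installed NumPy provides it and otherwise falls back to a Python-level loop through np.frompyfunc, so the fallback is not fast. An explicit stop is always passed, so the sketch does not depend on how a lone start argument is interpreted.

```python
import numpy as np

# numpy.strings.slice exists only in NumPy >= 2.3.0; detect it once at import time
_np_slice = getattr(getattr(np, "strings", None), "slice", None)

def slice_each(a, start, stop):
    """Per-element a[i][start[i]:stop[i]] (hypothetical helper)."""
    if _np_slice is not None:
        return _np_slice(a, start, stop)
    # Fallback for older NumPy: loops in Python, element by element
    sliced = np.frompyfunc(lambda s, i, j: s[i:j], 3, 1)(a, start, stop)
    return np.asarray(sliced).astype(str)

def strip_prefix(a, prefixes):
    """Remove prefixes[i] from a[i] when a[i] starts with it (hypothetical helper)."""
    has_prefix = np.char.startswith(a, prefixes)
    start = np.where(has_prefix, np.char.str_len(prefixes), 0)
    return slice_each(a, start, np.char.str_len(a))

a = np.array(['prefix_one', 'two', 'pre_three'])
prefixes = np.array(['prefix_', 'pre_', 'pre_'])
print(strip_prefix(a, prefixes))   # ['one' 'two' 'three']
```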
