19

Numpy has some very useful string operations, which vectorize the usual Python string operations.

Compared to pandas' .str accessor, however, the numpy string routines seem to be missing a very important operation: the ability to slice each string in the array. For example,

a = numpy.array(['hello', 'how', 'are', 'you'])
numpy.char.sliceStr(a, slice(1, 3))
>>> numpy.array(['el', 'ow', 're', 'ou'])

Am I missing some obvious method in the module with this functionality? Otherwise, is there a fast vectorized way to achieve this?

4 Comments
  • I am not sure what the question is. Who is missing this feature? Commented Aug 19, 2016 at 15:03
  • Numpy's routines.char seems to be missing it. I edited the question to make this more clear. Commented Aug 19, 2016 at 15:09
  • I've looked for this functionality as well. I think I've always ended up using some kind of loop. Commented Aug 19, 2016 at 15:32
  • And how about variable slicing? E.g., the vectorized equivalent of [a[s:e] for a, s, e in zip(a_array, s_array, e_array)]? I was hoping that the numpy.strings operations would provide something like that. I am trying to remove prefixes from strings, and numpy.strings.startswith(a, b) is so tantalizingly close -- I'd just need to clip those prefixes in b when the strings in a start with them... Commented May 22 at 2:45

6 Answers 6

19

Here's a vectorized approach -

def slicer_vectorized(a,start,end):
    b = a.view((str,1)).reshape(len(a),-1)[:,start:end]
    return np.fromstring(b.tostring(),dtype=(str,end-start))

Sample run -

In [68]: a = np.array(['hello', 'how', 'are', 'you'])

In [69]: slicer_vectorized(a,1,3)
Out[69]: 
array(['el', 'ow', 're', 'ou'], 
      dtype='|S2')

In [70]: slicer_vectorized(a,0,3)
Out[70]: 
array(['hel', 'how', 'are', 'you'], 
      dtype='|S3')
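
The sample output above shows byte strings ('|S2'), which is what Python 2 produced. As a rough sketch of the same idea for Python 3 and recent NumPy, the variant below uses tobytes() and np.frombuffer in place of tostring() and np.fromstring. The name slicer_vectorized_py3 is made up for illustration, and the sketch assumes a 1-D unicode ('U') array with 0 <= start < end <= the array's string width.

```python
import numpy as np

def slicer_vectorized_py3(a, start, end):
    # Hypothetical Python 3 variant of slicer_vectorized; assumes a 1-D
    # array of dtype 'U' and 0 <= start < end <= string width.
    a = np.ascontiguousarray(a)
    if a.dtype.kind != "U":
        raise TypeError("expected a unicode ('U') string array")
    width = a.dtype.itemsize // 4  # NumPy stores 4 bytes per character
    # View every character as its own element, one row per string,
    # then keep only the requested columns.
    chars = a.view("U1").reshape(len(a), width)[:, start:end]
    # tobytes() copies the selected columns into a contiguous buffer,
    # which is then reinterpreted as fixed-width strings.
    return np.frombuffer(chars.tobytes(), dtype=f"U{end - start}")

a = np.array(['hello', 'how', 'are', 'you'])
print(slicer_vectorized_py3(a, 1, 3))  # ['el' 'ow' 're' 'ou']
```

The array returned by np.frombuffer is read-only; call .copy() on it if a writable result is needed.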

Runtime test -

This compares all the approaches posted by other authors that I could run on my end, along with the vectorized approach from earlier in this post.

Here are the timings -

In [53]: # Setup input array
    ...: a = np.array(['hello', 'how', 'are', 'you'])
    ...: a = np.repeat(a,10000)
    ...: 

# @Alberto Garcia-Raboso's answer
In [54]: %timeit slicer(1, 3)(a)
10 loops, best of 3: 23.5 ms per loop

# @hpaulj's answer
In [55]: %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
100 loops, best of 3: 11.6 ms per loop

# Using a list comprehension
In [56]: %timeit np.array([i[1:3] for i in a])
100 loops, best of 3: 12.1 ms per loop

# From this post
In [57]: %timeit slicer_vectorized(a,1,3)
1000 loops, best of 3: 787 µs per loop

5 Comments

Was this tested with Python 2.x or 3.x? With 3.5.2 I get array([b'', b'', b'', b''], dtype='|S2') as output?
@Bart This was with Python 2.7. Will see if the issue is there with Python 3.x. Thanks for notifying.
@Bart: I edited the code in this answer slightly to make it compatible with Python 3. The problem was the use of 'S' as the dtype, while Python 3 uses Unicode everywhere, so 'U'. Now it uses str as the dtype so it should work on all versions.
Nice trick. You can get almost 2x faster by using .view() again instead of .tostring()/fromstring. The second line can be replaced by: return numpy.array(b).view((str,end-start)).flatten()
For NumPy 2.3.0 or above, use numpy.strings.slice for faster and easier vectorized slicing. Reference: github.com/numpy/numpy/pull/27789. Usage: numpy.strings.slice(array, start, stop=None, step=None)
4

Most, if not all, of the functions in np.char apply existing str methods to each element of the array. That is a little faster than direct iteration (or vectorize), but not drastically so.

There isn't a string slicer; at least not by that sort of name. Closest is indexing with a slice:

In [274]: 'astring'[1:3]
Out[274]: 'st'
In [275]: 'astring'.__getitem__
Out[275]: <method-wrapper '__getitem__' of str object at 0xb3866c20>
In [276]: 'astring'.__getitem__(slice(1,4))
Out[276]: 'str'

An iterative approach can be with frompyfunc (which is also used by vectorize):

In [277]: a = np.array(['hello', 'how', 'are', 'you'])
In [278]: np.frompyfunc(lambda x:x[1:3],1,1)(a)
Out[278]: array(['el', 'ow', 're', 'ou'], dtype=object)
In [279]: np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
Out[279]: 
array(['el', 'ow', 're', 'ou'], 
      dtype='<U2')

I could view it as a single character array, and slice that

In [289]: a.view('U1').reshape(4,-1)[:,1:3]
Out[289]: 
array([['e', 'l'],
       ['o', 'w'],
       ['r', 'e'],
       ['o', 'u']], 
      dtype='<U1')

I still need to figure out how to convert it back to 'U2'.

In [290]: a.view('U1').reshape(4,-1)[:,1:3].copy().view('U2')
Out[290]: 
array([['el'],
       ['ow'],
       ['re'],
       ['ou']], 
      dtype='<U2')

The initial view step shows the data buffer as Py3 characters (these would be bytes in an S or Py2 string case):

In [284]: a.view('U1')
Out[284]: 
array(['h', 'e', 'l', 'l', 'o', 'h', 'o', 'w', '', '', 'a', 'r', 'e', '',
       '', 'y', 'o', 'u', '', ''], 
      dtype='<U1')

Picking the 1:3 columns amounts to picking a.view('U1')[[1,2,6,7,11,12,16,17]] and then reshaping and viewing the result. Without getting into details, I'm not surprised that it requires a copy.
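
To make that index equivalence concrete, here is a small sketch that builds the flat indices explicitly and checks that they match the list quoted above. The variable names are only for illustration; the fancy-indexing step is what makes the copy.

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])
width = a.dtype.itemsize // 4        # 5 characters per element ('<U5')
flat = a.view('U1')                  # 20 single characters, padded with ''

# Characters 1 and 2 of string i sit at flat positions i*width + 1 and i*width + 2.
cols = np.arange(1, 3)
idx = (np.arange(len(a))[:, None] * width + cols).ravel()
print(idx.tolist())                  # [1, 2, 6, 7, 11, 12, 16, 17]

picked = flat[idx]                   # fancy indexing always returns a copy
result = picked.view('U2')           # regroup each pair of characters
print(result)                        # ['el' 'ow' 're' 'ou']
```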

3 Comments

a.view('U1').reshape(len(a),-1)[:,1:3].astype(object).sum(axis=1) works and is the clear winner in terms of performance --- see my answer.
It looks like you can convert your array from U1 to U2 with view if you make a copy first, but we shouldn't need the copy. In principle, this should just be a simple manipulation of strides, dimensions, and offsets, but I don't know if there's a way to do it like that without directly applying C routines.
tricky use of .sum(). In this case it's the string + string concatenation.
3

To solve this, so far I've been transforming the numpy array to a pandas Series and back. It is not a pretty solution, but it works and it works relatively fast.

a = numpy.array(['hello', 'how', 'are', 'you'])
pandas.Series(a).str[1:3].values
array(['el', 'ow', 're', 'ou'], dtype=object)

2 Comments

Actually when I timed it on a large array pandas was not faster for regular slicing (something like .str[1:5]). I even excluded the time for converting the array to series. It was faster for things like .str[::-1] though.
Pandas appears to use a lot of object arrays.
3

Interesting omission... I guess you can always write your own:

import numpy as np

def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

a = np.array(['hello', 'how', 'are', 'you'])
print(slicer(1, 3)(a))    # => ['el' 'ow' 're' 'ou']

EDIT: Here are some benchmarks using the text of Ulysses by James Joyce. It seems the clear winner is @hpaulj's last strategy. @Divakar gets into the race improving on @hpaulj's last strategy.

import numpy as np
import requests

ulysses = requests.get('http://www.gutenberg.org/files/4300/4300-0.txt').text
a = np.array(ulysses.split())

# Ufunc
def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

%timeit slicer(1, 3)(a)
# => 1 loop, best of 3: 221 ms per loop

# Non-mutating loop
def loop1(a):
    out = np.empty(len(a), dtype=object)
    for i, word in enumerate(a):
        out[i] = word[1:3]
    return out

%timeit loop1(a)
# => 1 loop, best of 3: 262 ms per loop

# Mutating loop
def loop2(a):
    for i in range(len(a)):
        a[i] = a[i][1:3]

b = a.copy()
%timeit -n 1 -r 1 loop2(b)
# => 1 loop, best of 1: 285 ms per loop

# From @hpaulj's answer
%timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
# => 10 loops, best of 3: 141 ms per loop

%timeit np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
# => 1 loop, best of 3: 170 ms per loop

%timeit a.view('U1').reshape(len(a),-1)[:,1:3].astype(object).sum(axis=1)
# => 10 loops, best of 3: 60.7 ms per loop

def slicer_vectorized(a,start,end):
    b = a.view('S1').reshape(len(a),-1)[:,start:end]
    return np.fromstring(b.tostring(),dtype='S'+str(end-start))

%timeit slicer_vectorized(a,1,3)
# => The slowest run took 5.34 times longer than the fastest.
#    This could mean that an intermediate result is being cached.
#    10 loops, best of 3: 16.8 ms per loop

2 Comments

This is slower than a regular loop though.
It is slightly faster than a loop on my machine (with a large array).
2

I completely agree that this is an omission, which is why I opened up PR #20694. If that gets accepted, you will be able to do exactly what you propose, but under the slightly more conventional name of np.char.slice_:

>>> a = np.array(['hello', 'how', 'are', 'you'])
>>> np.char.slice_(a, 1, 3)
array(['el', 'ow', 're', 'ou'])

The code in the PR is fully functional, so you can make a working copy of it, but it uses a couple of hacks to get around some limitations.

For this simple case, you can use simple slicing. Starting with numpy 1.23.0, you can view non-contiguous arrays under a dtype of different size (PR #20722). That means you can do

>>> a[:, None].view('U1')[:, 1:3].view('U2').squeeze()
array(['el', 'ow', 're', 'ou'])
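
As a sketch only, the one-liner can be wrapped in a small helper. The name slice_view is hypothetical, and the code assumes NumPy 1.23.0 or later, a 1-D unicode ('U') array, and 0 <= start < stop <= the string width.

```python
import numpy as np

def slice_view(a, start, stop):
    # Hypothetical helper around the view trick; assumes numpy >= 1.23,
    # a 1-D 'U' array, and 0 <= start < stop <= string width.
    if a.dtype.kind != "U" or a.ndim != 1:
        raise TypeError("expected a 1-D unicode ('U') string array")
    chars = a[:, None].view("U1")        # shape (n, width), one character per cell
    sub = chars[:, start:stop]           # column slice, no copy
    # Reinterpret each row of characters as one string. Using [:, 0]
    # instead of squeeze() keeps the result 1-D for one-element input.
    return sub.view(f"U{sub.shape[1]}")[:, 0]

a = np.array(['hello', 'how', 'are', 'you'])
print(slice_view(a, 1, 3))  # ['el' 'ow' 're' 'ou']
```

Because only views are involved, the result should share memory with a, so writing into it would also change the original strings.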

2 Comments

Thank you! This is great.
@MartínFixman. Thanks. It's been bothering me too. I'll let you know when it gets accepted :)
1

Starting with NumPy 2.3.0, you can use numpy.strings.slice to slice each string in an array, just like regular Python string slices, but fully vectorized and supporting broadcasting.

Example usage:

import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])
result = np.strings.slice(a, 1, 3)
print(result)
# Output: ['el' 'ow' 're' 'ou']

You may also specify different slices per element:

a = np.array(['hello', 'world'])
start = np.array([1, 2])
stop =  np.array([4, 5])
result = np.strings.slice(a, start, stop)
print(result)
# Output: ['ell' 'rld']

Like standard Python slicing, negative values and steps are supported:

a = np.array(['hello world', 'foo bar', 'python rulez', 'slice me'])
# Reverse each string in the array
result = np.strings.slice(a, None, None, -1)
print(result)
# Output: ['dlrow olleh' 'rab oof' 'zelur nohtyp' 'em ecils']

This method is now the preferred way to slice strings in numpy arrays (requires NumPy ≥ 2.3.0, Python ≥ 3.11).
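
For code that also has to run on older NumPy releases, one possible pattern is a thin wrapper that uses np.strings.slice when it is available and otherwise falls back to a plain Python loop. This is a sketch under assumptions: slice_strings is a made-up name, and it asks for both start and stop explicitly so that the two branches are called the same way.

```python
import numpy as np

def slice_strings(a, start, stop, step=None):
    # Hypothetical compatibility wrapper, not part of NumPy.
    a = np.asarray(a)
    strings_mod = getattr(np, "strings", None)
    if strings_mod is not None and hasattr(strings_mod, "slice"):
        # NumPy >= 2.3.0: vectorized implementation.
        return strings_mod.slice(a, start, stop, step)
    # Older NumPy: slice each element in a Python-level loop.
    sliced = [s[start:stop:step] for s in a.ravel().tolist()]
    return np.array(sliced, dtype=str).reshape(a.shape)

a = np.array(['hello', 'how', 'are', 'you'])
print(slice_strings(a, 1, 3))            # ['el' 'ow' 're' 'ou']
print(slice_strings(a, None, None, -1))  # ['olleh' 'woh' 'era' 'uoy']
```

The fallback is no faster than a list comprehension, so it only keeps the call site uniform across versions.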
