
Is there any builtin operation in NumPy that returns the length of each string in an array?

I don't think any of the NumPy string operations does that; is this correct?

I can do it with a for loop, but maybe there's something more efficient?

import numpy as np
arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')

sizes = []
for i in arr:
    sizes.append(len(i))

print(sizes)
[5, 3, 3, 10]
1 Comment

For modest-size arrays, the list-comprehension equivalent is good: [len(i) for i in arr]. The np.char functions aren't speedy either, since they still have to apply string methods to each element.

4 Answers


You can use np.vectorize from NumPy. It is faster than an explicit Python for loop:

mylen = np.vectorize(len)
print(mylen(arr))
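
Applied to the array from the question, a quick sketch of the expected result (the vectorized function can be built once and reused):

import numpy as np

arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')
mylen = np.vectorize(len)    # build the vectorized function once, reuse it
print(mylen(arr))            # [ 5  3  3 10]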

5 Comments

In my timings mylen is noticeably slower than the list comprehension for this small example array, and barely faster for one that is 1000x larger. vectorize does not promise speed. It does make iterating over all elements of a multidimensional array easier.
@hpaulj I was referring to the for loop. It is faster than a plain for loop, and if the data is huge, it is still faster than the list comprehension. Data matters. :)
Compared to pandas, this was faster than df['s'].str.len()
Is there no one-liner for this standard operation in NumPy? np.vectorize(len)(arr) is weird.
The reason np.vectorize is not faster is that np.vectorize just runs a Python for loop; it does not compile. This deficiency is, in fact, why Numba was written. Here's Travis Oliphant (author of NumPy) saying exactly that: youtu.be/QpaapVaL8Fw?t=1621

UPDATE 06/20: cater for the U+0000 character and non-contiguous inputs - thanks @M1L0U

Here is a comparison of a couple of methods.

Observations:

  • For input sizes above ~1000 lines, viewcasting + argmax is consistently the fastest, by a large margin.
  • The pure-Python solutions profit from converting the array to a list first.
  • map beats the list comprehension (a short sketch follows this list).
  • np.frompyfunc and, to a lesser degree, np.vectorize fare better than their reputation.
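
For instance, the map-after-.tolist() variant from the table, applied to the question's array, might look like this (a minimal sketch):

import numpy as np

arr = np.array(['Hello', 'foo', 'and', 'whatsoever'])
# converting to a Python list first avoids per-element numpy scalar overhead;
# map(len, ...) is then the fastest pure-Python option in the table
sizes = list(map(len, arr.tolist()))
print(sizes)   # [5, 3, 3, 10]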

Timings are in milliseconds per call (10 repetitions via timeit):

 contiguous
method ↓↓                  size →→  |     10|    100|   1000|  10000| 100000|1000000
------------------------------------+-------+-------+-------+-------+-------+-------
np.char.str_len                     |  0.006|  0.037|  0.350|  3.566| 34.781|345.803
list comprehension                  |  0.005|  0.036|  0.312|  2.970| 28.783|293.715
list comprehension after .tolist()  |  0.002|  0.011|  0.117|  1.119| 12.863|133.886
map                                 |  0.002|  0.008|  0.080|  0.745|  9.374|103.749
np.frompyfunc                       |  0.004|  0.011|  0.089|  0.861|  8.824| 88.739
np.vectorize                        |  0.025|  0.032|  0.132|  1.046| 12.112|133.863
safe argmax                         |  0.026|  0.026|  0.056|  0.290|  2.827| 32.583

 non-contiguous
method ↓↓                  size →→  |     10|    100|   1000|  10000| 100000|1000000
------------------------------------+-------+-------+-------+-------+-------+-------
np.char.str_len                     |  0.006|  0.037|  0.349|  3.575| 34.525|344.859
list comprehension                  |  0.005|  0.032|  0.306|  2.963| 29.445|292.527
list comprehension after .tolist()  |  0.002|  0.011|  0.117|  1.043| 11.081|130.644
map                                 |  0.002|  0.008|  0.081|  0.731|  7.967| 99.848
np.frompyfunc                       |  0.005|  0.012|  0.099|  0.885|  9.221| 92.700
np.vectorize                        |  0.025|  0.033|  0.146|  1.063| 11.844|134.505
safe argmax                         |  0.026|  0.026|  0.057|  0.291|  2.997| 31.161

Code:

import numpy as np

flist = []
def timeme(name):
    # decorator: register each benchmark candidate under a display name
    def wrap_gen(f):
        flist.append((name, f))
        return f
    return wrap_gen

@timeme("np.char.str_len")
def np_char():
    return np.char.str_len(A)

@timeme("list comprehension")
def lst_cmp():
    return [len(a) for a in A]

@timeme("list comprehension after .tolist()")
def lst_cmp_opt():
    return [len(a) for a in A.tolist()]

@timeme("map")
def map_():
    return list(map(len, A.tolist()))

@timeme("np.frompyfunc")
def np_fpf():
    return np.frompyfunc(len, 1, 1)(A)

@timeme("np.vectorize")
def np_vect():
    return np.vectorize(len)(A)
    
@timeme("safe argmax")
def np_safe():
    assert A.dtype.kind=="U"
    # work around numpy's refusal to viewcast non contiguous arrays
    v = np.lib.stride_tricks.as_strided(
        A[0,None].view("u4"),(A.size,A.itemsize>>2),(A.strides[0],4))
    v = v[:,::-1].astype(bool)
    l = v.argmax(1)
    empty = (~(v[:,0]|l.astype(bool))).nonzero()
    l = v.shape[1]-l
    l[empty] = 0
    return l
    
# build a large non-contiguous test array (note the [::2] slice) containing
# an embedded \x00, then check every method against the "safe argmax" reference
A = np.random.choice(
    "Blind\x00text do not use the quick brown fox jumps over the lazy dog "
    .split(" "),1000000)[::2]

for _, f in flist[:-1]:
    assert (f()==flist[-1][1]()).all()

from timeit import timeit

for j,tag in [(1,"contiguous"),(2,"non-contiguous")]:
    print('\n',tag)
    # L holds the table column by column: a string of per-row separators,
    # the label column, then (appended below) one timing column per size N;
    # zip(*L) transposes it into printable rows
    L = ['|+' + len(flist)*'|',
         [f"{'method ↓↓                  size →→':36s}", 36*'-']
         + [f"{name:36s}" for name, f in flist]]
    for N in (10, 100, 1000, 10000, 100000, 1000000):
        A = np.random.choice("Blind\x00text do not use the quick brown fox"
                             " jumps over the lazy dog ".split(" "),j*N)[::j]
        L.append([f"{N:>7d}", 7*'-']
                 + [f"{timeit(f, number=10)*100:7.3f}" for name, f in flist])
    for sep, *line in zip(*L):
        print(*line, sep=sep)

6 Comments

Great answer! The trick with the argmin is nice, but it has one edge case: it finds the first '\x00', which is not necessarily the end of the string. In most cases, especially unicode strings, the only '\x00' are at the end, but some strings (like bytes) may contain some at any position. In order to find the size of the string, we need to count how many '\x00' there are starting from the right. This may be done, for instance, by (numpy.cumsum((v != 0)[:, ::-1],axis=1) > 0).sum(axis=1)
Hi @M1L0U, it has been a while since I wrote that answer, but I think it is correct. Note that I'm not casting to byte but to uint32, matching the fact that numpy's U... dtype is a 4-byte fixed-width encoding. I don't think four aligned zero bytes are possible anywhere but as an end marker. Or are they?
It is technically possible, but very strange. I don't see use cases except "wrongly" converting raw bytes into unicode strings. A = numpy.array(['ABC\x00DEF', 'A\x00BC']) is a perfectly valid U7 array from numpy's point of view
@M1L0U Holy cow, you are right. Apparently, U+0000 is a valid Unicode character. I'll update the post.
And because I love edge cases, here is another one ^^ (might be useful to someone Googling for this). If the array is not C-contiguous, then the view(np.uint32) will fail. Such things typically occur after slicing with step != 1, for instance doing A[::-1]. To fix this, you can add A = numpy.asarray(A, order='C') before the .view. Note that it copies the data only when needed.
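
For reference, a minimal runnable sketch of that counting approach, using the U7 example array from the comments (variable names here are illustrative):

import numpy as np

A = np.array(['ABC\x00DEF', 'A\x00BC'])        # valid '<U7' array with an embedded U+0000
v = A.view(np.uint32).reshape(A.size, -1)      # one uint32 code point per character slot
lengths = (np.cumsum((v != 0)[:, ::-1], axis=1) > 0).sum(axis=1)
print(lengths)                                 # [7 4]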

Using str_len from NumPy:

sizes = np.char.str_len(arr)

str_len documentation: https://numpy.org/devdocs/reference/generated/numpy.char.str_len.html
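
For the array from the question, a quick check (a minimal sketch) would look like this:

import numpy as np

arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')
sizes = np.char.str_len(arr)   # vectorized element-wise string length
print(sizes)                   # [ 5  3  3 10]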

1 Comment

This is by far the best answer to the question. Please upvote it! It needs to be at the top!

For me, this would be the way to go:

sizes = [len(i) for i in arr]

