
Is there any builtin operation in NumPy that returns the length of each string in an array?

I don't think any of the NumPy string operations does that; is this correct?

I can do it with a for loop, but maybe there's something more efficient?

import numpy as np
arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')

sizes = []
for i in arr:
    sizes.append(len(i))

print(sizes)
[5, 3, 3, 10]
1 Comment

For modest-size arrays, the list-comprehension equivalent is good: [len(i) for i in arr]. The np.char functions aren't speedy either, since they still have to apply string methods to each element.

4 Answers


You can use np.vectorize from NumPy. It is faster than an explicit Python for loop:

mylen = np.vectorize(len)
print(mylen(arr))
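
Applied to the array from the question, a quick sketch of the expected result (the vectorized function can be built once and reused):

import numpy as np

arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')
mylen = np.vectorize(len)    # build the vectorized function once, reuse it
print(mylen(arr))            # [ 5  3  3 10]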

5 Comments

In my timings mylen is noticeably slower than the list comprehension for this small example array, and barely faster for one that is 1000x larger. vectorize does not promise speed. It does make iterating over all elements of a multidimensional array easier.
@hpaulj I was referring to the for loop. It is faster than a plain for loop, and if the data is huge, it is still faster than the list comprehension. Data matters. :)
Compared to pandas, this was faster than df['s'].str.len()
Is there no one-liner for this standard operation in NumPy? np.vectorize(len)(arr) is weird.
The reason np.vectorize is not faster is that np.vectorize just runs a Python for loop; it does not compile. This deficiency is, in fact, why Numba was written. Here's Travis Oliphant (author of NumPy) saying exactly that: youtu.be/QpaapVaL8Fw?t=1621

UPDATE 06/20: cater for the U+0000 character and non-contiguous inputs - thanks @M1L0U

Here is a comparison of a couple of methods.

Observations:

  • For input sizes above ~1000 lines, viewcasting + argmax is consistently the fastest, by a large margin.
  • The pure-Python solutions profit from converting the array to a list first.
  • map beats the list comprehension (a short sketch follows this list).
  • np.frompyfunc and, to a lesser degree, np.vectorize fare better than their reputation.
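
For instance, the map-after-.tolist() variant from the table, applied to the question's array, might look like this (a minimal sketch):

import numpy as np

arr = np.array(['Hello', 'foo', 'and', 'whatsoever'])
# converting to a Python list first avoids per-element numpy scalar overhead;
# map(len, ...) is then the fastest pure-Python option in the table
sizes = list(map(len, arr.tolist()))
print(sizes)   # [5, 3, 3, 10]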

Timings are in milliseconds per call (10 repetitions via timeit):

 contiguous
method ↓↓                  size →→  |     10|    100|   1000|  10000| 100000|1000000
------------------------------------+-------+-------+-------+-------+-------+-------
np.char.str_len                     |  0.006|  0.037|  0.350|  3.566| 34.781|345.803
list comprehension                  |  0.005|  0.036|  0.312|  2.970| 28.783|293.715
list comprehension after .tolist()  |  0.002|  0.011|  0.117|  1.119| 12.863|133.886
map                                 |  0.002|  0.008|  0.080|  0.745|  9.374|103.749
np.frompyfunc                       |  0.004|  0.011|  0.089|  0.861|  8.824| 88.739
np.vectorize                        |  0.025|  0.032|  0.132|  1.046| 12.112|133.863
safe argmax                         |  0.026|  0.026|  0.056|  0.290|  2.827| 32.583

 non-contiguous
method ↓↓                  size →→  |     10|    100|   1000|  10000| 100000|1000000
------------------------------------+-------+-------+-------+-------+-------+-------
np.char.str_len                     |  0.006|  0.037|  0.349|  3.575| 34.525|344.859
list comprehension                  |  0.005|  0.032|  0.306|  2.963| 29.445|292.527
list comprehension after .tolist()  |  0.002|  0.011|  0.117|  1.043| 11.081|130.644
map                                 |  0.002|  0.008|  0.081|  0.731|  7.967| 99.848
np.frompyfunc                       |  0.005|  0.012|  0.099|  0.885|  9.221| 92.700
np.vectorize                        |  0.025|  0.033|  0.146|  1.063| 11.844|134.505
safe argmax                         |  0.026|  0.026|  0.057|  0.291|  2.997| 31.161

Code:

import numpy as np

flist = []
def timeme(name):
    # decorator: register each benchmark candidate under a display name
    def wrap_gen(f):
        flist.append((name, f))
        return f
    return wrap_gen

@timeme("np.char.str_len")
def np_char():
    return np.char.str_len(A)

@timeme("list comprehension")
def lst_cmp():
    return [len(a) for a in A]

@timeme("list comprehension after .tolist()")
def lst_cmp_opt():
    return [len(a) for a in A.tolist()]

@timeme("map")
def map_():
    return list(map(len, A.tolist()))

@timeme("np.frompyfunc")
def np_fpf():
    return np.frompyfunc(len, 1, 1)(A)

@timeme("np.vectorize")
def np_vect():
    return np.vectorize(len)(A)
    
@timeme("safe argmax")
def np_safe():
    assert A.dtype.kind=="U"
    # work around numpy's refusal to viewcast non contiguous arrays
    v = np.lib.stride_tricks.as_strided(
        A[0,None].view("u4"),(A.size,A.itemsize>>2),(A.strides[0],4))
    v = v[:,::-1].astype(bool)
    l = v.argmax(1)
    empty = (~(v[:,0]|l.astype(bool))).nonzero()
    l = v.shape[1]-l
    l[empty] = 0
    return l
    
# build a large non-contiguous test array (note the [::2] slice) containing
# an embedded \x00, then check every method against the "safe argmax" reference
A = np.random.choice(
    "Blind\x00text do not use the quick brown fox jumps over the lazy dog "
    .split(" "),1000000)[::2]

for _, f in flist[:-1]:
    assert (f()==flist[-1][1]()).all()

from timeit import timeit

for j,tag in [(1,"contiguous"),(2,"non-contiguous")]:
    print('\n',tag)
    # L holds the table column by column: a string of per-row separators,
    # the label column, then (appended below) one timing column per size N;
    # zip(*L) transposes it into printable rows
    L = ['|+' + len(flist)*'|',
         [f"{'method ↓↓                  size →→':36s}", 36*'-']
         + [f"{name:36s}" for name, f in flist]]
    for N in (10, 100, 1000, 10000, 100000, 1000000):
        A = np.random.choice("Blind\x00text do not use the quick brown fox"
                             " jumps over the lazy dog ".split(" "),j*N)[::j]
        L.append([f"{N:>7d}", 7*'-']
                 + [f"{timeit(f, number=10)*100:7.3f}" for name, f in flist])
    for sep, *line in zip(*L):
        print(*line, sep=sep)

6 Comments

Great answer! The trick with the argmin is nice, but it has one edge case: it finds the first '\x00', which is not necessarily the end of the string. In most cases, especially unicode strings, the only '\x00' are at the end, but some strings (like bytes) may contain some at any position. In order to find the size of the string, we need to count how many '\x00' there are starting from the right. This may be done, for instance, by (numpy.cumsum((v != 0)[:, ::-1],axis=1) > 0).sum(axis=1)
Hi @M1L0U, it has been a while since I wrote that answer, but I think it is correct. Note that I'm not casting to byte but to uint32, matching the fact that numpy's U... dtype is a 4-byte fixed-width encoding. I don't think four aligned zero bytes are possible anywhere but as an end marker. Or are they?
It is technically possible, but very strange. I don't see use cases except "wrongly" converting raw bytes into unicode strings. A = numpy.array(['ABC\x00DEF', 'A\x00BC']) is a perfectly valid U7 array from numpy's point of view
@M1L0U Holy cow, you are right. Apparently, U+0000 is a valid Unicode character. I'll update the post.
And because I love edge cases, here is another one ^^ (might be useful to someone Googling for this). If the array is not C-contiguous, then the view(np.uint32) will fail. Such things typically occur after slicing with step != 1, for instance doing A[::-1]. To fix this, you can add A = numpy.asarray(A, order='C') before the .view. Note that it copies the data only when needed.
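
For reference, a minimal runnable sketch of that counting approach, using the U7 example array from the comments (variable names here are illustrative):

import numpy as np

A = np.array(['ABC\x00DEF', 'A\x00BC'])        # valid '<U7' array with an embedded U+0000
v = A.view(np.uint32).reshape(A.size, -1)      # one uint32 code point per character slot
lengths = (np.cumsum((v != 0)[:, ::-1], axis=1) > 0).sum(axis=1)
print(lengths)                                 # [7 4]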

Using str_len from NumPy:

sizes = np.char.str_len(arr)

str_len documentation: https://numpy.org/devdocs/reference/generated/numpy.char.str_len.html
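
For the array from the question, a quick check (a minimal sketch) would look like this:

import numpy as np

arr = np.array(['Hello', 'foo', 'and', 'whatsoever'], dtype='S256')
sizes = np.char.str_len(arr)   # vectorized element-wise string length
print(sizes)                   # [ 5  3  3 10]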

1 Comment

This is by far the best answer to the question. Please upvote it! It needs to be at the top!

For me, this would be the way to go:

sizes = [len(i) for i in arr]

