I'm using a numpy object_ array to store variable length strings, e.g.
a = np.array(['hello','world','!'],dtype=np.object_)
Is there an easy way to find the length of the longest string in the array without looping over all elements?
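If you store the strings in a numpy array of dtype object, you can't get at the lengths of the individual strings without looping over them. However, if you let np.array decide the dtype, it picks a fixed-width string type wide enough for the longest element, so you can find the length of the longest string by peeking at the dtype. (The session below is from Python 2, where the strings are byte strings. On Python 3 the dtype is '<U21' and itemsize is measured in bytes at 4 per character, so it reports 84 rather than 21.)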
In [64]: a = np.array(['hello','world','!','Oooh gaaah booo gaah?'])
In [65]: a.dtype
Out[65]: dtype('|S21')
In [72]: a.dtype.itemsize
Out[72]: 21
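For the object array from the question, one possible route is to convert it to a fixed-width string array first and then read the width off the dtype. This is only a sketch: it assumes Python 3 and that every element really is a str, and the names fixed and longest are mine. The conversion still visits every element, just inside numpy rather than in a Python loop.

```python
import numpy as np

# Object array as in the question
a = np.array(['hello', 'world', '!'], dtype=np.object_)

# astype(str) builds a fixed-width unicode array sized for the longest element
fixed = a.astype(str)

# itemsize is in bytes, so divide by the size of a single character
longest = fixed.dtype.itemsize // np.dtype('U1').itemsize
print(longest)
```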
Say I want to get the length of the longest string in the second column:
data_array = np.array([['BFNN', 'Forested bog without permafrost or patterning, no internal lawns'],
                       ['BONS', 'Nonpatterned, open, shrub-dominated bog']])
def get_max_len_column_value(data_array, column):
    # Index with a plain integer so max() compares the strings themselves;
    # data_array[:,[column]] yields 1-element rows that all have len() == 1
    return len(max(data_array[:, column], key=len))
get_max_len_column_value(data_array, 1)
>>>64
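A vectorized alternative for the same column problem might use np.char.str_len, which returns the length of each element of a string array. This is a sketch rather than a benchmarked recommendation. I put the longest string in the second row on purpose, so the result does not depend on row order; lengths and max_len are just illustrative names.

```python
import numpy as np

data_array = np.array([
    ['BONS', 'Nonpatterned, open, shrub-dominated bog'],
    ['BFNN', 'Forested bog without permafrost or patterning, no internal lawns'],
])

# Per-element string lengths of the second column, computed inside numpy
lengths = np.char.str_len(data_array[:, 1])

# Length of the longest string in that column
max_len = int(lengths.max())
print(max_len)
```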
I would also like to mention a C-like method:
int(string_array.dtype.itemsize/np.dtype(string_array.dtype.char+'1').itemsize)
It seems to be more efficient than the accepted answer:
codes_len = 10000
codes_size = 10000
string_array = np.random.choice(np.array([b'a', b'b']), [codes_size, codes_len])
string_array = np.array([b"".join(string_array[i]).decode('utf-8') for i in range(codes_size)])
%time res = int(string_array.dtype.itemsize/np.dtype(string_array.dtype.char+'1').itemsize)
print('result is:', str(res) + '\n')
>>> CPU times: user 21 µs, sys: 4 µs, total: 25 µs
>>> Wall time: 29.1 µs
>>> result is: 10000
%time res = len(max(string_array, key=len))
print('result is:', res)
>>> CPU times: user 333 ms, sys: 8.21 ms, total: 342 ms
>>> Wall time: 341 ms
>>> result is: 10000
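One caveat worth keeping in mind with the dtype-based methods: the dtype records the width the array was allocated with, not the longest string it currently holds. After slicing or filtering, the width can overestimate the real maximum. A small sketch (variable names are mine):

```python
import numpy as np

a = np.array(['hello', 'world', '!', 'Oooh gaaah booo gaah?'])

# The slice drops the longest string but keeps the original dtype
subset = a[:3]

# Width taken from the dtype: still sized for the 21-character string
dtype_width = subset.dtype.itemsize // np.dtype(subset.dtype.char + '1').itemsize

# Actual longest string in the slice
actual_max = int(np.char.str_len(subset).max())

print(dtype_width, actual_max)
```

So the dtype trick is an upper bound that is exact only when the array was built directly from the strings it contains.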