I am trying to apply vectorization with a custom function on NumPy string arrays.

Example:

import numpy

test_array = numpy.char.array(["sample1-sample","sample2-sample"])

numpy.char.array(test_array.split('-'))[:,0]

Output:

chararray([b'sample1', b'sample2'], dtype='|S7')

But these are built-in functions. Is there any other way to achieve vectorization with custom functions? For example, with the following function:

def custom(text):
    return text[0]
  • You can vectorize any function with np.vectorize (numpy.org/doc/stable/reference/generated/numpy.vectorize.html). However, it's mostly syntactic sugar, and its implementation is essentially a for loop according to the NumPy documentation itself. I'm not sure what you're trying to achieve, but using built-in functions is generally good practice. Commented Oct 5, 2021 at 14:33
  • Even the np.char functions use Python string methods, and aren't faster than a list comprehension using the same method. Fast 'vectorization' as done with numeric dtypes isn't possible with strings. Commented Oct 5, 2021 at 14:43

1 Answer

numpy doesn't implement fast string methods (as it does for numeric dtypes). So the np.char code is more for convenience than performance.

In [124]: alist=["sample1-sample","sample2-sample"]
In [125]: arr = np.array(alist)
In [126]: carr = np.char.array(alist)

A straightforward list comprehension versus your code:

In [127]: [item.split('-')[0] for item in alist]
Out[127]: ['sample1', 'sample2']
In [128]: np.char.array(carr.split('-'))[:,0]
Out[128]: chararray([b'sample1', b'sample2'], dtype='|S7')
In [129]: timeit [item.split('-')[0] for item in alist]
664 ns ± 32.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [130]: timeit np.char.array(carr.split('-'))[:,0]
20.5 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For the simple task of clipping the strings, there is a fast numpy way - using a shorter dtype:

In [131]: [item[0] for item in alist]
Out[131]: ['s', 's']
In [132]: carr.astype('S1')
Out[132]: chararray([b's', b's'], dtype='|S1')

But assuming that's just an example, not your real world custom action, I suggest using lists.
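As a minimal sketch of the list approach applied to the custom function from the question (the names first_chars and vcustom are illustrative, not from the original post), the following compares a plain list comprehension with np.vectorize:

```python
import numpy as np

def custom(text):
    # The custom function from the question: return the first character.
    return text[0]

alist = ["sample1-sample", "sample2-sample"]
arr = np.array(alist)

# Plain Python loop over the list; convert to an array only at the end.
first_chars = np.array([custom(item) for item in alist])

# np.vectorize gives array-style calling syntax, but it still calls
# custom() once per element at Python speed.
vcustom = np.vectorize(custom)
first_chars_vec = vcustom(arr)

print(first_chars.tolist())
print(first_chars_vec.tolist())
```

Both calls should give the same values. The np.vectorize version is a convenience wrapper, so there is no reason to expect it to beat the list comprehension.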

The np.char documentation recommends using the np.char functions with an ordinary array instead of np.char.array. The functionality is basically the same, but using the arr above:

In [140]: timeit np.array(np.char.split(arr, '-').tolist())[:,0]
13.8 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

np.char functions often produce string dtype arrays, but split creates an object dtype array of lists:

In [141]: np.char.split(arr, '-')
Out[141]: 
array([list(['sample1', 'sample']), list(['sample2', 'sample'])],
      dtype=object)

Object dtype arrays are essentially lists.
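To illustrate, here is a small sketch (the variable names split_arr and firsts are assumptions for this example) showing that the elements of the object array are ordinary Python lists, so iterating over it works just like iterating over a list of lists:

```python
import numpy as np

arr = np.array(["sample1-sample", "sample2-sample"])

# np.char.split returns an object dtype array whose elements are Python lists.
split_arr = np.char.split(arr, '-')

# Each element is a plain list, so ordinary list indexing applies.
firsts = [parts[0] for parts in split_arr]

print(split_arr.dtype)
print(type(split_arr[0]))
print(firsts)
```

Because each element is a separate Python object, NumPy cannot operate on the contents in compiled code, which is why this behaves like a list in terms of speed.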

In [145]: timeit [item[0] for item in np.char.split(arr, '-').tolist()]
9.08 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Your code is relatively slow because it takes time to convert this array of lists into a new chararray.
