I have a 1D numpy array of strings (dtype='U') called ops, of length 15 million, in which I need to find all the indices of a given string op; I have to repeat this lookup 83,000 times.
So far numpy is winning the race, but it still takes like 3 hours: indices = np.where(ops==op)
I also tried np.unravel_index(np.where(ops.ravel()==op), ops.shape)[0][0] without much of a difference.
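For context, here is a minimal, self-contained sketch of the setup and the two numpy approaches above; the random pool of strings and the reduced array size are assumptions for a quick local test, not the original data:

import numpy as np

# Hypothetical stand-in for the real data: strings drawn from a pool of ops.
rng = np.random.default_rng(0)
pool = np.array(["op_%d" % i for i in range(100_000)])
ops = rng.choice(pool, size=1_000_000)   # reduced from 15 million for the test
op = "op_42"

# Approach 1: boolean mask + np.where
indices = np.where(ops == op)[0]

# Approach 2: via ravel/unravel_index (no real difference, ops is already 1D)
indices2 = np.unravel_index(np.where(ops.ravel() == op), ops.shape)[0][0]

assert np.array_equal(indices, indices2)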
I'm trying a Cython approach with random data similar to the original, but it's about 40 times slower than numpy's solution. It's my first Cython code, so maybe I can improve it. Cython code:
import numpy as np
cimport numpy as np
def get_ixs(np.ndarray data, str x, np.ndarray[int, mode="c", ndim=1] xind):
    # Scan data once, writing the index of every element equal to x
    # into the preallocated buffer xind, and return the filled slice.
    cdef int count, n, i
    count = 0
    n = data.shape[0]
    i = 0
    while i < n:
        if data[i] == x:
            xind[count] = i
            count += 1
        i += 1
    return xind[0:count]
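Assuming the module compiles (e.g. with a standard cythonize build), a call from Python could look like the sketch below; the compiled module name get_ixs_cy and the preallocation strategy are assumptions:

import numpy as np
from get_ixs_cy import get_ixs   # hypothetical name of the compiled module

ops = np.array(["a", "b", "a", "c", "a"])
op = "a"

# Preallocate the output buffer; worst case every element matches.
# The dtype must match the C int in the Cython signature (np.intc).
xind = np.empty(ops.shape[0], dtype=np.intc)

print(get_ixs(ops, op, xind))   # -> [0 2 4]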
Your Cython loop is doing essentially the same thing as numpy.where here. You're maybe saving the creation of one temporary array. It might be worth trying with a Python list of unicode strings instead of a Numpy array - although it sounds counter-intuitive, it's possible that you have lots of inefficient Numpy C string<->Python string conversions that you could be avoiding.
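A minimal sketch of that suggestion, assuming the data fits comfortably in memory as a plain Python list (the example data is made up):

import numpy as np

ops_arr = np.array(["a", "b", "a", "c", "a"])
op = "a"

# Convert once up front to a plain list of Python str objects ...
ops_list = ops_arr.tolist()

# ... so each of the 83,000 lookups is a pure-Python scan with no
# per-element NumPy C string -> Python str conversion.
indices = [i for i, s in enumerate(ops_list) if s == op]
print(indices)   # -> [0, 2, 4]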