
I have a 1-D NumPy array of strings (dtype='U') called ops, of length 15 million, in which I need to find all the indices where a given string op occurs. I need to run this lookup for 83,000 different values of op.

So far NumPy is winning the race, but it still takes about 3 hours in total:

indices = np.where(ops == op)

I also tried np.unravel_index(np.where(ops.ravel() == op), ops.shape)[0][0], without much of a difference.
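For reference, a minimal synthetic setup to reproduce the problem (the pool size and string format here are illustrative assumptions, not the original data):

import numpy as np

rng = np.random.default_rng(0)
# ~15M strings drawn from a pool of 100k short codes (dtype '<U7')
pool = np.array(["op%05d" % k for k in range(100_000)])
ops = rng.choice(pool, size=15_000_000)

op = "op00042"
indices = np.where(ops == op)[0]  # one of the 83,000 lookups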

I'm trying a Cython approach with random data similar to the original, but it's about 40 times slower than NumPy's solution. It's my first Cython code, so maybe I can improve it. Cython code:

import numpy as np
cimport numpy as np

def get_ixs(np.ndarray data, str x, np.ndarray[int, mode="c", ndim=1] xind):
    # Linear scan: write every index where data[i] == x into xind,
    # then return the filled prefix.
    cdef int count, n, i
    count = 0
    n = data.shape[0]
    i = 0
    while i < n:
        if data[i] == x:
            xind[count] = i
            count += 1
        i += 1

    return xind[0:count]
I don't really see why Cython would be expected to beat numpy.where here. You're maybe saving the creation of one temporary array. It might be worth trying a plain Python list of unicode strings instead of a NumPy array: although it sounds counter-intuitive, it's possible that you have lots of inefficient NumPy C string <-> Python string conversions that you could be avoiding (a sketch of this idea follows below). Commented Dec 24, 2020 at 15:44
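A minimal sketch of that suggestion (the toy data here is made up for illustration): convert the array to a plain Python list once, so each comparison is a cheap str == str check instead of a per-element NumPy-to-str conversion:

import numpy as np

ops_arr = np.array(["toto", "titi", "toto", "titi", "tutu"])

# tolist() converts every 'U' element to a Python str once, up front
ops_list = ops_arr.tolist()

op = "toto"
indices = [i for i, s in enumerate(ops_list) if s == op]
print(indices)  # [0, 2]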

1 Answer

If you call get_ixs several times with the same data, the fastest solution is to preprocess data into a dict, which then gives O(1) (constant-time) lookups when querying strings.
Each key of the dict is a string x, and its value is an array of the indices i satisfying data[i] == x.
Here is the code:

import numpy as np

data = np.array(["toto", "titi", "toto", "titi", "tutu"])

indices = np.arange(len(data))
# Sort data so that the dict can be built group by group: each list of
# indices is converted to a compact ndarray as soon as the string changes,
# which reduces peak memory usage.
indices_data_sorted = np.argsort(data)
data = data[indices_data_sorted]
indices = indices[indices_data_sorted]

# Construct the dict str -> ndarray of indices (ndarrays for lower memory consumption)
dict_str_to_indices = dict()
prev_str = None
list_idx = []  # holds the indices for the current string
for i, s in zip(indices, data):
    if s != prev_str:
        # The current string has changed, so store the finished group in the dict
        if prev_str is not None:
            dict_str_to_indices[prev_str] = np.array(list_idx, dtype="int32")
        list_idx.clear()
        prev_str = s
    list_idx.append(i)

if prev_str is not None:  # add the ndarray for the last string
    dict_str_to_indices[prev_str] = np.array(list_idx, dtype="int32")

def get_ixs(dict_str_to_indices: dict, x: str):
    return dict_str_to_indices[x]

print(get_ixs(dict_str_to_indices, "toto"))
print(get_ixs(dict_str_to_indices, "titi"))
print(get_ixs(dict_str_to_indices, "tutu"))

Output:

[0 2]
[1 3]
[4]

If you call get_ixs many times with the same dict_str_to_indices, this is the asymptotically optimal solution (O(1) per lookup).
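For comparison, a one-pass build with collections.defaultdict (a sketch, not part of the answer above) is simpler because it needs no sort, but it keeps every index in a Python list until the end, so its peak memory is higher than the sort-based construction:

import numpy as np
from collections import defaultdict

data = np.array(["toto", "titi", "toto", "titi", "tutu"])

# Single pass: append each index to the list for its string.
tmp = defaultdict(list)
for i, s in enumerate(data.tolist()):
    tmp[s].append(i)

# Compact the lists into int32 arrays only at the very end.
dict_str_to_indices = {s: np.array(ix, dtype="int32") for s, ix in tmp.items()}

print(dict_str_to_indices["toto"])  # [0 2]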


7 Comments

Sure, but how do I feed the values for each key? (That is, how do I find the indices of each string match in the array?)
I'm going to update my answer with the code.
If i is a list, then data[i] is a list and it can't be compared to a string x with data[i] == x. But my question is how to find the list i containing the indices of x in data in the first place.
I meant that the value is a list of integers, each of which satisfies data[i] == x. I've updated my answer to make it clearer.
Marked as answer; this was about 15 times faster than NumPy in the test. But memory could be a problem as the data scales (it crashed my PC). A lower-memory variant is sketched after these comments.
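If the dict itself is what blows up memory, a lower-memory variant (sketched here as a suggestion, not part of the accepted answer) is to sort once and answer each query with binary search via np.searchsorted. It stores only two extra arrays and no dict, at the cost of O(log n) per query instead of O(1):

import numpy as np

data = np.array(["toto", "titi", "toto", "titi", "tutu"])

# One-time O(n log n) preprocessing: a sorted copy plus the sort order.
order = np.argsort(data)
sorted_data = data[order]

def get_ixs(x: str) -> np.ndarray:
    # Binary search for the contiguous run of entries equal to x.
    lo = np.searchsorted(sorted_data, x, side="left")
    hi = np.searchsorted(sorted_data, x, side="right")
    return order[lo:hi]  # original indices where data == x

print(get_ixs("toto"))  # [0 2]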