
I have a 1-D NumPy array of strings (dtype='U') called ops, of length 15 million, in which I need to find all the indices where a given string op occurs. I need to run this lookup for 83,000 different values of op.

So far NumPy is winning the race, but it still takes about 3 hours in total:

indices = np.where(ops == op)

I also tried np.unravel_index(np.where(ops.ravel() == op), ops.shape)[0][0], without much of a difference.
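For reference, a minimal synthetic setup to reproduce the problem (the pool size and string format here are illustrative assumptions, not the original data):

import numpy as np

rng = np.random.default_rng(0)
# ~15M strings drawn from a pool of 100k short codes (dtype '<U7')
pool = np.array(["op%05d" % k for k in range(100_000)])
ops = rng.choice(pool, size=15_000_000)

op = "op00042"
indices = np.where(ops == op)[0]  # one of the 83,000 lookups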

I'm trying a Cython approach with random data similar to the original, but it's about 40 times slower than NumPy's solution. It's my first Cython code, so maybe I can improve it. Cython code:

import numpy as np
cimport numpy as np

def get_ixs(np.ndarray data, str x, np.ndarray[int, mode="c", ndim=1] xind):
    # Linear scan: write every index where data[i] == x into xind,
    # then return the filled prefix.
    cdef int count, n, i
    count = 0
    n = data.shape[0]
    i = 0
    while i < n:
        if data[i] == x:
            xind[count] = i
            count += 1
        i += 1

    return xind[0:count]
I don't really see why Cython would be expected to beat numpy.where here. You're maybe saving the creation of one temporary array. It might be worth trying a plain Python list of unicode strings instead of a NumPy array: although it sounds counter-intuitive, it's possible that you have lots of inefficient NumPy C string <-> Python string conversions that you could be avoiding (a sketch of this idea follows below). Commented Dec 24, 2020 at 15:44
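A minimal sketch of that suggestion (the toy data here is made up for illustration): convert the array to a plain Python list once, so each comparison is a cheap str == str check instead of a per-element NumPy-to-str conversion:

import numpy as np

ops_arr = np.array(["toto", "titi", "toto", "titi", "tutu"])

# tolist() converts every 'U' element to a Python str once, up front
ops_list = ops_arr.tolist()

op = "toto"
indices = [i for i, s in enumerate(ops_list) if s == op]
print(indices)  # [0, 2]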

1 Answer

If you call get_ixs several times with the same data, the fastest solution is to preprocess data into a dict, which then gives O(1) (constant-time) lookups when querying strings.
Each key of the dict is a string x, and its value is an array of the indices i satisfying data[i] == x.
Here is the code:

import numpy as np

data = np.array(["toto", "titi", "toto", "titi", "tutu"])

indices = np.arange(len(data))
# Sort data so that the dict can be built group by group: each list of
# indices is converted to a compact ndarray as soon as the string changes,
# which reduces peak memory usage.
indices_data_sorted = np.argsort(data)
data = data[indices_data_sorted]
indices = indices[indices_data_sorted]

# Construct the dict str -> ndarray of indices (ndarrays for lower memory consumption)
dict_str_to_indices = dict()
prev_str = None
list_idx = []  # holds the indices for the current string
for i, s in zip(indices, data):
    if s != prev_str:
        # The current string has changed, so store the finished group in the dict
        if prev_str is not None:
            dict_str_to_indices[prev_str] = np.array(list_idx, dtype="int32")
        list_idx.clear()
        prev_str = s
    list_idx.append(i)

if prev_str is not None:  # add the ndarray for the last string
    dict_str_to_indices[prev_str] = np.array(list_idx, dtype="int32")

def get_ixs(dict_str_to_indices: dict, x: str):
    return dict_str_to_indices[x]

print(get_ixs(dict_str_to_indices, "toto"))
print(get_ixs(dict_str_to_indices, "titi"))
print(get_ixs(dict_str_to_indices, "tutu"))

Output:

[0 2]
[1 3]
[4]

If you call get_ixs many times with the same dict_str_to_indices, this is the asymptotically optimal solution (O(1) per lookup).
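For comparison, a one-pass build with collections.defaultdict (a sketch, not part of the answer above) is simpler because it needs no sort, but it keeps every index in a Python list until the end, so its peak memory is higher than the sort-based construction:

import numpy as np
from collections import defaultdict

data = np.array(["toto", "titi", "toto", "titi", "tutu"])

# Single pass: append each index to the list for its string.
tmp = defaultdict(list)
for i, s in enumerate(data.tolist()):
    tmp[s].append(i)

# Compact the lists into int32 arrays only at the very end.
dict_str_to_indices = {s: np.array(ix, dtype="int32") for s, ix in tmp.items()}

print(dict_str_to_indices["toto"])  # [0 2]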


7 Comments

Sure, but how do I feed the values for each key? (That is, how do I find the indices of each string match in the array?)
I'm going to update my answer with the code.
If i is a list, then data[i] is a list and it can't be compared to a string x with data[i] == x. But my question is how to find the list i containing the indices of x in data in the first place.
I meant that the value is a list of integers, each of which satisfies data[i] == x. I've updated my answer to make it clearer.
Marked as answer; this was about 15 times faster than NumPy in the test. But memory could be a problem as the data scales (it crashed my PC). A lower-memory variant is sketched after these comments.
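If the dict itself is what blows up memory, a lower-memory variant (sketched here as a suggestion, not part of the accepted answer) is to sort once and answer each query with binary search via np.searchsorted. It stores only two extra arrays and no dict, at the cost of O(log n) per query instead of O(1):

import numpy as np

data = np.array(["toto", "titi", "toto", "titi", "tutu"])

# One-time O(n log n) preprocessing: a sorted copy plus the sort order.
order = np.argsort(data)
sorted_data = data[order]

def get_ixs(x: str) -> np.ndarray:
    # Binary search for the contiguous run of entries equal to x.
    lo = np.searchsorted(sorted_data, x, side="left")
    hi = np.searchsorted(sorted_data, x, side="right")
    return order[lo:hi]  # original indices where data == x

print(get_ixs("toto"))  # [0 2]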