
I need to find a small numpy array in a much larger numpy array. For example:

import numpy as np
a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])

A function

find_numpy_array_in_other_numpy_array(a, b)

should return indices

[3, 4, 11]

that represent where the complete numpy array a appears in the complete numpy array b.

There is a brute force approach to this problem that is slow when dealing with very large b arrays:

ok = []
for idx in range(b.size - a.size + 1):
    if np.all(a == b[idx : idx + a.size]):
        ok.append(idx)

I am looking for a much faster way to find all indices of the full array a in array b. The fast approach should also support other comparison functions, e.g. finding the worst-case difference between a and b:

diffs = []
for idx in range(b.size - a.size + 1):
    bi = b[idx : idx + a.size]
    diff = np.nanmax(np.abs(bi - a))
    diffs.append(diff)

3 Answers


Generic solution setup

For a generic solution, we can create a 2D array of sliding windows and then perform the relevant operations -

from skimage.util.shape import view_as_windows

b2D = view_as_windows(b, len(a))

The same windows can also be built with NumPy's own np.lib.stride_tricks.sliding_window_view (NumPy 1.20+).
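A sketch of that NumPy-only route, assuming NumPy 1.20 or later (where sliding_window_view was added); for 1D input it yields the same 2D window view as view_as_windows:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])

# Each row is one length-len(a) window of b; this is a view, not a copy.
b2D = sliding_window_view(b, len(a))
print(b2D.shape)  # (14, 2)
```

Because b2D is a strided view, building it costs O(1) extra memory regardless of b's size.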

Problem #1

Then, to solve the matching-indices problem, it's simply -

matching_indices = np.flatnonzero((b2D == a).all(axis=1))

Problem #2

To solve the second problem, note that any ufunc reduction that produces one output element per window translates into a reduction along the second axis of the windowed array, using that ufunc's axis argument -

diffs = np.nanmax(np.abs(b2D - a), axis=1)
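Putting the two pieces together on the arrays from the question, here is a minimal end-to-end sketch. It uses NumPy's sliding_window_view, which builds the same b2D for 1D input, so it runs without scikit-image (NumPy 1.20+ assumed):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])

# 2D view of all length-len(a) windows of b (no copy is made).
b2D = sliding_window_view(b, len(a))

# Problem #1: rows that equal a exactly.
matching_indices = np.flatnonzero((b2D == a).all(axis=1))
print(matching_indices)  # [ 3  4 11]

# Problem #2: worst-case absolute difference per window.
diffs = np.nanmax(np.abs(b2D - a), axis=1)
print(diffs[matching_indices])  # [0 0 0] - exact matches differ by zero
```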

1 Comment

The implementation with scikit-image is fantastic! A 5-minute code-execution step was reduced to 5 seconds.

The following code finds all matches of the 1st element of your sequence (a) in array b. It then builds a 2D array whose columns are the possible sequence candidates, compares them to the full sequence, and filters the initial indices:

seq, arr = a, b
len_seq = len(seq)
ini_idx = (arr[:-len_seq + 1] == seq[0]).nonzero()[0]  # indices of possible sequence candidates
seq_candidates = arr[np.arange(1, len_seq)[:, None] + ini_idx]  # columns with possible sequence candidates
mask = (seq_candidates == seq[1:, None]).all(axis=0)
idx = ini_idx[mask]
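A quick runnable check of this approach on the question's arrays. The wrapper name find_by_first_element is mine, and the slice is written as arr[:len(arr) - len_seq + 1] so that a length-1 sequence also works:

```python
import numpy as np

def find_by_first_element(seq, arr):
    # Start positions where the first element matches and a full
    # window of len(seq) still fits inside arr.
    len_seq = len(seq)
    ini_idx = (arr[:len(arr) - len_seq + 1] == seq[0]).nonzero()[0]
    # Columns hold the remaining elements of each candidate window.
    seq_candidates = arr[np.arange(1, len_seq)[:, None] + ini_idx]
    # Keep candidates whose tail equals the rest of the sequence.
    mask = (seq_candidates == seq[1:, None]).all(axis=0)
    return ini_idx[mask]

a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])
print(find_by_first_element(a, b))  # [ 3  4 11]
```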


You can consider using Numba to compile the function, for example:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def search_in_array(a, b):
    # One boolean per candidate start position.
    idx = np.empty(len(b) - len(a) + 1, dtype=np.bool_)
    for i in nb.prange(len(idx)):  # parallel loop over windows
        idx[i] = np.all(a == b[i:i + len(a)])
    return np.where(idx)[0]

a = np.array([1, 1])
b = np.array([2, 3, 3, 1, 1, 1, 8, 3, 1, 6, 0, 1, 1, 3, 4])
print(search_in_array(a, b))
# [ 3  4 11]

A quick benchmark:

import numpy as np

np.random.seed(100)
a = np.random.randint(5, size=10)
b = np.random.randint(5, size=10_000_000)

# Non-compiled function
%timeit search_in_array.py_func(a, b)
# 22.8 s ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Compiled function
%timeit search_in_array(a, b)
# 54.7 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you see, you can get a ~400x speedup and the memory cost is relatively low (a boolean array the same size as the big array).

