8

I am trying to find patterns in a numpy array, called values. I'd like to return the starting index position of the pattern. I know I could iterative over each element and check whether that element and the next one match the pattern, but over a large dataset that is incredibly inefficient and am looking for a better alternative.

I've got a working solution using np.where for searching for a single value, but I can't get it to work with finding a pattern or two numbers.

Example:

import numpy as np
values = np.array([0,1,2,1,2,4,5,6,1,2,1])
searchval = [1,2]
print  np.where(values == searchval)[0]

Output:

[]

Expected Output:

[1, 3, 8]
1

6 Answers 6

7

Couldn't you simply use np.where (assuming this is the optimal way to find an element) and then only check pattens which satisfy the first condition.

import numpy as np
values = np.array([0,1,2,1,2,4,5,6,1,2,1])
searchval = [1,2]
N = len(searchval)
possibles = np.where(values == searchval[0])[0]

solns = []
for p in possibles:
    check = values[p:p+N]
    if np.all(check == searchval):
        solns.append(p)

print(solns)
Sign up to request clarification or add additional context in comments.

2 Comments

If the input is random (there are a few repeat in values) this solution will be fast
This fails if the searchval[0] is closer to the end of the array than the length of searchval, e.g.: values = np.array([0,1,2,1,2,4,5,6,1,2,1,3]) with searchval = [1,2,3], because it results in different lengths for check and searchval. This can be solved with an extra check of the lengths or a try/except block.
7

Here's a straight forward approach to using where. Start with a logical expression that finds the matches:

In [670]: values = np.array([0,1,2,1,2,4,5,6,1,2,1])
     ...: searchval = [1,2]
     ...: 
In [671]: (values[:-1]==searchval[0]) & (values[1:]==searchval[1])
Out[671]: array([False,  True, False,  True, False, False, False, False,  True, False], dtype=bool)
In [672]: np.where(_)
Out[672]: (array([1, 3, 8], dtype=int32),)

That could be generalized into a loop that operates on multiple searchval. Getting the slice range correct will take some fiddling. The roll suggested in another answer might be easier, but I suspect a bit slower.

As long as searchval is small compared to values this general approach should be efficient. There is a np.in1d that does this sort of match, but with a or test. So it isn't applicable. But it too uses this iterative approach is the searchval list is small enough.

Generalized slicing

In [716]: values
Out[716]: array([0, 1, 2, 1, 2, 4, 5, 6, 1, 2, 1])
In [717]: searchvals=[1,2,1]
In [718]: idx = [np.s_[i:m-n+1+i] for i in range(n)]
In [719]: idx
Out[719]: [slice(0, 9, None), slice(1, 10, None), slice(2, 11, None)]
In [720]: [values[idx[i]] == searchvals[i] for i in range(n)]
Out[720]: 
[array([False,  True, False,  True, False, False, False, False,  True], dtype=bool),
 array([False,  True, False,  True, False, False, False, False,  True], dtype=bool),
 array([False,  True, False, False, False, False,  True, False,  True], dtype=bool)]
In [721]: np.all(_, axis=0)
Out[721]: array([False,  True, False, False, False, False, False, False,  True], dtype=bool)
In [722]: np.where(_)
Out[722]: (array([1, 8], dtype=int32),)

I used the intermediate np.s_ to look at the slices and make sure they look reasonable.

as_strided

An advanced trick would be to use as_strided to construct the 'rolled' array and perform a 2d == test on that. as_strided is neat but tricky. To use it correctly you have to understand strides, and get the shape correct.

In [740]: m,n = len(values), len(searchvals)
In [741]: values.shape
Out[741]: (11,)
In [742]: values.strides
Out[742]: (4,)
In [743]: 
In [743]: M = as_strided(values, shape=(n,m-n+1),strides=(4,4))
In [744]: M
Out[744]: 
array([[0, 1, 2, 1, 2, 4, 5, 6, 1],
       [1, 2, 1, 2, 4, 5, 6, 1, 2],
       [2, 1, 2, 4, 5, 6, 1, 2, 1]])
In [745]: M == np.array(searchvals)[:,None]
Out[745]: 
array([[False,  True, False,  True, False, False, False, False,  True],
       [False,  True, False,  True, False, False, False, False,  True],
       [False,  True, False, False, False, False,  True, False,  True]], dtype=bool)
In [746]: np.where(np.all(_,axis=0))
Out[746]: (array([1, 8], dtype=int32),)

2 Comments

Good idea on slicing, should be pretty efficient for decent sized patterns.
Can you use the as_strided on a matrix?
3

I think this does the job:

np.where((values == 1) & (np.roll(values,-1) == 2))[0]

1 Comment

Since we have a great collection of soutions I was currious about the run times and found the hpaulij solutions are around 2 times faster on pretty long random arrays (e.g. 1 mil entries) compared to the simple roll. The solution by Ed Smith is about 100 times slower and the one by betontalpfa another 100 times. Number of entries and hits changes the numbers quite a bit but not the overall ranking.
2

If the input is random Ed Smith solution is faster. But if you has a few set of available values this hash-solution can help:

"""
Can be replaced with any revertable hash
"""
def my_hash(rem, h, add):
    return rem^h^add

"""
Imput
"""
values = np.array([0,1,2,1,2,4,5,6,1,2,1])
searchval = [1,2]


"""
Prepare
"""
sh = 0
vh = 0
ls = len(searchval)
lv = len(values)

for i in range(0, len(searchval)):
    vh = my_hash(0, vh, values[i])
    sh = my_hash(0, sh, searchval[i])

"""
Find matches
"""
for i in range(0, lv-ls):
    if sh == vh:
        eq = True
        for j in range(0, ls):
            if values[i+j] != searchval[j]:
                eq = False
                break
        if eq:
            print i
    vh = my_hash(values[i], vh, values[i+ls])

Comments

2

Compact straitforward solution will be a "legal" variant of as_strided solution. Others have mentioned np.roll. But here is an universal solution with the only circle (132 µs).

seq = np.array([0,1,2,1,2,4,5,6,1,2,1])
patt = np.array([1,2])

Seq = np.vstack([np.roll(seq, shift) for shift in -np.arange(len(patt))]).T
where(all(Seq == patt, axis=1))[0]

The another option for sequences with small integers will be converting to string. It is faster per near 6 times (20 µs). For small positive integers only!

import re

def to_string(arr):
    return ''.join(map(chr, arr))

array([m.start() for m in re.finditer(to_string(patt), to_string(seq))])

Comments

0

Another approach is to use scipy.ndimage.generic_filter to build a pattern filter, which also supports multidimensional patterns.

import numpy as np
import scipy.signal, scipy.ndimage

def pattern_filter(input, pattern):
    input = np.array(input)
    pattern = np.array(pattern)
    
    def _pattern_filter_(_input_, _pattern_):
        return np.all(_input_ == _pattern_.flatten())
    
    return scipy.ndimage.generic_filter(
        input, 
        _pattern_filter_, 
        size=pattern.shape,
        origin=-(np.array(pattern.shape)//2), 
        extra_arguments=(pattern,)).astype(bool)
    
values = np.array([0,1,2,1,2,4,5,6,1,2,1])
pattern = [1,2]
overlap = pattern_filter(values, pattern)
np.argwhere(overlap)
# array([[1], [3], [8]])

Multidimensional example:

values = np.array([
    [0,1,2,1,2,4,5,6,1,2,1],
    [0,2,1,2,1,4,5,6,2,1,1],
    [0,0,2,1,0,0,0,0,0,0,0],
])
pattern = [[1,2], 
           [2,1]]
overlap = pattern_filter(values, pattern)
np.argwhere(overlap) 
# array([[0, 1], [0, 3], [0, 8], [1, 2]])

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.