
Is there a function in NumPy that allows you to take 4 records at a time and see where they match with a second dataset? Once there is a match, move on to the next 4 records of the first dataset. It won't always be every 4 records, but I am using this as an example.

So if dataset one had: 1,5,7,8,10,12,6,1,3,6,8,9

And the second dataset had: 1,5,7,8,11,15,6,1,3,6,10,6

My result will be: 1,5,7,8, 6,1,3,6

POST EDIT: My second example datasets:

 import numpy as np

 a = np.array([15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,1,1,3,5,6,0,9,47,1,1,7,7,0,0,45,12,17,45])

 b = np.array([6,0,0,15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,3,4,6,0,9,47,1,1,7,7,0,0,45,12,16,1,9,3,30])

Here's another snapshot of an example (screenshot not reproduced here).

Thank you in advance for looking at my question!!

3 Answers


Update: for the more difficult and more interesting alignment problem it is probably best not to reinvent the wheel but to rely on python's difflib:

from difflib import SequenceMatcher
import numpy as np

k=4

a = np.array([15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,1,1,3,5,6,0,9,47,1,1,7,7,0,0,45,12,17,45])

b = np.array([6,0,0,15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,3,4,6,0,9,47,1,1,7,7,0,0,45,12,16,1,9,3,30])

sm = SequenceMatcher(a=a, b=b)
matches = sm.get_matching_blocks()
matches = [m for m in matches if m.size >= k]
# [Match(a=0, b=3, size=17), Match(a=21, b=22, size=12)]
consensus = [a[m.a:m.a+m.size] for m in matches]
# [array([15, 15,  0,  0, 10, 10,  0,  0,  2,  1,  8,  8, 42,  2,  4,  4,  3]), array([ 6,  0,  9, 47,  1,  1,  7,  7,  0,  0, 45, 12])]
consfour = [a[m.a:m.a + m.size // k * k] for m in matches]
# [array([15, 15,  0,  0, 10, 10,  0,  0,  2,  1,  8,  8, 42,  2,  4,  4]), array([ 6,  0,  9, 47,  1,  1,  7,  7,  0,  0, 45, 12])]
summary = [np.c_[np.add.outer(np.arange(m.size // k * k), (m.a, m.b)), c]
           for m, c in zip(matches, consfour)]
merge = np.concatenate(summary, axis=0)
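For intuition, the same SequenceMatcher approach applied to the small datasets from the original question recovers exactly the expected blocks (a minimal sketch; variable names are illustrative):

```python
from difflib import SequenceMatcher

k = 4
d1 = [1, 5, 7, 8, 10, 12, 6, 1, 3, 6, 8, 9]
d2 = [1, 5, 7, 8, 11, 15, 6, 1, 3, 6, 10, 6]

sm = SequenceMatcher(a=d1, b=d2)
# get_matching_blocks() also returns a terminal zero-size block,
# which the size filter drops along with any too-short matches
blocks = [m for m in sm.get_matching_blocks() if m.size >= k]
result = [d1[m.a:m.a + m.size] for m in blocks]
print(result)  # [[1, 5, 7, 8], [6, 1, 3, 6]]
```

Each Match carries the start position in both sequences, so the same slices could be taken from d2 instead.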

Below is my original solution assuming already aligned and same-length arrays:

Here is a hybrid solution: numpy finds the runs of consecutive matches and cuts them out, then a list comprehension applies the length constraint:

import numpy as np

d1 = np.array([7,1,5,7,8,0,6,9,0,10,12,6,1,3,6,8,9])
d2 = np.array([8,1,5,7,8,0,6,9,0,11,15,6,1,3,6,10,6])

k = 4

# find matches
m = d1 == d2

# find switches between match, no match
sw = np.where(m[:-1] != m[1:])[0] + 1

# split
mnm = np.split(d1, sw)

# select matches
ones_ = mnm[1-m[0]::2]

# apply length constraint
res = [blck[i:i+k] for blck in ones_ for i in range(len(blck)-k+1)]
# [array([1, 5, 7, 8]), array([5, 7, 8, 0]), array([7, 8, 0, 6]), array([8, 0, 6, 9]), array([0, 6, 9, 0]), array([6, 1, 3, 6])]
res_no_ovlp = [blck[k*i:k*i+k] for blck in ones_ for i in range(len(blck)//k)]
# [array([1, 5, 7, 8]), array([0, 6, 9, 0]), array([6, 1, 3, 6])]
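As an aside, the overlapping k-windows that the res comprehension builds can also be produced without a Python loop via np.lib.stride_tricks.sliding_window_view (available since NumPy 1.20); a minimal sketch on one matched block:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

k = 4
blck = np.array([1, 5, 7, 8, 0, 6, 9, 0])  # one matched block, as in ones_
# one row per window start position, shape (len(blck) - k + 1, k)
windows = sliding_window_view(blck, k)
print(windows[0])  # [1 5 7 8]
```

The rows of windows are views into blck, so this costs no extra memory until you copy them out.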

13 Comments

Hi @Paul, sorry for the late reply, but I received a msg that the question wasn't posted. I tried your solution with a larger dataset, and it didn't work. I posted an image of a more detailed example. My actual datasets range around 4k records. Thank you so much for your help!!
I tried it with the data set example that I added in the post -- I typed out the array, to help out on the typing :-) and I got the error: sw = np.where(m[:-1] != m[1:])[0] + 1 TypeError: 'bool' object is not subscriptable
m is a bool not an array
both d1 and d2 are arrays. Sorry, but I just started numpy so I'm trying to figure things out as I go. Thanks again.
@yanci, ok the problem seems to be that your a and b have different lengths. (35 and 39). I thought that wasn't allowed. If it is, you'd have to better explain by which rules your groups of four can be aligned to match them.

You can use matrix masking, like this:

import numpy as np
from scipy.sparse import dia_matrix
a = np.array([1,5,7,8,10,12,6,1,3,6,8,9])
b = np.array([1,5,7,8,11,15,6,1,3,6,10,6])
mask = dia_matrix((np.ones((1, a.size)).repeat(4, axis=0), np.arange(4)),
                  shape=(a.size, b.size), dtype=int)  # np.int was removed in NumPy 1.24
print(mask.toarray())
matches = a[mask.T.dot(mask.dot(a == b) == 4).astype(bool)]  # np.bool likewise
print(matches)

This will output:

array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
[1 5 7 8 6 1 3 6]

You can think about how the matrix multiplication works to get this result.
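To see why it works: each row of mask sums a == b over one window of 4 consecutive positions, a row sum of 4 flags a window that matches completely, and mask.T spreads that flag back onto the elements the window covers. The same logic written with explicit windows (a plain-numpy sketch, not the sparse version above):

```python
import numpy as np

a = np.array([1, 5, 7, 8, 10, 12, 6, 1, 3, 6, 8, 9])
b = np.array([1, 5, 7, 8, 11, 15, 6, 1, 3, 6, 10, 6])
k = 4

eq = a == b
# window_ok[i] is True when all k elements starting at i match
window_ok = np.array([eq[i:i + k].all() for i in range(len(eq) - k + 1)])
# an element is kept if some fully matching window covers it
keep = np.zeros(len(a), dtype=bool)
for i in np.flatnonzero(window_ok):
    keep[i:i + k] = True
print(a[keep])  # [1 5 7 8 6 1 3 6]
```

The sparse mask simply performs both of these steps as one matrix product per direction.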

Scaling

For scaling, I tested with 1e3, 1e5, and 1e7 elements and got,

1e3 - 0.019184964010491967
1e5 - 0.4330314120161347
1e7 - 144.54082221200224

See the gist. Not sure why such a hard jump at 1e7 elements.

6 Comments

That's clever, but does it require that the two lists are the same length?
Clever, indeed, but maybe use a sparse matrix, so it doesn't become O(n^2)?
Hi @aidan.plenert.macdonald, I tried your solution with a larger dataset, and I received a too many indices for array error. I posted an image of a more detailed example. My actual datasets range around 4k records. Thank you so much for your help!!
@innisfree As currently written, yes, but the actual masking trick doesn't require same length. I'm updating it now, but I think the OP only cared about same length lists.
@PaulPanzer Changed to sparse. Actually makes the code more intuitive.

This is an exercise in list comprehension. We have the data

data = [1,5,7,8,10,12,6,1,3,6,8,9]
search_data = [1,5,7,8,11,15,6,1,3,6,10,6]

First we can chunk the original data into blocks of length n

n = 4
chunks = [data[i:i + n] for i in range(len(data) - n + 1)]
search_chunks = [search_data[i:i + n] for i in range(len(search_data) - n + 1)]

Now we must select chunks from the first list that appear in the second list

hits = [c for c in chunks if c in search_chunks]
print(hits)
# [[1, 5, 7, 8], [6, 1, 3, 6]]

This may not be the optimal solution for long lists. It may improve performance to use sets, if there are likely to be repeated chunks:

chunks = set(tuple(data[i:i + n]) for i in range(len(data) - n + 1))
search_chunks = set(tuple(search_data[i:i + n]) for i in range(len(search_data) - n + 1))
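With both sides as sets of tuples, the matching chunks can then be taken as a plain set intersection (note that this discards ordering and duplicate chunks, which may or may not matter here):

```python
data = [1, 5, 7, 8, 10, 12, 6, 1, 3, 6, 8, 9]
search_data = [1, 5, 7, 8, 11, 15, 6, 1, 3, 6, 10, 6]
n = 4

# tuples are hashable, so chunks can live in sets
chunks = set(tuple(data[i:i + n]) for i in range(len(data) - n + 1))
search_chunks = set(tuple(search_data[i:i + n]) for i in range(len(search_data) - n + 1))

hits = chunks & search_chunks
print(sorted(hits))  # [(1, 5, 7, 8), (6, 1, 3, 6)]
```

Set membership tests are O(1) on average, so this avoids the linear scan that `c in search_chunks` would do on a list.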

This can be quite competitive with the above numpy solution, e.g.

import numpy as np
import time

# Generate data

len_ = 10000
max_ = 10
data = list(map(int, np.random.rand(len_) * max_))         # list() needed in Python 3,
search_data = list(map(int, np.random.rand(len_) * max_))  # where map returns an iterator

# Time list comprehension

start = time.time()

n = 4
chunks = set(tuple(data[i:i + n]) for i in range(len(data) - n + 1))
search_chunks = set(tuple(search_data[i:i + n]) for i in range(len(search_data) - n + 1))
hits = [c for c in chunks if c in search_chunks]

print(time.time() - start)

# Time numpy

a = np.array(data)
b = np.array(search_data)
mask = 1 * (np.abs(np.arange(a.size).reshape((-1, 1)) - np.arange(a.size) - 0.5) < 2)

start = time.time()

matches = a[mask.T.dot(mask.dot(a == b) == 4).astype(bool)]

print(time.time() - start)

It's typically faster here, but it depends on the number of repeated chunks, etc.

Comments
