
Is there a function in NumPy that allows you to take 4 records at a time and see where they match with a second dataset? Once there is a match, move on to the next 4 records of the first dataset. It won't always be every 4 records, but I am using this as an example.

So if dataset one had: 1,5,7,8,10,12,6,1,3,6,8,9

And the second dataset had: 1,5,7,8,11,15,6,1,3,6,10,6

My result will be: 1,5,7,8, 6,1,3,6

POST EDIT: My second example datasets:

 import numpy as np

 a = np.array([15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,1,1,3,5,6,0,9,47,1,1,7,7,0,0,45,12,17,45])

 b = np.array([6,0,0,15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,3,4,6,0,9,47,1,1,7,7,0,0,45,12,16,1,9,3,30])

Here's another snapshot of an example (screenshot not reproduced here).

Thank you in advance for looking at my question!!

3 Answers


Update: for the more difficult and more interesting alignment problem it is probably best not to reinvent the wheel but to rely on python's difflib:

from difflib import SequenceMatcher
import numpy as np

k=4

a = np.array([15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,1,1,3,5,6,0,9,47,1,1,7,7,0,0,45,12,17,45])

b = np.array([6,0,0,15,15,0,0,10,10,0,0,2,1,8,8,42,2,4,4,3,3,4,6,0,9,47,1,1,7,7,0,0,45,12,16,1,9,3,30])

sm = SequenceMatcher(a=a, b=b)
matches = sm.get_matching_blocks()
matches = [m for m in matches if m.size >= k]
# [Match(a=0, b=3, size=17), Match(a=21, b=22, size=12)]
consensus = [a[m.a:m.a+m.size] for m in matches]
# [array([15, 15,  0,  0, 10, 10,  0,  0,  2,  1,  8,  8, 42,  2,  4,  4,  3]), array([ 6,  0,  9, 47,  1,  1,  7,  7,  0,  0, 45, 12])]
consfour = [a[m.a:m.a + m.size // k * k] for m in matches]
# [array([15, 15,  0,  0, 10, 10,  0,  0,  2,  1,  8,  8, 42,  2,  4,  4]), array([ 6,  0,  9, 47,  1,  1,  7,  7,  0,  0, 45, 12])]
summary = [np.c_[np.add.outer(np.arange(m.size // k * k), (m.a, m.b)), c]
           for m, c in zip(matches, consfour)]
merge = np.concatenate(summary, axis=0)
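For intuition, the same SequenceMatcher approach applied to the small datasets from the original question recovers exactly the expected blocks (a minimal sketch; variable names are illustrative):

```python
from difflib import SequenceMatcher

k = 4
d1 = [1, 5, 7, 8, 10, 12, 6, 1, 3, 6, 8, 9]
d2 = [1, 5, 7, 8, 11, 15, 6, 1, 3, 6, 10, 6]

sm = SequenceMatcher(a=d1, b=d2)
# get_matching_blocks() also returns a terminal zero-size block,
# which the size filter drops along with any too-short matches
blocks = [m for m in sm.get_matching_blocks() if m.size >= k]
result = [d1[m.a:m.a + m.size] for m in blocks]
print(result)  # [[1, 5, 7, 8], [6, 1, 3, 6]]
```

Each Match carries the start position in both sequences, so the same slices could be taken from d2 instead.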

Below is my original solution assuming already aligned and same-length arrays:

Here is a hybrid solution: numpy finds the runs of consecutive matches and cuts them out, then a list comprehension applies the length constraint:

import numpy as np

d1 = np.array([7,1,5,7,8,0,6,9,0,10,12,6,1,3,6,8,9])
d2 = np.array([8,1,5,7,8,0,6,9,0,11,15,6,1,3,6,10,6])

k = 4

# find matches
m = d1 == d2

# find switches between match, no match
sw = np.where(m[:-1] != m[1:])[0] + 1

# split
mnm = np.split(d1, sw)

# select matches
ones_ = mnm[1-m[0]::2]

# apply length constraint
res = [blck[i:i+k] for blck in ones_ for i in range(len(blck)-k+1)]
# [array([1, 5, 7, 8]), array([5, 7, 8, 0]), array([7, 8, 0, 6]), array([8, 0, 6, 9]), array([0, 6, 9, 0]), array([6, 1, 3, 6])]
res_no_ovlp = [blck[k*i:k*i+k] for blck in ones_ for i in range(len(blck)//k)]
# [array([1, 5, 7, 8]), array([0, 6, 9, 0]), array([6, 1, 3, 6])]
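As an aside, the overlapping k-windows that the res comprehension builds can also be produced without a Python loop via np.lib.stride_tricks.sliding_window_view (available since NumPy 1.20); a minimal sketch on one matched block:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

k = 4
blck = np.array([1, 5, 7, 8, 0, 6, 9, 0])  # one matched block, as in ones_
# one row per window start position, shape (len(blck) - k + 1, k)
windows = sliding_window_view(blck, k)
print(windows[0])  # [1 5 7 8]
```

The rows of windows are views into blck, so this costs no extra memory until you copy them out.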

13 Comments

Hi @Paul, sorry for the late reply, but I received a msg that the question wasn't posted. I tried your solution with a larger dataset, and it didn't work. I posted an image of a more detailed example. My actual datasets range around 4k records. Thank you so much for your help!!
I tried it with the data set example that I added in the post -- I typed out the array, to help out on the typing :-) and I got the error: sw = np.where(m[:-1] != m[1:])[0] + 1 TypeError: 'bool' object is not subscriptable
m is a bool not an array
both d1 and d2 are arrays. Sorry, but I just started numpy so I'm trying to figure things out as I go. Thanks again.
@yanci, ok the problem seems to be that your a and b have different lengths. (35 and 39). I thought that wasn't allowed. If it is, you'd have to better explain by which rules your groups of four can be aligned to match them.

You can use matrix masking, like this:

import numpy as np
from scipy.sparse import dia_matrix
a = np.array([1,5,7,8,10,12,6,1,3,6,8,9])
b = np.array([1,5,7,8,11,15,6,1,3,6,10,6])
mask = dia_matrix((np.ones((1, a.size)).repeat(4, axis=0), np.arange(4)),
                  shape=(a.size, b.size), dtype=int)  # np.int was removed in NumPy 1.24
print(mask.toarray())
matches = a[mask.T.dot(mask.dot(a == b) == 4).astype(bool)]  # np.bool likewise
print(matches)

This will output:

array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
[1 5 7 8 6 1 3 6]

You can think about how the matrix multiplication works to get this result.
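To see why it works: each row of mask sums a == b over one window of 4 consecutive positions, a row sum of 4 flags a window that matches completely, and mask.T spreads that flag back onto the elements the window covers. The same logic written with explicit windows (a plain-numpy sketch, not the sparse version above):

```python
import numpy as np

a = np.array([1, 5, 7, 8, 10, 12, 6, 1, 3, 6, 8, 9])
b = np.array([1, 5, 7, 8, 11, 15, 6, 1, 3, 6, 10, 6])
k = 4

eq = a == b
# window_ok[i] is True when all k elements starting at i match
window_ok = np.array([eq[i:i + k].all() for i in range(len(eq) - k + 1)])
# an element is kept if some fully matching window covers it
keep = np.zeros(len(a), dtype=bool)
for i in np.flatnonzero(window_ok):
    keep[i:i + k] = True
print(a[keep])  # [1 5 7 8 6 1 3 6]
```

The sparse mask simply performs both of these steps as one matrix product per direction.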

Scaling

For scaling, I tested with 1e3, 1e5, and 1e7 elements and got,

1e3 - 0.019184964010491967
1e5 - 0.4330314120161347
1e7 - 144.54082221200224

See the gist. Not sure why such a hard jump at 1e7 elements.

6 Comments

That's clever, but does it require that the two lists are the same length?
Clever, indeed, but maybe use a sparse matrix, so it doesn't become O(n^2)?
Hi @aidan.plenert.macdonald, I tried your solution with a larger dataset, and I received a too many indices for array error. I posted an image of a more detailed example. My actual datasets range around 4k records. Thank you so much for your help!!
@innisfree As currently written, yes, but the actual masking trick doesn't require same length. I'm updating it now, but I think the OP only cared about same length lists.
@PaulPanzer Changed to sparse. Actually makes the code more intuitive.

This is an exercise in list comprehension. We have the data

data = [1,5,7,8,10,12,6,1,3,6,8,9]
search_data = [1,5,7,8,11,15,6,1,3,6,10,6]

First we can chunk the original data into blocks of length n

n = 4
chunks = [data[i:i + n] for i in range(len(data) - n + 1)]
search_chunks = [search_data[i:i + n] for i in range(len(search_data) - n + 1)]

Now we must select chunks from the first list that appear in the second list

hits = [c for c in chunks if c in search_chunks]
print(hits)
# [[1, 5, 7, 8], [6, 1, 3, 6]]

This may not be the optimal solution for long lists. It may improve performance to use sets, if there are likely to be repeated chunks:

chunks = set(tuple(data[i:i + n]) for i in range(len(data) - n + 1))
search_chunks = set(tuple(search_data[i:i + n]) for i in range(len(search_data) - n + 1))
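With both sides as sets of tuples, the matching chunks can then be taken as a plain set intersection (note that this discards ordering and duplicate chunks, which may or may not matter here):

```python
data = [1, 5, 7, 8, 10, 12, 6, 1, 3, 6, 8, 9]
search_data = [1, 5, 7, 8, 11, 15, 6, 1, 3, 6, 10, 6]
n = 4

# tuples are hashable, so chunks can live in sets
chunks = set(tuple(data[i:i + n]) for i in range(len(data) - n + 1))
search_chunks = set(tuple(search_data[i:i + n]) for i in range(len(search_data) - n + 1))

hits = chunks & search_chunks
print(sorted(hits))  # [(1, 5, 7, 8), (6, 1, 3, 6)]
```

Set membership tests are O(1) on average, so this avoids the linear scan that `c in search_chunks` would do on a list.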

This can be quite competitive with the above numpy solution, e.g.

import numpy as np
import time

# Generate data

len_ = 10000
max_ = 10
data = list(map(int, np.random.rand(len_) * max_))         # list() needed in Python 3,
search_data = list(map(int, np.random.rand(len_) * max_))  # where map returns an iterator

# Time list comprehension

start = time.time()

n = 4
chunks = set(tuple(data[i:i + n]) for i in range(len(data) - n + 1))
search_chunks = set(tuple(search_data[i:i + n]) for i in range(len(search_data) - n + 1))
hits = [c for c in chunks if c in search_chunks]

print(time.time() - start)

# Time numpy

a = np.array(data)
b = np.array(search_data)
mask = 1 * (np.abs(np.arange(a.size).reshape((-1, 1)) - np.arange(a.size) - 0.5) < 2)

start = time.time()

matches = a[mask.T.dot(mask.dot(a == b) == 4).astype(bool)]

print(time.time() - start)

It's typically faster here, but it depends on the number of repeated chunks, etc.

Comments
