1

I have a 2-d array of shape(nx3), say arr1. Now consider a second array, arr2, of same shape as arr1 and has the same rows. However, the rows are not in the same order. I want to get the indices of each row in arr2 as they are in arr1. I am looking for fastest Pythonic way to do this as n is of the order of 10,000.

For example:

arr1 = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2 = numpy.array([[4, 5, 6], [7, 8, 9], [1, 2, 3]])
ind = [1, 2, 0]

Note that the row elements need not be integers. In fact they are floats. I have found related answers that use numpy.searchsorted but they work for 1-D arrays only.

1
  • Can you share your code ? Commented Jul 24, 2015 at 3:47

3 Answers 3

3

If you are ensure that arr2 is a permutation of arr1, you can use sort to get the index:

import numpy as np

n = 100000
a1 = np.random.randint(0, 100, size=(n, 3))
a2 = a1[np.random.permutation(np.arange(n))]
idx1 = np.lexsort(a1.T)
idx2 = np.lexsort(a2.T)
idx = idx2[np.argsort(idx1)]
np.all(a1 == a2[idx])

if they don't have exact the same values, you can use kdTree in scipy:

n = 100000

a1 = np.random.uniform(0, 100, size=(n, 3))
a2 = a1[np.random.permutation(np.arange(n))] + np.random.normal(0, 1e-8, size=(n, 3))
from scipy import spatial
tree = spatial.cKDTree(a2)
dist, idx = tree.query(a1)
np.allclose(a1, a2[idx])
Sign up to request clarification or add additional context in comments.

1 Comment

The second array is not a permutation of the first. The first array is fed through a function which performs some mathematical operations on the row elements to get arr2, and as the elements are floats the values may not be exactly the same. Thanks for pointing me in the direction of kdTree. This works!
0

Before we begin, you should mention whether duplicates can exist in your list.

That said, the method I would use is numpy's where function within a list comprehension like so:

[numpy.where(arr1 == x)[0][0] for x in arr2]

Though this might not be the fastest way. Another method might include building a dictionary from the rows in arr1 somehow and then looking them up with arr2.

1 Comment

No the rows are unique, there are no repetitions. However, I wanted to avoid using the for loop. I know this is a pythonic and very readable piece of code but the size of my arrays makes it very slow.
0

While this is very similar to: Find indexes of matching rows in two 2-D arrays I don't have the reputation to leave a comment.

However, based on that comment there appear to be two clear possibilities for a large matrix like yours:

def find_rows_searchsorted(a, b):
    dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))

    a_view = np.ascontiguousarray(a).view(dt).ravel()
    b_view = np.ascontiguousarray(b).view(dt).ravel()

    sort_b = np.argsort(b_view)
    where_in_b = np.searchsorted(b_view, a_view, sorter=sort_b)
    return np.take(sort_b, where_in_b)

def find_rows_iterative(a, b):
    answer = np.empty(a.shape[0], dtype=int)
    for idx, row in enumerate(a):
        answer[idx] = np.where(np.equal(b, row).all(1))[0]

    return answer

def find_rows_list_comprehension(a, b):
    return [np.where(b == x)[0][0] for x in a]

However, a little timing with a matrix of 10000 elements shows that the searchsorted based method is significantly faster than the brute force iterative method:

arr1 = np.random.randn(10000, 3)
shuffled_inds = np.arange(arr1.shape[0])
np.random.shuffle(shuffled_inds)
arr2 = arr1[new_inds, :]

np.array_equal(find_rows_searchsorted(arr2, arr1), new_inds)
>> True

np.array_equal(find_rows_iterative(arr2, arr1), new_inds)
>> True

np.array_equal(find_rows_list_comprehension(arr2, arr1), new_inds)
>> True

%timeit find_rows_iterative(arr2, arr1)
>> 1 loops, best of 3: 2.62 s per loop

%timeit find_rows_list_comprehension(arr2, arr1)
>> 1 loops, best of 3: 1.61 s per loop

%timeit find_rows_searchsorted(arr2, arr1)
>> 100 loops, best of 3: 6.53 ms per loop

Based off of HYRY's great responses I also added lexsort and kdball tests as well as a test of argsort for structured arrays.

def find_rows_lexsort(a, b):
    idx1 = np.lexsort(a.T)
    idx2 = np.lexsort(b.T)
    return idx2[np.argsort(idx1)]

def find_rows_argsort(a, b):
    a_rec  = np.core.records.fromarrays(a.transpose())
    b_rec  = np.core.records.fromarrays(b.transpose())
    idx1 = a_rec.argsort(order=a_rec.dtype.names).argsort()
    return b_rec.argsort(order=b_rec.dtype.names)[idx1]

def find_rows_kdball(a, b):
    from scipy import spatial
    tree = spatial.cKDTree(b)
    _, idx = tree.query(a)
    return idx

%timeit find_rows_lexsort(arr2, arr1)
>> 100 loops, best of 3: 4.63 ms per loop

%timeit find_rows_argsort(arr2, arr1)
>> 100 loops, best of 3: 7.37 ms per loop

%timeit find_rows_kdball(arr2, arr1)
>> 100 loops, best of 3: 18.5 ms per loop

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.