8

I have two numpy arrays of integers, both of length several hundred million. Within each array values are unique, and each is initially unsorted.

I would like the indices to each that yield their sorted intersection. For example:

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

Then the sorted intersection of these is [4, 5, 11], and so we want the indices that turn each of x and y into that array, so we want it to return:

mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])

since then x[mx] == y[my] == np.intersect1d(x, y)

The only solution we have so far involves three different argsorts, so it seems that is unlikely to be optimal.

Each value represents a galaxy, in case that makes the problem more fun.

2
  • See also: stackoverflow.com/q/11483863/1461210 Commented Nov 4, 2015 at 21:04
  • You should accept Rory Yorke's answer - it's a lot faster than mine Commented Nov 28, 2015 at 20:32

4 Answers 4

3

Here's an option based on intersect1d's implementation, which is fairly straightforward. It requires one call to argsort.

The admittedly simplistic test passes.

import numpy as np


def my_intersect(x, y):
    """my_intersect(x, y) -> xm, ym
    x, y: 1-d arrays of unique values
    xm, ym: indices into x and y giving sorted intersection
    """
    # basic idea taken from numpy.lib.arraysetops.intersect1d
    aux = np.concatenate((x, y))
    sidx = aux.argsort()
    # Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary
    inidx = aux[sidx[1:]] == aux[sidx[:-1]]

    # quicksort is not stable, so must do some work to extract indices
    # (if stable, sidx[inidx.nonzero()]  would be for x)
    # interlace the two sets of indices, and check against lengths
    xym = np.vstack((sidx[inidx.nonzero()],
                     sidx[1:][inidx.nonzero()])).T.flatten()

    xm = xym[xym < len(x)]
    ym = xym[xym >= len(x)] - len(x)

    return xm, ym


def check_my_intersect(x, y):
    mx, my = my_intersect(x, y)
    assert (x[mx] == np.intersect1d(x, y)).all()

    # not really necessary: np.intersect1d returns a sorted list
    assert (x[mx] == sorted(x[mx])).all()
    assert (x[mx] == y[my]).all()


def random_unique_unsorted(n):
    while True:
        x = np.unique(np.random.randint(2*n, size=n))
        if len(x):
            break
    np.random.shuffle(x)
    return x


x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

check_my_intersect(x, y)


for i in range(20):
    x = random_unique_unsorted(100+i)
    y = random_unique_unsorted(200+i)
    check_my_intersect(x, y)

Edit: "Note" comment was confusing (Used ... as speech ellipsis, forgot it was a Python operator too).

Sign up to request clarification or add additional context in comments.

Comments

3

You could also use np.searchsorted, like so -

def searchsorted_based(x,y):

    # Get argsort for both x and y
    xsort_idx = x.argsort()
    ysort_idx = y.argsort()

    # Sort x and y and store them
    X = x[xsort_idx]
    Y = y[ysort_idx]

    # Find positions of Y in X and the matches by the positions that 
    # shift between 'left' and 'right' based searches. 
    # Use the matches posotions to get corresponding argsort for X.
    x1 = np.searchsorted(X,Y,'left') 
    x2 = np.searchsorted(X,Y,'right') 
    out1 = xsort_idx[x1[x2 != x1]]

    # Repeat for X in Y findings
    y1 = np.searchsorted(Y,X,'left') 
    y2 = np.searchsorted(Y,X,'right')
    out2 = ysort_idx[y1[y2 != y1]]

    return out1, out2

Sample run -

In [100]: x = np.array([4, 1, 10, 5, 8, 13, 11])
     ...: y = np.array([20, 5, 4, 9, 11, 7, 25])
     ...: 

In [101]: searchsorted_based(x,y)
Out[101]: (array([0, 3, 6]), array([2, 1, 4]))

Comments

3

For a pure numpy solution you could do something like this:

  1. Use np.unique to get the unique values and corresponding indices in x and y separately:

    # sorted unique values in x and y and the indices corresponding to their first
    # occurrences, such that u_x == x[u_idx_x]
    u_x, u_idx_x = np.unique(x, return_index=True)
    u_y, u_idx_y = np.unique(y, return_index=True)
    
  2. Find the intersection of the unique values using np.intersect1d:

    # we can assume_unique, which can be faster for large arrays
    i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
    
  3. Finally, use np.in1d to select only the indices that correspond to unique values in x or y that also happen to be in the intersection of x and y:

    # it is also safe to assume_unique here
    i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
    i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
    

To pull all that together into a single function:

def intersect_indices(x, y):
    u_x, u_idx_x = np.unique(x, return_index=True)
    u_y, u_idx_y = np.unique(y, return_index=True)
    i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
    i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
    i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
    return i_idx_x, i_idx_y

For example:

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

i_idx_x, i_idx_y = intersect_indices(x, y)

print(i_idx_x, i_idx_y)
# (array([0, 3, 6]), array([2, 1, 4]))

Speed test:

In [1]: k = 1000000

In [2]: %%timeit x, y = np.random.randint(k, size=(2, k))
intersect_indices(x, y)
....: 
1 loops, best of 3: 597 ms per loop

Update:

I initially missed the fact that in your case both x and y contain only unique values. Taking that into account, it's possible to do slightly better by using an indirect sort:

def intersect_indices_unique(x, y):
    u_idx_x = np.argsort(x)
    u_idx_y = np.argsort(y)
    i_xy = np.intersect1d(x, y, assume_unique=True)
    i_idx_x = u_idx_x[x[u_idx_x].searchsorted(i_xy)]
    i_idx_y = u_idx_y[y[u_idx_y].searchsorted(i_xy)]
    return i_idx_x, i_idx_y

Here's a more realistic test case, where x and y both contain unique (but partially overlapping) values:

In [1]: n, k = 10000000, 1000000

In [2]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices(x, y)
....: 
1 loops, best of 3: 593 ms per loop

In [3]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices_unique(x, y)
....: 
1 loops, best of 3: 453 ms per loop

@Divakar's solution is very similar in terms of performance:

In [4]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
searchsorted_based(x, y)
....: 
1 loops, best of 3: 472 ms per loop

Comments

0

Maybe a pure Python solutions using a dict works for you:

def indices_from_values(a, intersect):
    idx = {value: index for index, value in enumerate(a)}
    return np.array([idx[x] for x in intersect])

intersect = np.intersect1d(x, y)
mx = indices_from_values(x, intersect)
my = indices_from_values(y, intersect)
np.allclose(x[mx], y[my]) and np.allclose(x[mx], np.intersect1d(x, y))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.