7

Suppose I have an example of numpy array:

import numpy as np
X = np.array([2,5,0,4,3,1])

And I also have a list of arrays, like:

A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]

I want to leave only these items of each list that are also in X. I expect also to do it in a most efficient/common way.

Solution I have tried so far:

  1. Sort X using X.sort().
  2. Find locations of items of each array in X using:

    locations = [np.searchsorted(X, n) for n in A]
    
  3. Leave only proper ones:

    masks = [X[locations[i]] == A[i] for i in range(len(A))]
    result = [A[i][masks[i]] for i in range(len(A))]
    

But it doesn't work because locations of third array is out of bounds:

locations = [array([0, 0, 2], dtype=int64), array([0, 1, 2, 3, 4, 5], dtype=int64), array([2, 5, 4, 6], dtype=int64)]

How to solve this issue?

Update

I ended up with idx[idx==len(Xs)] = 0 solution. I've also noticed two different approaches posted between the answers: transforming X into set vs np.sort. Both of them has plusses and minuses: set operations uses iterations which is quite slow in compare with numpy methods; however np.searchsorted speed increases logarithmically unlike acceses of set items which is instant. That why I decided to compare performance using data with huge sizes, especially 1 million items for X, A[0], A[1], A[2].

enter image description here

enter image description here

2 Answers 2

2

One idea would be less compute and minimal work when looping. So, here's one with those in mind -

a = np.concatenate(A)
m = np.isin(a,X)
l = np.array(list(map(len,A)))
a_m = a[m]
cut_idx = np.r_[0,l.cumsum()]
l_m = np.add.reduceat(m,cut_idx[:-1])
cl_m = np.r_[0,l_m.cumsum()]
out = [a_m[i:j] for (i,j) in zip(cl_m[:-1],cl_m[1:])]

Alternative #1 :

We can also use np.searchsorted to get the isin mask, like so -

Xs = np.sort(X)
idx = np.searchsorted(Xs,a)
idx[idx==len(Xs)] = 0
m = Xs[idx]==a

Another way with np.intersect1d

If you are looking for the most common/elegant one, think it would be with np.intersect1d -

In [43]: [np.intersect1d(X,A_i) for A_i in A]
Out[43]: [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 4, 5])]

Solving your issue

You can also solve your out-of-bounds issue, with a simple fix -

for l in locations:
    l[l==len(X)]=0
Sign up to request clarification or add additional context in comments.

2 Comments

I'm sure using intersect1d is a good expression of my idea but I work with pretty huge X and A so that's why I have decided to sort X first.
@mathfux I think you'll find sorting very large arrays is much slower than using set functions.
2

How about this, very simple and efficent:

import numpy as np
X = np.array([2,5,0,4,3,1])
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]

X_set = set(X)
A = [np.array([a for a in arr if a in X_set]) for arr in A]
#[array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]

According to the docs, set operations all have O(1) complexity, therefore the overall is O(N)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.