Numpy: For every element in one array, find the index in another array

Question

I have two 1D arrays, x & y, one smaller than the other. I'm trying to find the index of every element of y in x.

I've found two naive ways to do this: the first is slow, and the second is memory-intensive.

The slow way

indices = []
for iy in y:
    # np.where returns a tuple of index arrays; take the first match in x
    indices.append(np.where(x == iy)[0][0])

The memory hog

xe = np.outer([1,]*len(x), y)
ye = np.outer(x, [1,]*len(y))
junk, indices = np.where(np.equal(xe, ye))

Is there a faster or less memory-intensive approach? Ideally the search would take advantage of the fact that we are searching for many things in a list rather than one, which makes it slightly more amenable to parallelization. Bonus points if you don't assume that every element of y is actually in x.

RomanS · Accepted Answer · 2016-04-15 07:22:08Z

57

I want to suggest a one-line solution:

indices = np.where(np.in1d(x, y))[0]

The result is an array of indices into x for those elements of x that also occur in y.

If a boolean mask over x is all you need, np.in1d can be used without numpy.where.
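A minimal sketch of what this one-liner returns, using the reversed arrays from the comments below. It uses np.isin, which the NumPy documentation recommends over np.in1d for new code; the variable names are illustrative only.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Boolean mask over x: True where the x element also occurs in y
mask = np.isin(x, y)

# Positions in x of those elements, in the order they appear in x
indices = np.where(mask)[0]
print(indices)  # [0 1 2 3 4]
```

The indices follow the order of x, not the order of y, so this answers "which positions of x hold a value that occurs in y" rather than "where is each y value located in x".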

answered Apr 15, 2016 at 7:22

RomanS


4 Comments

Wilmer E. Henao Over a year ago

This should be the chosen answer. It works even when values of x are repeated or non-existent. The answer involving searchsorted is complex, weird, unnatural.

ccbunney Over a year ago

Whilst this does return the indices of the elements in y that exist in x, the order of the returned indices does not match the order of the values in y. Consider: x=np.array([1,2,3,4,5]); y=np.array([5,4,3,2,1]). The above method returns array([0,1,2,3,4]), so x[0]=1 is matched to y[0]=5, which is not what is wanted...

hermidalc Over a year ago

in1d() solutions just do not work. Take y = np.array([10, 5, 5, 1, 'auto', 6, 'auto', 1, 5, 10, 10, 'auto']) and x = np.array(['auto', 5, 6, 10, 1]). You would expect [3, 1, 1, 4, 0, 2, 0, 4, 1, 3, 3, 0]. np.where(np.in1d(x, y))[0] doesn't yield that.

Brian Pollack Over a year ago

This simply states whether the elements in x exist in y, and then gives the corresponding index in x. It does not give the corresponding index in y for each item in x.

j-i-l · Accepted Answer · 2015-08-03 01:25:55Z

52

As Joe Kington said, searchsorted() can search for elements very quickly. To deal with elements that are not in x, you can compare the search result against the original y and create a masked array:

import numpy as np
x = np.array([3,5,7,1,9,8,6,6])
y = np.array([2,1,5,10,100,6])

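# Sort x, remembering each sorted element's original position
index = np.argsort(x)
sorted_x = x[index]
# Insertion point of each y value in the sorted x
sorted_index = np.searchsorted(sorted_x, y)

# Map back to positions in x; "clip" keeps out-of-range insertion points valid
yindex = np.take(index, sorted_index, mode="clip")
# Mask positions where the candidate does not actually equal the y value
mask = x[yindex] != y

result = np.ma.array(yindex, mask=mask)
print(result)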

the result is:

[-- 3 1 -- -- 6]

answered Nov 24, 2011 at 3:02 by HYRY; edited Aug 3, 2015 at 1:25 by j-i-l

1 Comment

Nikolay Hidalgo Diaz Over a year ago

Or, if -1 at the positions of y values that were not found is acceptable: result = yindex.copy(); result[mask] = -1; return result
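The masked-array answer and the -1 suggestion in this comment can be combined into one small function. This is only a sketch: the name find_indices and the fill parameter are hypothetical, and a stable sort is assumed so that duplicates in x resolve to their first occurrence.

```python
import numpy as np

def find_indices(x, y, fill=-1):
    """For each value in y, return its index in x, or `fill` if absent."""
    x = np.asarray(x)
    y = np.asarray(y)
    # Stable sort so equal values keep their original relative order
    order = np.argsort(x, kind="stable")
    # Insertion point of each y value in the sorted x
    pos = np.searchsorted(x[order], y)
    # Clip so values larger than max(x) do not index out of bounds
    candidate = np.take(order, pos, mode="clip")
    # A candidate is a real match only if x actually holds the y value there
    found = x[candidate] == y
    return np.where(found, candidate, fill)

x = np.array([3, 5, 7, 1, 9, 8, 6, 6])
y = np.array([2, 1, 5, 10, 100, 6])
print(find_indices(x, y))  # [-1  3  1 -1 -1  6]
```

The result has the same length and order as y, which is what distinguishes it from the in1d-based approach.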

Joe Kington · Accepted Answer · 2011-11-24 02:45:59Z

39

How about this?

It does assume that every element of y is in x (and will return results even for elements that aren't!), but it is much faster.

import numpy as np

# Generate some example data...
x = np.arange(1000)
np.random.shuffle(x)
y = np.arange(100)

# Actually perform the operation...
xsorted = np.argsort(x)
ypos = np.searchsorted(x[xsorted], y)
indices = xsorted[ypos]
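The two ways this can go wrong when y is not a subset of x are easy to reproduce. The sketch below uses small made-up arrays to show a silently wrong index for a missing value inside the range of x, and an IndexError for a value above max(x).

```python
import numpy as np

x = np.array([10, 30, 20])
xsorted = np.argsort(x)          # [0 2 1]

# 25 is not in x but lies inside its range: an index is returned anyway,
# pointing at the next larger value (30)
silent = xsorted[np.searchsorted(x[xsorted], [25])]
print(silent, x[silent])         # [1] [30]

# 99 is larger than every x value: the insertion point equals len(x),
# so the final lookup is out of bounds
try:
    xsorted[np.searchsorted(x[xsorted], [99])]
    raised = False
except IndexError:
    raised = True
print(raised)                    # True
```

Comparing x[indices] against y afterwards, as the masked-array answer does, catches the first case; clipping the insertion points avoids the second.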

answered Nov 24, 2011 at 2:45

Joe Kington


2 Comments

Chris Over a year ago

Fantastic. Much faster indeed. I'm including assert na.all(na.intersect1d(x,y) == na.sort(y)) to restrict the input so that y is a subset of x. Thanks!

root-11 Over a year ago

This works if y is a subset of x. Otherwise IndexError will be raised.

hermidalc · Accepted Answer · 2019-07-22 17:16:31Z

15

I think this version is clearer than indices = np.where(y[:, None] == x[None, :])[1], because x does not need to be reshaped to 2D explicitly:

np.where(y.reshape(y.size, 1) == x)[1]

I found this type of solution to work best because, unlike the searchsorted() or in1d() based solutions that I have seen posted here or elsewhere, it works with duplicates and doesn't care whether anything is sorted. This was important to me because I wanted x to be in a particular custom order.
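As an illustration, here is the reshape-based lookup applied to data like the example from the comments on the in1d answer. All values are written as strings here, which is an assumption made to keep the dtype explicit.

```python
import numpy as np

x = np.array(['auto', '5', '6', '10', '1'])
y = np.array(['10', '5', '5', '1', 'auto', '6',
              'auto', '1', '5', '10', '10', 'auto'])

# Compare every y value (as a column) against every x value (as a row),
# then read off the column index of each match
indices = np.where(y.reshape(-1, 1) == x)[1]
print(indices.tolist())  # [3, 1, 1, 4, 0, 2, 0, 4, 1, 3, 3, 0]
```

Because every y value occurs exactly once in x here, the output has one entry per element of y, in the order of y.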

edited Jul 22, 2019 at 17:16

answered Sep 12, 2018 at 14:31

hermidalc


3 Comments

Mad Physicist Over a year ago

Clearer does not mean less inefficient.

Roman J. Over a year ago

I guess you can make a further simplification y.reshape(-1, 1)

Dmitriy Work Over a year ago

Actually np.where(y[:, None] == x)[1] is enough.

Jun Saito · Accepted Answer · 2016-11-06 12:24:23Z

8

I would just do this:

indices = np.where(y[:, None] == x[None, :])[1]

Unlike your memory-hog way, this uses broadcasting to generate the 2D boolean array directly, without first creating 2D copies of both x and y.
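A small sketch of what the broadcast comparison builds and returns, with made-up arrays. It shows that the intermediate boolean array still has len(y) * len(x) elements, and that the output length differs from len(y) when x contains duplicates.

```python
import numpy as np

x = np.array([7, 3, 7, 5])
y = np.array([5, 7])

# Broadcasting a (2, 1) column against a (4,) row gives a (2, 4) boolean array
eq = y[:, None] == x
print(eq.shape, eq.nbytes)   # (2, 4) 8

# rows index into y, cols index into x; 7 occurs twice in x, so it matches twice
rows, cols = np.where(eq)
print(rows.tolist())         # [0, 1, 1]
print(cols.tolist())         # [3, 0, 2]
```

Keeping the row indices alongside the column indices makes it possible to tell which y value each match belongs to when values are duplicated or missing.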

answered Nov 6, 2016 at 12:24

Jun Saito


3 Comments

romeric Over a year ago

For the record, this hogs the memory as well.

Jun Saito Over a year ago

Yes, what I meant is it is less memory-hogging. I think my version is a good compromise in keeping the code clean while taking up less memory.

Alex Kaszynski Over a year ago

This approach clocks in at 1000x slower than the accepted answer.

Eelco Hoogendoorn · Accepted Answer · 2016-04-15 10:42:26Z

The numpy_indexed package (disclaimer: I am its author) contains a function that does exactly this:

import numpy_indexed as npi
indices = npi.indices(x, y, missing='mask')

It will currently raise a KeyError if not all elements in y are present in x; but perhaps I should add a kwarg so that one can elect to mark such items with a -1 or something.

It should have the same efficiency as the currently accepted answer, since the implementation is along similar lines. numpy_indexed is however more flexible, and also allows searching for indices of rows of multidimensional arrays, for instance.

EDIT: I've changed the handling of missing values; the 'missing' kwarg can now be set to 'raise', 'ignore' or 'mask'. In the latter case you get a masked array of the same length as y, on which you can call .compressed() to get the valid indices. Note that there is also npi.contains(x, y) if this is all you need to know.

NSVR · Accepted Answer · 2022-08-30 12:03:21Z

2

Another solution would be:

a = np.array(['Bob', 'Alice', 'John', 'Jack', 'Brian', 'Dylan',])
z = ['Bob', 'Brian', 'John']
for i in z:
    print(np.argwhere(i==a))

answered Aug 30, 2022 at 12:03

NSVR


Kaushal Gupta · Accepted Answer · 2019-07-22 17:20:59Z

1

Use this line of code:

indices = np.where(y[:, None] == x[None, :])[1]

answered Jul 22, 2019 at 17:20

Kaushal Gupta


Stefan · Accepted Answer · 2022-03-13 18:25:35Z

My solution can additionally handle a multidimensional x. By default, it will return a standard numpy array of corresponding y indices in the shape of x.

If you can't assume that y is a subset of x, then set masked=True to return a masked array (this has a performance penalty). Otherwise, you will still get indices for elements not contained in y, but they probably won't be useful to you.

The answers by HYRY and Joe Kington were helpful in making this.

import numpy as np

# For each element of ndarray x, return index of corresponding element in 1d array y
# If y contains duplicates, the index of the last duplicate is returned
# Optionally, mask indices where the x element does not exist in y

def matched_indices(x, y, masked=False):
    # Flattened x
    x_flat = x.ravel()

    # Indices to sort y
    y_argsort = y.argsort()

    # Indices in sorted y of corresponding x elements, flat
    x_in_y_sort_flat = y.searchsorted(x_flat, sorter=y_argsort)

    # Indices in y of corresponding x elements, flat
    x_in_y_flat = y_argsort[x_in_y_sort_flat]

    if not masked:
        # Reshape to shape of x
        return x_in_y_flat.reshape(x.shape)

    else:
        # Check for inequality at each y index to mask invalid indices
        mask = x_flat != y[x_in_y_flat]
        # Reshape to shape of x
        return np.ma.array(x_in_y_flat.reshape(x.shape), mask=mask.reshape(x.shape))
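A usage sketch for a 2D x. To keep the example self-contained it uses a compact, hypothetical variant (matched_indices_clipped) rather than the function above; the one deliberate difference, an assumption of this sketch, is that insertion points are clipped so that values larger than max(y) are masked instead of indexing out of bounds.

```python
import numpy as np

def matched_indices_clipped(x, y):
    """Masked indices into 1-D y for every element of x, in the shape of x."""
    x = np.asarray(x)
    y = np.asarray(y)
    order = np.argsort(y, kind="stable")
    # Insertion point of each x value in sorted y, without copying y
    pos = np.searchsorted(y, x.ravel(), sorter=order)
    # pos == len(y) for values above max(y); clip keeps the lookup in bounds
    idx = np.take(order, pos, mode="clip").reshape(x.shape)
    # Mask every position where y does not actually hold the x value
    return np.ma.array(idx, mask=(y[idx] != x))

x2d = np.array([[20, 10], [99, 30]])
y1d = np.array([30, 10, 20])
res = matched_indices_clipped(x2d, y1d)
print(res.filled(-1).tolist())  # [[2, 1], [-1, 0]]
```

The result keeps the shape of x, and the masked entry marks the value 99, which does not occur in y.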


Selva · Accepted Answer · 2017-12-21 11:11:07Z

0

A more direct solution, that doesn't expect the array to be sorted.

import numpy as np
import pandas as pd
A = pd.Series(['amsterdam', 'delhi', 'chromepet', 'tokyo', 'others'])
B = pd.Series(['chromepet', 'tokyo', 'tokyo', 'delhi', 'others'])

# Find index position of B's items in A
B.map(lambda x: np.where(A==x)[0][0]).tolist()

Result is:

[2, 3, 3, 1, 4]
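If pandas is already in use, pd.Index.get_indexer is another way to express the same lookup without a Python-level lambda. This is a sketch rather than part of the answer above: it assumes the values in A are unique, and 'berlin' is a made-up entry added to show how a missing value is reported.

```python
import pandas as pd

A = pd.Series(['amsterdam', 'delhi', 'chromepet', 'tokyo', 'others'])
B = pd.Series(['chromepet', 'tokyo', 'tokyo', 'delhi', 'berlin'])

# get_indexer needs unique values in A; values missing from A map to -1
positions = pd.Index(A).get_indexer(B)
print(positions.tolist())  # [2, 3, 3, 1, -1]
```

The -1 marker means the result should be checked before it is used for indexing, since -1 is itself a valid position in Python and NumPy.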

answered Dec 21, 2017 at 11:11

Selva
