How to replace values in array numpy using np.where for loop?

Question

I have a list that contains a number of different labels and I would like to change the labels to something different.

Where, original = [1,1,1,2,2,2,3,3,3,3]

And I want to change each values with, modified = [1,3,8]

so the output would look be original_modified = [1,1,1,3,3,3,8,8,8,8]

So far what I have done is this for loop below:

for x, y in zip(np.unique(original), modified):
    original_modified = np.where(original == x, original, y)

However, I am not getting the intended results as the output is incorrect, and I'm not quite sure as to why.

I understand I could get a way with a simple for loop with if conditions however, I am not sure if this would be a very dynamic solution.

Any help is appreciated, thanks.

You shouldn't use np.unique, as it sorts the values. In this specific example, it works (original is ordered), but in general it doesn't. Try with [2,2,2,10,10,10,0,0,0] and [1,3,8]. Also, np.unique() is relatively slow (because it sorts). — Pierre D
– Pierre D, Commented Apr 6, 2022 at 14:35
Are the labels always numbers like 1,2,3? consecutive integers? If you have arrays instead of lists, can you use original as indices: `modified[original-1]' — hpaulj
– hpaulj, Commented Apr 6, 2022 at 15:44

Pierre D · Accepted Answer · 2022-04-06 14:31:18Z

Without numpy:

out = list(map(dict(zip({k:0 for k in original}.keys(), modified)).get, original))

>>> out
[1, 1, 1, 3, 3, 3, 8, 8, 8, 8]

Explanation

So why does it work?

{k:0 for k in original} is a way to find the distinct values in original, in insertion order (unlike set where order is undefined). It is a dict where the keys are the distinct values, and the value is always 0.
once we have that, we take the keys() and zip with the modified values into a dict. E.g.
```
>>> dict(zip({k:0 for k in original}.keys(), modified))
{1: 1, 2: 3, 3: 8}
```
we then use that as a map to replace the original values with map(_the_mapping_dict_.get, original).

Addendum: alternatives and performance

Here are a few other ways to achieve the same result, and how long they take.

def pure_py(om):
    """Pure Python"""
    original, modified = om
    return list(map(dict(zip({k: 0 for k in original}.keys(), modified)).get, original))

def py_with_pd_unique(om):
    """Using a dict for replacement, but using pd.unique() to get the unique values"""
    original, modified = om
    return list(map(dict(zip(pd.unique(original), modified)).get, original))

def np_select(om):
    """Using np.select and assuming inputs are np.array"""
    original, modified = om
    return np.select([original == v for v in pd.unique(original)], modified, original)

def vect_dict_get(om):
    """Using a vectorized dict.get()"""
    original, modified = om
    d = dict(zip(pd.unique(original), modified))
    return np.apply_along_axis(np.vectorize(d.get), 0, original)

Then:

import perfplot
from math import isqrt

def setup(n):
    original = np.random.randint(0, isqrt(n), n)
    modified = np.arange(len(pd.unique(original)))
    return original, modified

perfplot.show(
    setup=setup,
    n_range=[4 ** k for k in range(4, 11)],
    kernels=[
        pure_py,
        py_with_pd_unique,
        np_select,
        vect_dict_get,
    ],
    xlabel='len(original)',
)

Conclusion: py_with_pd_unique is the fastest through the range. For 1M elements in original, it is almost twice as fast as the rest:

o, m = setup(1_000_000)

%timeit pure_py((o, m))
# 209 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit py_with_pd_unique((o, m))
# 108 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

joanis · Accepted Answer · 2022-04-06 15:19:49Z

1

I found two problems in your code.

In your loop, you should read from original_modified instead of starting from original again at each iteration.
You reversed the last two arguments to np.where().

This code works:

original_modified = original
for x, y in zip(np.unique(original), modified):
    original_modified = np.where(original == x, y, original_modified)

PS: As @PierreD pointed out, np.unique() might not be the right choice, since it sorts its results. If you need to preserve the order in which elements first appear in original, use pd.unique() instead.

edited Apr 6, 2022 at 15:19

answered Apr 6, 2022 at 13:02

joanis

13k23 gold badges38 silver badges50 bronze badges

1 Comment

Pierre D Over a year ago

this will give the wrong answer: np.unique() sorts the values. pd.unique() doesn't (keeps the insertion order, and is faster: O[n] instead of O[n log n]).

Davide_sd · Accepted Answer · 2022-04-06 12:58:03Z

1

Are you really constrained in using np.where? If not, an alternative solution might be:

import numpy as np
original = np.array([1,1,1,2,2,2,3,3,3,3])
modified = original.copy()
d = {2: 3, 3: 8}
for k, v in d.items():
    modified[original == k] = v
print(modified)
# array([1, 1, 1, 3, 3, 3, 8, 8, 8, 8])

answered Apr 6, 2022 at 12:58

Davide_sd

13.7k3 gold badges26 silver badges35 bronze badges

Comments

Alexander L. Hayes · Accepted Answer · 2022-04-06 13:00:19Z

1

np.where will return a boolean index to where the condition is satisfied.

Indexing + assignment should do this:

import numpy as np

original = np.array([1,1,1,2,2,2,3,3,3,3])
out = np.empty(original.shape, dtype=int)
modified = [1, 3, 8]

for x, y in zip(np.unique(original), modified):
    out[np.where(original == x)] = y

print(out)
# [1 1 1 3 3 3 8 8 8 8]

answered Apr 6, 2022 at 13:00

Alexander L. Hayes

4,3034 gold badges16 silver badges40 bronze badges

Collectives™ on Stack Overflow

How to replace values in array numpy using np.where for loop?

4 Answers 4

Explanation

Addendum: alternatives and performance

Comments

1 Comment

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Explanation

Addendum: alternatives and performance

Comments

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related