Given the following array:

a = np.array([[1,2,3],[4,5,6],[7,8,9]])

[[1 2 3]
 [4 5 6]
 [7 8 9]]

How can I replace certain values with other values?

bad_vals = [4, 2, 6]
update_vals = [11, 1, 8]

I currently use:

for idx, v in enumerate(bad_vals):
    a[a==v] = update_vals[idx]

Which gives:

[[ 1  1  3]
 [11  5  8]
 [ 7  8  9]]

But it is rather slow for large arrays with many values to be replaced. Is there any good alternative?

The input array can be changed to anything (list of list/tuples) if this might be necessary to access certain speedy black magic.

EDIT:

Based on the great answers from @Divakar and @charlysotelo, I did a quick comparison on my real use-case data using the benchit package. My input data array has a ratio of roughly 100:1 (rows:columns), and the number of replacement values is on the order of 3 x the number of rows.

Functions:

# current approach
def enumerate_values(a, bad_vals, update_vals):
    for idx, v in enumerate(bad_vals):
        a[a==v] = update_vals[idx]
    return a

# provided solution @Divakar
def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out

# provided solution @charlysotelo
def vectorize_values(a, bad_vals, update_vals):
    bad_to_good_map = {}
    for idx, bad_val in enumerate(bad_vals):
        bad_to_good_map[bad_val] = update_vals[idx]
    f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
    a = f(a)

    return a

# define benchit input functions
import benchit
funcs = [enumerate_values, map_values, vectorize_values]

# define benchit input variables to bench against
in_ = {
    n: (
        np.random.randint(0,n*10,(n,int(n * 0.01))), # array
        np.random.choice(n*10, n*3,replace=False), # bad_vals
        np.random.choice(n*10, n*3) # update_vals
    ) 
    for n in [300, 1000, 3000, 10000, 30000]
}

# do the bench
# note: timing the slow approaches (my own function here) takes a while
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, grid=False)

(Plot: benchit timings)

  • Are the values (positive) integral? Can we thus make a list like [0,1,1,3,11,5,8] (that thus defines the mapping) Commented Jun 3, 2020 at 21:52
  • You could use answers from Fast replacement of values in a numpy array by making a dictionary from bad_vals and update_vals. Commented Jun 3, 2020 at 21:57
  • @WillemVanOnsem Yes, all values are positive integers Commented Jun 3, 2020 at 21:58
  • @Divakar, yes will give! Had to sleep a bit.. Commented Jun 4, 2020 at 7:17

2 Answers

Here's one way based on the hinted mapping array method for positive numbers -

def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out

Sample run -

In [94]: a
Out[94]: 
array([[1, 2, 1],
       [4, 5, 6],
       [7, 1, 1]])

In [95]: bad_vals
Out[95]: [4, 2, 6]

In [96]: update_vals
Out[96]: [11, 1, 8]

In [97]: map_values(a, bad_vals, update_vals)
Out[97]: 
array([[ 1,  1,  1],
       [11,  5,  8],
       [ 7,  1,  1]])
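
To make the mechanism explicit: the mapping array is a table indexed by value, so indexing it with `a` translates every element in a single fancy-indexing pass. The sketch below is a variant of the function above, not the same code: it seeds the table with `np.arange` so every index maps to itself before the replacements are written. The name `map_values_arange` is made up for illustration, and it assumes non-negative integers, as confirmed in the comments on the question.

```python
import numpy as np

def map_values_arange(a, bad_vals, update_vals):
    # Hypothetical variant for illustration; assumes non-negative integers.
    a = np.asarray(a)
    # The table must be long enough to be indexed by the largest value seen.
    n = max(int(a.max()), int(max(bad_vals))) + 1
    table = np.arange(n)                          # identity: value v maps to v
    table[np.asarray(bad_vals)] = update_vals     # overwrite the bad entries
    return table[a]                               # one lookup per element

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
result = map_values_arange(a, [4, 2, 6], [11, 1, 8])
print(result)
# [[ 1  1  3]
#  [11  5  8]
#  [ 7  8  9]]
```

Two things to keep in mind. The table applies all replacements at once, whereas the loop applies them one after another: with bad_vals = [1, 2] and update_vals = [2, 3], the loop turns [1, 2] into [3, 3], while the table gives [2, 3]. Also, the table's length grows with the largest value, so this suits reasonably small integers.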

Benchmarking

# Original soln
def replacevals(a, bad_vals, update_vals):
    out = a.copy()
    for idx, v in enumerate(bad_vals):
        out[out==v] = update_vals[idx]
    return out

The given sample had a 2D input of shape nxn with n values to be replaced. Let's set up input datasets with the same structure.

Using benchit package (few benchmarking tools packaged together; disclaimer: I am its author) to benchmark proposed solutions.

import benchit
funcs = [replacevals, map_values]
in_ = {n:(np.random.randint(0,n*10,(n,n)),np.random.choice(n*10,n,replace=False),np.random.choice(n*10,n)) for n in [3,10,100,1000,2000]}
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, save='timings.png')
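
If benchit is not installed, a rough comparison can be sketched with the standard-library `timeit` module. This is only an approximation of the benchmark above, under a few assumptions of mine: a single size `n = 200`, a fixed seed, and replacement values drawn from a range disjoint from `bad_vals`, so that the loop cannot re-replace a value it has just written and both functions return the same array.

```python
import timeit
import numpy as np

def replacevals(a, bad_vals, update_vals):
    out = a.copy()
    for idx, v in enumerate(bad_vals):
        out[out == v] = update_vals[idx]
    return out

def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals)) + 1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    return mapar[a]

n = 200  # arbitrary size, chosen so the sketch runs quickly
rng = np.random.default_rng(0)
a = rng.integers(0, n * 10, size=(n, n))
bad_vals = rng.choice(n * 10, n, replace=False)
# Assumption: replacements come from a disjoint range, so no chained replacement.
update_vals = rng.integers(n * 10, n * 20, size=n)

same = np.array_equal(replacevals(a, bad_vals, update_vals),
                      map_values(a, bad_vals, update_vals))
t_loop = timeit.timeit(lambda: replacevals(a, bad_vals, update_vals), number=3)
t_map = timeit.timeit(lambda: map_values(a, bad_vals, update_vals), number=3)
print(f"same result: {same}, loop: {t_loop:.4f}s, map: {t_map:.4f}s")
```

The absolute numbers depend on the machine and on `n`, so treat them as indicative only.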



15 Comments

This is a really nice solution. It is 740X faster than my solution for my real use case. Thanks for sharing this. Also nice benchit package. Let me try to see if I can combine the other solution (which was 55X faster than my approach) in a chart and update my question with this. Thanks again!
@Mattijn Yeah you can just add any other approach into funcs = [replacevals, map_values] with the function name(s). Should be convenient that way. Would like to see your chart(s), if you would like to share.
@Divakar--Benchit looks interesting. How does benchit compare to Perfplot which I have used? Any advantages/disadvantages?
@Divakar--OK, will give it a try for my next benchmark. Two advantages I see benchit has are: 1) it shows the test environment information on the top left of the screen, 2) it has a nicer grid (horizontal & vertical) to display the results.
@Divakar--Thanks! I was able to run your basic test on the online Python with the new release. Adding `bench = "^0.0.3"` to the Python spec file is needed for it to load benchit and its dependencies, although it still loads bench-it also.

This really depends on the size of your array, and the size of your mappings from bad to good integers.

For a larger number of bad-to-good mappings, the method below is better:

import numpy as np
import time

ARRAY_ROWS = 10000
ARRAY_COLS = 1000

NUM_MAPPINGS = 10000

bad_vals = np.random.rand(NUM_MAPPINGS)
update_vals = np.random.rand(NUM_MAPPINGS)

bad_to_good_map = {}
for idx, bad_val in enumerate(bad_vals):
    bad_to_good_map[bad_val] = update_vals[idx]

# np.vectorize with mapping
# Takes about 4 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
print (time.time())
a = f(a)
print (time.time())


# Your way
# Takes about 60 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
print (time.time())
for idx, v in enumerate(bad_vals):
    a[a==v] = update_vals[idx]
print (time.time())

Running the code above, the np.vectorize(lambda) way finished in less than 4 seconds, whereas your way took almost 60 seconds. However, with NUM_MAPPINGS set to 100, your method takes less than a second for me, which is faster than the 2 seconds for the np.vectorize way.
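
For values that are floats or large, sparse integers, where a table indexed by value does not apply, one vectorized alternative is a sorted lookup with `np.searchsorted`. This is a different technique from the dictionary approach above, and I have not timed it against it. The function name `replace_sorted` is made up for illustration, and the sketch assumes that `bad_vals` contains no duplicates and that matches are exact.

```python
import numpy as np

def replace_sorted(a, bad_vals, update_vals):
    # Hypothetical helper; assumes bad_vals has no duplicates.
    a = np.asarray(a)
    bad_vals = np.asarray(bad_vals)
    update_vals = np.asarray(update_vals)
    if bad_vals.size == 0:
        return a.copy()
    order = np.argsort(bad_vals)
    sorted_bad = bad_vals[order]
    sorted_upd = update_vals[order]
    # Position where each element of a would be inserted into sorted_bad.
    pos = np.searchsorted(sorted_bad, a)
    pos = np.clip(pos, 0, sorted_bad.size - 1)
    # An element is "bad" only if the value at that position equals it exactly.
    hit = sorted_bad[pos] == a
    out = a.copy()
    out[hit] = sorted_upd[pos[hit]]
    return out

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
result = replace_sorted(a, [4, 2, 6], [11, 1, 8])
print(result)
# [[ 1  1  3]
#  [11  5  8]
#  [ 7  8  9]]
```

Like a lookup table, this applies all replacements in one pass, so a replacement value that happens to equal another bad value is not replaced a second time. The output keeps the dtype of `a`.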

1 Comment

Thanks a lot for sharing your solution, which provided a 55X speedup compared to my solution on my real data. While amazing, the solution provided by @Divakar had a speedup of 741X. Thanks again!
