I have a long 1-d numpy array with some 10% missing values. I want to change its missing values (np.nan) to other values repeatedly. I know of two ways to do this:
data[np.isnan(data)] = 0 or the function
np.copyto(data, 0, where=np.isnan(data))
Sometimes I want to put zeros there, other times I want to restore the nans.
I thought that recomputing the np.isnan function repeatedly would be slow, and it would be better to save the locations of the nans. Some of the timing results of the code below are counter-intuitive.
I ran the following:
import numpy as np
import sys
print(sys.version)
print(sys.version_info)
print(f'numpy version {np.__version__}')
data = np.random.random(100000)
data[data<0.1] = 0
data[data==0] = np.nan
%timeit missing = np.isnan(data)
%timeit wheremiss = np.where(np.isnan(data))
missing = np.isnan(data)
wheremiss = np.where(np.isnan(data))
print("Use missing list store 0")
%timeit data[missing] = 0
data[data==0] = np.nan
%timeit data[wheremiss] = 0
data[data==0] = np.nan
%timeit np.copyto(data, 0, where=missing)
print("Use isnan function store 0")
data[data==0] = np.nan
%timeit data[np.isnan(data)] = 0
data[data==0] = np.nan
%timeit np.copyto(data, 0, where=np.isnan(data))
print("Use missing list store np.nan")
data[data==0] = np.nan
%timeit data[missing] = np.nan
data[data==0] = np.nan
%timeit data[wheremiss] = np.nan
data[data==0] = np.nan
%timeit np.copyto(data, np.nan, where=missing)
print("Use isnan function store np.nan")
data[data==0] = np.nan
%timeit data[np.isnan(data)] = np.nan
data[data==0] = np.nan
%timeit np.copyto(data, np.nan, where=np.isnan(data))
And I got the following output (I have taken the liberty to add numbers to the timing lines, so that I can refer to them later):
3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
numpy version 1.17.1
01. 30 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
02. 219 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use missing list store 0
03. 339 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
04. 26 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
05. 287 µs ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store 0
06. 38.5 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
07. 43.8 µs ± 4.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Use missing list store np.nan
08. 328 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
09. 24.8 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
10. 322 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store np.nan
11. 356 µs ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
12. 300 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So here is the first question. Why would it take nearly 10 times longer to store a np.nan than to store a 0? (compare lines 6 and 7 vs. lines 11 and 12)
Why would it take much longer to use a stored list of missing as compared to recomputing the missing values using the isnan function? (compare lines 3 and 5 vs. 6 and 7)
This is just for curiosity. I can see that the fastest way is to use np.where to get a list of indices (because I have only 10% missing). But if I had many more, things might not be so obvious.