First, if the length of the input array a is large and close is relatively small, your proposed algorithm may be numerically unstable.
That being said, here are some ideas that reduce the time complexity from O(N^3) to O(N) (for an approximate implementation) or O(N^2) (for an equivalent implementation). For N=100, this gives a speedup up to a factor of 6000 for some choices of arr and close.
Consider an input array arr = [a,b,c,d], and suppose that close > d - a. In this case, the algorithm proceeds as follows:
[a,b,c,d]
[(a+b)/2,(b+c)/2,(c+d)/2]
[(a+2b+c)/4,(b+2c+d)/4]
[(a+3b+3c+d)/8]
One can recognize that if [x_1, x_2, ..., x_n] is a maximal contiguous subarray of arr s.t. x_i - x_{i-1} < close, then [x_1, x_2, ..., x_n] eventually evaluates to (sum_{k=0}^{k=n} x_k * c_{n,k})/(2^(n-1)) where c_{n,k} is the binomial coefficient n choose k.
This gives an O(N) implementation as follows:
import numpy as np
from scipy.stats import binom
from scipy.special import comb
def binom_mean(arr, scipy_cutoff=64):
"""
Given an array arr, returns an average of arr
weighted by binomial coefficients.
"""
n = arr.shape[0]
if arr.shape[0] == 1:
return arr[0]
# initializing a scipy binomial random variable can be slow
# so, if short runs are likely, we can speed things up
# by doing explicit computations
elif n < scipy_cutoff:
return np.average(arr, weights=comb(n-1, np.arange(n), exact=False))
else:
f = binom(n-1, 0.5).pmf
return np.average(arr, weights=f(np.arange(n)))
def reshape_arr_binom(arr, close):
d = np.ediff1d(arr, to_begin=0) < close
close_chunks = np.split(arr, np.where(~d)[0])
return np.fromiter(
(binom_mean(c) for c in close_chunks),
dtype=np.float
)
The result is within 10e-15 of your implementation for np.random.seed(0);N=1000;cost=1/N;arr=np.random.rand(N). However, for large N, this may not be meaningful unless cost is small. For the above parameter values, this is 270 times faster than the original code on my machine.
However, if we choose a modest value of N = 100 and set close to a large value like 1, the speedup is by a factor of 6000. This is because for large values of close, the original implementation is O(N^3); specifically, a.replace is potentially called O(N^2) times and has a cost O(N). So, maximal speedup is achieved when contiguous elements are likely to be close.
For the reference, here is an O(N^2) implementation that is equivalent to your code (I do not recommend using this in practice).
import pandas as pd
import numpy as np
np.random.seed(0)
def custom_avg(arr, indices, close):
new_indices = list()
last = indices[-1]
for i in indices:
if arr[i] - arr[i-1] < close:
new_indices.append(i)
avg = (arr[i-1] + arr[i]) / 2
arr[i-1] = avg
if i != last and arr[i+1] - arr[i] >= close:
arr[i] = avg
return new_indices
def filter_indices(indices):
new_indices = list()
second_dups = list()
# handle empty index case
if not indices:
return new_indices, second_dups
for i, j in zip(indices, indices[1:]):
if i + 1 == j:
# arr[i] is guaranteed to be different from arr[i-1]
new_indices.append(i)
else:
# arr[i+1] is guaranteed to be a duplicate of arr[i]
second_dups.append(i)
second_dups.append(indices[-1])
return new_indices, second_dups
def reshape_arr_(arr, close):
indices = range(1, len(arr))
dup_mask = np.zeros(arr.shape, bool)
while indices:
indices, second_dups = filter_indices(custom_avg(arr, indices, close))
# print(f"n_inds = {len(indices)};\tn_dups = {len(second_dups)}")
dup_mask[second_dups] = True
return np.unique(arr[~dup_mask])
The basic ideas are the following:
First, consider two adjacent elements (i,j) with j = i + 1. If arr[j] - arr[i] >= close in current iteration, arr[j] - arr[i] >= close also holds after the current iteration. This is because arr[i] can only decrease and arr[j] can only increase. So, if (i,j) pair is not averaged in the current iteration, it will not be averaged in any of the subsequent iterations. So, we can avoid looking at (i,j) in the future.
Second, if (i,j) is averaged and (i+1,j+1) is not, we know that arr[i] is a duplicate of arr[j]. Also, the last modified element in each iteration is always a duplicate.
Based on these observations, we need to process fewer and fewer indices in each iteration. The worst case is still O(N^2), which can be witnessed by setting close = arr.max() - arr.min() + 1.
Some benchmarks:
from timeit import timeit
make_setup = """
from __main__ import np, pd, reshape_arr, reshape_arr_, reshape_arr_binom
np.random.seed(0)
arr = np.sort(np.unique(np.random.rand({N})))
close = {close}""".format
def benchmark(N, close):
np.random.seed(0)
setup = make_setup(N=N, close=close)
print('Original:')
print(timeit(
stmt='reshape_arr(pd.Series(arr.copy()), close)',
# setup='from __main__ import reshape_arr; import pandas as pd',
setup=setup,
number=1,
))
print('Quadratic:')
print(timeit(
stmt='reshape_arr_(arr.copy(), close)',
setup=setup,
number=10,
))
print('Binomial:')
print(timeit(
stmt='reshape_arr_binom(arr.copy(), close)',
setup=setup,
number=10,
))
if __name__ == '__main__':
print('N=10_000, close=1/N')
benchmark(10_000, 1/10_000)
print('N=100, close=1')
benchmark(100, 1)
# N=10_000, close=1/N
# Original:
# 14.855983458999999
# Quadratic:
# 0.35902471400000024
# Binomial:
# 0.7207887170000014
# N=100, close=1
# Original:
# 4.132993569
# Quadratic:
# 0.11140068399999947
# Binomial:
# 0.007650813999998007
The following table shows how the number of pairs we need to look at in the quadratic algorithm goes down each iteration.
n_inds = 39967; n_dups = 23273
n_inds = 25304; n_dups = 14663
n_inds = 16032; n_dups = 9272
n_inds = 10204; n_dups = 5828
n_inds = 6503; n_dups = 3701
n_inds = 4156; n_dups = 2347
n_inds = 2675; n_dups = 1481
n_inds = 1747; n_dups = 928
n_inds = 1135; n_dups = 612
n_inds = 741; n_dups = 394
n_inds = 495; n_dups = 246
n_inds = 327; n_dups = 168
n_inds = 219; n_dups = 108
n_inds = 145; n_dups = 74
n_inds = 95; n_dups = 50
n_inds = 66; n_dups = 29
n_inds = 48; n_dups = 18
n_inds = 36; n_dups = 12
n_inds = 26; n_dups = 10
n_inds = 20; n_dups = 6
n_inds = 15; n_dups = 5
n_inds = 10; n_dups = 5
n_inds = 6; n_dups = 4
n_inds = 3; n_dups = 3
n_inds = 1; n_dups = 2
n_inds = 0; n_dups = 1
next_itemwill override the one fromprevious_item. Is this desired? And why are you sorting the array on every iteration? Replacing with the average does not invalidate the sorting (i.e. you only need to sort once at the beginning).array[i]gets replaced with the prev. average butnext_itemgets replaced nevertheless with its own average. Consequently, as an example, the following series will be reduced to[4.5]:pd.Series(list(range(10)), index=range(10)). Can you confirm this is desired?