
I have a function in Python that:

  1. Iterates over a sorted, distinct array of floats
  2. Gets the previous and next item
  3. Finds out if they are within a certain range of each other
  4. Averages them, and replaces the original values with the averaged value
  5. Reruns the loop until there are no more changes
  6. Returns a distinct array

The issue is that it is extremely slow. The array "a" could be 100k+ elements and it takes 7-10 minutes to complete.

I found that I needed to iterate over the array again after the initial pass because, after averaging, the averaged values could sometimes be within range of each other and need to be averaged again.

I thought about breaking it into chunks and using multiprocessing, but my concern is that the end of one chunk and the beginning of the next chunk would also need to be averaged.

def reshape_arr(a, close):
    """Iterates through 'a' to find values +- 'close', and averages them, then returns a distinct array of values"""
    flag = True
    while flag:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    #find matching values in a, and replace with the average
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)

            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    # find matching values in a, and replace with the average
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a.unique()

a is a pandas.Series from a DataFrame of anywhere between 0 and 200k rows, and close is an int (100, for example).

It works, just very slowly.

  • If both the previous and the next item are close, then the average with next_item will override the one from previous_item. Is this desired? And why are you sorting the array on every iteration? Replacing with the average does not invalidate the sorting (i.e. you only need to sort once at the beginning). Commented Aug 9, 2019 at 14:57
  • If all 3 are close, the desired outcome is that all 3 are averaged. I was reassigning the array with the latest values and, in case they are not in order, sorting them. I would definitely still have to use '.unique()', but I could refactor and only sort once. Commented Aug 9, 2019 at 15:20
  • Rather than testing if previous_item is not None and if next_item is not None for every single element of every iteration of the loop, maybe you could run your loop from the second item to the second-to-last item where you are assured there is a previous and a next item? Commented Aug 9, 2019 at 15:41
  • @KevenScharaswak I'm still not sure if I fully understand that replacement logic. If the checks for both the previous and the next element are successful then array[i] gets replaced with the prev. average but next_item gets replaced nevertheless with its own average. Consequently, as an example, the following series will be reduced to [4.5]: pd.Series(list(range(10)), index=range(10)). Can you confirm this is desired? Commented Aug 9, 2019 at 23:00
  • @a_guest You were right, I didn't need to replace both items in each iteration. When I really broke it down, I only needed to replace the previous_item on the first iteration and the next_item on the last iteration. Commented Aug 12, 2019 at 14:23

3 Answers


First, if the length of the input array a is large and close is relatively small, your proposed algorithm may be numerically unstable.

That being said, here are some ideas that reduce the time complexity from O(N^3) to O(N) (for an approximate implementation) or O(N^2) (for an equivalent implementation). For N=100, this gives a speedup of up to a factor of 6000 for some choices of arr and close.

Consider an input array arr = [a,b,c,d], and suppose that close > d - a. In this case, the algorithm proceeds as follows:

[a,b,c,d]
[(a+b)/2,(b+c)/2,(c+d)/2]
[(a+2b+c)/4,(b+2c+d)/4]
[(a+3b+3c+d)/8]

One can recognize that if [x_1, x_2, ..., x_n] is a maximal contiguous subarray of arr s.t. x_i - x_{i-1} < close, then [x_1, x_2, ..., x_n] eventually evaluates to (sum_{k=1}^{n} x_k * C(n-1, k-1)) / 2^(n-1), where C(n-1, k-1) is the binomial coefficient "n-1 choose k-1".
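As a quick sanity check of this claim, here is a minimal sketch (the values a < b < c < d are made up and chosen so that every adjacent gap is smaller than close) that mirrors the reduction steps shown above for [a,b,c,d]:

import numpy as np

# hypothetical values forming a single "close" chunk: every adjacent gap < close
a, b, c, d = 1.0, 1.2, 1.3, 1.5
vals = np.array([a, b, c, d])

# repeatedly replace the array with the averages of adjacent pairs,
# exactly as in the [a,b,c,d] reduction above
while len(vals) > 1:
    vals = (vals[:-1] + vals[1:]) / 2

print(vals[0])                  # 1.25
print((a + 3*b + 3*c + d) / 8)  # 1.25, the binomially weighted average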

This gives an O(N) implementation as follows:

import numpy as np
from scipy.stats import binom
from scipy.special import comb


def binom_mean(arr, scipy_cutoff=64):
    """
    Given an array arr, returns an average of arr
    weighted by binomial coefficients.
    """
    n = arr.shape[0]
    if arr.shape[0] == 1:
        return arr[0]
    # initializing a scipy binomial random variable can be slow
    # so, if short runs are likely, we can speed things up
    # by doing explicit computations
    elif n < scipy_cutoff:
        return np.average(arr, weights=comb(n-1, np.arange(n), exact=False))
    else:
        f = binom(n-1, 0.5).pmf
        return np.average(arr, weights=f(np.arange(n)))


def reshape_arr_binom(arr, close):
    d = np.ediff1d(arr, to_begin=0) < close
    close_chunks = np.split(arr, np.where(~d)[0])
    return np.fromiter(
        (binom_mean(c) for c in close_chunks), 
        dtype=float
    )

The result is within 10e-15 of your implementation for np.random.seed(0); N = 1000; close = 1/N; arr = np.random.rand(N). However, for large N, this may not be meaningful unless close is small. For the above parameter values, this is 270 times faster than the original code on my machine.

However, if we choose a modest value of N = 100 and set close to a large value like 1, the speedup is by a factor of 6000. This is because for large values of close, the original implementation is O(N^3); specifically, a.replace is potentially called O(N^2) times and has a cost O(N). So, maximal speedup is achieved when contiguous elements are likely to be close.


For reference, here is an O(N^2) implementation that is equivalent to your code (I do not recommend using this in practice).

import pandas as pd
import numpy as np

np.random.seed(0)


def custom_avg(arr, indices, close):
    new_indices = list()
    last = indices[-1]
    for i in indices:
        if arr[i] - arr[i-1] < close:
            new_indices.append(i)
            avg = (arr[i-1] + arr[i]) / 2
            arr[i-1] = avg
            if i != last and arr[i+1] - arr[i] >= close:
                arr[i] = avg
    return new_indices


def filter_indices(indices):
    new_indices = list()
    second_dups = list()
    # handle empty index case
    if not indices:
        return new_indices, second_dups

    for i, j in zip(indices, indices[1:]):
        if i + 1 == j:
            # arr[i] is guaranteed to be different from arr[i-1]
            new_indices.append(i)
        else:
            # arr[i+1] is guaranteed to be a duplicate of arr[i]
            second_dups.append(i)
    second_dups.append(indices[-1])
    return new_indices, second_dups


def reshape_arr_(arr, close):
    indices = range(1, len(arr))
    dup_mask = np.zeros(arr.shape, bool)
    while indices:
        indices, second_dups = filter_indices(custom_avg(arr, indices, close))
        # print(f"n_inds = {len(indices)};\tn_dups = {len(second_dups)}")
        dup_mask[second_dups] = True
    return np.unique(arr[~dup_mask])

The basic ideas are the following:

First, consider two adjacent elements (i,j) with j = i + 1. If arr[j] - arr[i] >= close in the current iteration, then arr[j] - arr[i] >= close also holds after the current iteration. This is because arr[i] can only decrease and arr[j] can only increase. So, if the (i,j) pair is not averaged in the current iteration, it will not be averaged in any of the subsequent iterations, and we can avoid looking at (i,j) in the future.

Second, if (i,j) is averaged and (i+1,j+1) is not, we know that arr[i] is a duplicate of arr[j]. Also, the last modified element in each iteration is always a duplicate.

Based on these observations, we need to process fewer and fewer indices in each iteration. The worst case is still O(N^2), which can be witnessed by setting close = arr.max() - arr.min() + 1.
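For illustration, here is a minimal usage sketch of reshape_arr_ on a small, hand-picked input (the array and close value are made up; the expected result was worked out by hand):

import numpy as np

arr = np.array([0.0, 0.4, 0.5, 2.0, 2.3])
# the first three values collapse to a single average and the last two to another,
# leaving 0.325 and 2.15
print(reshape_arr_(arr.copy(), close=0.6))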

Some benchmarks:

from timeit import timeit


make_setup = """
from __main__ import np, pd, reshape_arr, reshape_arr_, reshape_arr_binom
np.random.seed(0)
arr = np.sort(np.unique(np.random.rand({N})))
close = {close}""".format

def benchmark(N, close):
    np.random.seed(0)
    setup = make_setup(N=N, close=close)
    print('Original:')
    print(timeit(
        stmt='reshape_arr(pd.Series(arr.copy()), close)',
        # setup='from __main__ import reshape_arr; import pandas as pd',
        setup=setup,
        number=1,
        ))
    print('Quadratic:')
    print(timeit(
        stmt='reshape_arr_(arr.copy(), close)',
        setup=setup,
        number=10,
        ))
    print('Binomial:')
    print(timeit(
        stmt='reshape_arr_binom(arr.copy(), close)',
        setup=setup,
        number=10,
        ))

if __name__ == '__main__':
    print('N=10_000, close=1/N')
    benchmark(10_000, 1/10_000)
    print('N=100, close=1')
    benchmark(100, 1)

# N=10_000, close=1/N
# Original:
# 14.855983458999999
# Quadratic:
# 0.35902471400000024
# Binomial:
# 0.7207887170000014
# N=100, close=1
# Original:
# 4.132993569
# Quadratic:
# 0.11140068399999947
# Binomial:
# 0.007650813999998007

The following table shows how the number of pairs we need to look at in the quadratic algorithm goes down each iteration.

n_inds = 39967;         n_dups = 23273
n_inds = 25304;         n_dups = 14663
n_inds = 16032;         n_dups = 9272
n_inds = 10204;         n_dups = 5828
n_inds = 6503;          n_dups = 3701
n_inds = 4156;          n_dups = 2347
n_inds = 2675;          n_dups = 1481
n_inds = 1747;          n_dups = 928
n_inds = 1135;          n_dups = 612
n_inds = 741;           n_dups = 394
n_inds = 495;           n_dups = 246
n_inds = 327;           n_dups = 168
n_inds = 219;           n_dups = 108
n_inds = 145;           n_dups = 74
n_inds = 95;            n_dups = 50
n_inds = 66;            n_dups = 29
n_inds = 48;            n_dups = 18
n_inds = 36;            n_dups = 12
n_inds = 26;            n_dups = 10
n_inds = 20;            n_dups = 6
n_inds = 15;            n_dups = 5
n_inds = 10;            n_dups = 5
n_inds = 6;             n_dups = 4
n_inds = 3;             n_dups = 3
n_inds = 1;             n_dups = 2
n_inds = 0;             n_dups = 1
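To reproduce a similar trace, uncomment the print statement inside reshape_arr_ and run it on a large random input; the exact input used for the table is not stated, so the array and close below are only an assumption and will give different counts:

import numpy as np

np.random.seed(0)
arr = np.unique(np.random.rand(100_000))
# uncomment the print(f"n_inds = ...") line inside reshape_arr_ to see the per-iteration counts
reshape_arr_(arr, close=1/arr.size)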

5 Comments

I can't run your test_old() function on my machine. Leaving the pd.Series in place it gives me KeyError: 0 at return arr[0] which makes sense since if arr is a Series it should probably be iloc. If I remove the transformation to Series the two functions test_new and test_old don't give the same result. Could you confirm that the code in your answer is the exact code you ran on your machine and that it does run without error? If so, what versions of numpy and pandas do you have?
This probably happened because I named my function reshape_arr, which overwrote the OP's definition.
I used the O(n^2) approach and it is working very well within limits (A thing of beauty). Thank you very much for the help.
@hilberts_drinking_problem Indeed, I managed to reproduce your timings (approximately). For N = 10_000 I got a factor of ~8 speedup with my approach compared to the "Quadratic" one from your answer; for N = 100 it had similar timing. The approach with binomial coefficients is an interesting observation, but it seems to only be effective for large values of close; anyway, nice solution! For more benchmarks, please see this community wiki.
Very nice. Looks like numpy wins again.

You can use the following function to produce similar output to yours (with the difference that the result from your function is unsorted, since a is never sorted outside the loop and pd.Series.unique returns values in order of appearance; if this is actually desired, check the second function). Sorting the array on every loop iteration is not required, since replacing two subsequent (unique) items in a sorted array with their average does not invalidate the sorting. And since on every iteration the comparison with next_item will be the comparison with prev_item during the next iteration, you can just compare subsequent elements in a pairwise manner once.

def solve_sorted(a, close):
    """Returns the reduced unique values as a sorted array."""
    a = a.sort_values().values.astype(float)
    while True:
        a = np.unique(a)
        comp = a[1:] - a[:-1] < close
        if not comp.sum():
            break
        indices = np.tile(comp.nonzero()[0][:, None], (1, 2))
        indices[:, 1] += 1
        avg = a[indices].mean(axis=1)
        a[indices.ravel()] = np.repeat(avg, 2)
    return np.unique(a)

If it is important to preserve the original order of elements then you can store the reverse sorting indices once at the beginning to restore the original order at the end:

def solve_preserve_order(a, close):
    """Returns the reduced unique values in their original order."""
    reverse_indices = np.argsort(np.argsort(a.values))
    a = a.sort_values()
    while True:
        b = a.unique()
        comp = b[1:] - b[:-1] < close
        if not comp.sum():
            break
        indices = np.tile(comp.nonzero()[0][:, None], (1, 2))
        indices[:, 1] += 1
        avg = b[indices].mean(axis=1)
        a.replace(b[indices.ravel()], np.repeat(avg, 2), inplace=True)
    return a.iloc[reverse_indices].unique()
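A minimal usage sketch of both functions (the input values and close are made up; the expected outputs were worked out by hand):

import numpy as np
import pandas as pd

s = pd.Series([2.0, 0.5, 0.0, 2.3, 0.4])

print(solve_sorted(s, close=0.6))           # sorted result: 0.325 and 2.15
print(solve_preserve_order(s, close=0.6))   # same values in order of first appearance: 2.15, 0.325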



Performance comparison

Testing the performance of the different presented algorithms for sorted, unique-valued input arrays (code attached below). Functions: OP_reshape_arr, reshape_arr_binom, reshape_arr_, solve_sorted.

Performance scaling with the size of the input array

Using close = 1 / arr.size.

[plot: scaling with array size (1e2 - 1e4)]

[plot: scaling with array size (1e4 - 1e6)]

Scaling with the interval length

Using arr.size == 1_000; close is the interval length.

[plot: scaling with interval length]

Source code

"""Performance plots.
   Assuming a sorted, unique-valued array as an input.
   Function names have format `a<id>_*` where <id> is the answer's id."""

from timeit import timeit
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import perfplot
from scipy.stats import binom
from scipy.special import comb


def OP_reshape_arr(a, close):
    flag = True
    while flag:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)

            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a.unique()


def _binom_mean(arr, scipy_cutoff=64):
    n = arr.shape[0]
    if arr.shape[0] == 1:
        return arr[0]
    elif n < scipy_cutoff:
        return np.average(arr, weights=comb(n-1, np.arange(n), exact=False))
    else:
        f = binom(n-1, 0.5).pmf
        return np.average(arr, weights=f(np.arange(n)))

def a57438948_reshape_arr_binom(arr, close):
    d = np.ediff1d(arr, to_begin=0) < close
    close_chunks = np.split(arr, np.where(~d)[0])
    return np.fromiter(
        (_binom_mean(c) for c in close_chunks), 
        dtype=float
    )


def _custom_avg(arr, indices, close):
    new_indices = list()
    last = indices[-1]
    for i in indices:
        if arr[i] - arr[i-1] < close:
            new_indices.append(i)
            avg = (arr[i-1] + arr[i]) / 2
            arr[i-1] = avg
            if i != last and arr[i+1] - arr[i] >= close:
                arr[i] = avg
    return new_indices

def _filter_indices(indices):
    new_indices = list()
    second_dups = list()
    if not indices:
        return new_indices, second_dups
    for i, j in zip(indices, indices[1:]):
        if i + 1 == j:
            new_indices.append(i)
        else:
            second_dups.append(i)
    second_dups.append(indices[-1])
    return new_indices, second_dups

def a57438948_reshape_arr_(arr, close):
    indices = range(1, len(arr))
    dup_mask = np.zeros(arr.shape, bool)
    while indices:
        indices, second_dups = _filter_indices(_custom_avg(arr, indices, close))
        dup_mask[second_dups] = True
    return np.unique(arr[~dup_mask])


def a57438149_solve_sorted(a, close):
    while True:
        comp = a[1:] - a[:-1] < close
        if not comp.sum():
            break
        indices = np.tile(comp.nonzero()[0][:, None], (1, 2))
        indices[:, 1] += 1
        avg = a[indices].mean(axis=1)
        a[indices.ravel()] = np.repeat(avg, 2)
        a = np.unique(a)
    return a


np.random.seed(0)
a = np.unique(np.random.rand(10_000))
c = 1/a.size
ref = OP_reshape_arr(pd.Series(a.copy()), c)
test = [
    a57438948_reshape_arr_binom(a.copy(), c),
    a57438948_reshape_arr_(a.copy(), c),
    a57438149_solve_sorted(a, c),
]
assert all(x.shape == ref.shape and np.allclose(x, ref) for x in test)

colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

perfplot.bench(
    setup = lambda n: np.random.seed(0) or (np.unique(np.random.rand(n)), 1/n),
    kernels=[
        lambda x: OP_reshape_arr(pd.Series(x[0].copy()), x[1]),
        lambda x: a57438948_reshape_arr_binom(x[0].copy(), x[1]),
        lambda x: a57438948_reshape_arr_(x[0].copy(), x[1]),
        lambda x: a57438149_solve_sorted(x[0], x[1]),
    ],
    labels=['OP_reshape_arr', 'reshape_arr_binom', 'reshape_arr_', 'solve_sorted'],
    n_range=np.logspace(2, 4, 8).astype(int),
    xlabel='size of initial array (before np.unique; using interval length of 1/n)',
    logx=True,
    logy=True,
    colors=colors,
    automatic_order=False,
).plot()
plt.gca().set_xlim([1e2, 1e4])
plt.gca().set_ylim([1e-4, 20])
plt.savefig('scaling_with_array_size.png')
plt.close()

np.random.seed(0)
a = np.unique(np.random.rand(1_000_000))
c = 1/a.size
test = [
    a57438948_reshape_arr_binom(a.copy(), c),
    a57438948_reshape_arr_(a.copy(), c),
    a57438149_solve_sorted(a, c),
]
assert all(x.shape == test[0].shape and np.allclose(x, test[0]) for x in test)

perfplot.bench(
    setup = lambda n: np.random.seed(0) or (np.unique(np.random.rand(n)), 1/n),
    kernels=[
        lambda x: a57438948_reshape_arr_binom(x[0].copy(), x[1]),
        lambda x: a57438948_reshape_arr_(x[0].copy(), x[1]),
        lambda x: a57438149_solve_sorted(x[0], x[1]),
    ],
    labels=['reshape_arr_binom', 'reshape_arr_', 'solve_sorted'],
    n_range=np.logspace(4, 6, 5).astype(int),
    xlabel='size of initial array (before np.unique; using interval length of 1/n)',
    logx=True,
    logy=True,
    colors=colors[1:],
    automatic_order=False,
).plot()
plt.gca().set_xlim([1e4, 1e6])
plt.gca().set_ylim([5e-4, 10])
plt.savefig('scaling_with_array_size_2.png')
plt.close()

perfplot.bench(
    setup = lambda n: np.random.seed(0) or (np.unique(np.random.rand(1_000)), n),
    kernels=[
        lambda x: OP_reshape_arr(pd.Series(x[0].copy()), x[1]),
        lambda x: a57438948_reshape_arr_binom(x[0].copy(), x[1]),
        lambda x: a57438948_reshape_arr_(x[0].copy(), x[1]),
        lambda x: a57438149_solve_sorted(x[0], x[1]),
    ],
    labels=['OP_reshape_arr', 'reshape_arr_binom', 'reshape_arr_', 'solve_sorted'],
    n_range=np.logspace(-6, -2, 16),
    xlabel='length of interval (using array of size 1,000)',
    logx=True,
    logy=True,
    colors=colors,
    automatic_order=False,
).plot()
plt.gca().set_xlim([1e-6, 1e-2])
plt.gca().set_ylim([2e-5, 1e3])
plt.savefig('scaling_with_interval_length.png')
plt.close()

