(EDITED)
(EDIT2: added a more specialized JIT-ed version to address issues when using np.sort() with numba.)
(EDIT3: included timing for the recursive approach with median pivoting from @hilberts_drinking_problem's answer)
I am not 100% sure what you are after, because the first two lines of your code seem to be doing nothing, but following @hilberts_drinking_problem's comment I have edited my answer. I assume that you have a typo and:
sum_ = np.sum(arr[:i])
should be:
sum_ = np.sum(asorted[:i])
Then, your solution can be written as a function like:
import numpy as np


def min_sum_threshold_orig(arr, threshold=0.5):
    idx = np.argsort(arr)
    arr_sorted = arr[idx][::-1]
    sum_ = 0
    i = 0
    while sum_ < threshold:
        i += 1
        sum_ = np.sum(arr_sorted[:i])
    return i
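As a quick sanity check, here is a hypothetical usage on a small made-up array that already sums to 1 (the function is repeated so the snippet runs standalone):

```python
import numpy as np

def min_sum_threshold_orig(arr, threshold=0.5):
    # Sort descending and re-sum the head slice until the threshold is met.
    idx = np.argsort(arr)
    arr_sorted = arr[idx][::-1]
    sum_ = 0
    i = 0
    while sum_ < threshold:
        i += 1
        sum_ = np.sum(arr_sorted[:i])
    return i

arr = np.array([0.1, 0.2, 0.3, 0.4])
print(min_sum_threshold_orig(arr))  # 2: the two largest values (0.4 + 0.3) reach 0.5
```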
However:

- Instead of np.argsort() and indexing you could use np.sort() directly
- There is no need to compute the entire sum at each iteration; you could instead reuse the sum from the previous iteration
- Using a while loop is risky, because if threshold is sufficiently high (> 1.0 with your assumption) the loop will never end
Addressing these points one gets to:
def min_sum_threshold(arr, threshold=0.5):
    arr = np.sort(arr)[::-1]
    sum_ = 0
    for i in range(arr.size):
        sum_ += arr[i]
        if sum_ >= threshold:
            break
    return i + 1
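To illustrate the last bullet point, a small sketch (values made up) showing that the for-loop version simply returns the full size when the threshold is unreachable, instead of hanging:

```python
import numpy as np

def min_sum_threshold(arr, threshold=0.5):
    # Accumulate the sorted (descending) values; the loop is bounded by arr.size.
    arr = np.sort(arr)[::-1]
    sum_ = 0
    for i in range(arr.size):
        sum_ += arr[i]
        if sum_ >= threshold:
            break
    return i + 1

arr = np.array([0.1, 0.2, 0.3, 0.4])
print(min_sum_threshold(arr, 0.5))  # 2
print(min_sum_threshold(arr, 2.0))  # 4: the threshold exceeds the total, but the loop still ends
```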
In the above, the explicit looping becomes the bottleneck. A good way of addressing that is to use numba:
import numba as nb
min_sum_threshold_nbn = nb.jit(min_sum_threshold)
min_sum_threshold_nbn.__name__ = 'min_sum_threshold_nbn'
But this may not be the most efficient approach as numba is relatively slow when new arrays are being created.
A possibly faster approach is to use arr.sort() instead of np.sort() because that is in-place, thus avoiding the creation of a new array:
@nb.jit
def min_sum_thres_nb_inplace(arr, threshold=0.5):
    arr.sort()
    sum_ = 0
    for i in range(arr.size - 1, -1, -1):
        sum_ += arr[i]
        if sum_ >= threshold:
            break
    return arr.size - i
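Note that arr.sort() modifies the caller's array as a side effect. A minimal sketch of the same logic without numba (sample values assumed) makes the mutation visible:

```python
import numpy as np

def min_sum_thres_inplace(arr, threshold=0.5):
    # Same logic as the numba version: sort ascending in place,
    # then accumulate from the largest element backwards.
    arr.sort()
    sum_ = 0
    for i in range(arr.size - 1, -1, -1):
        sum_ += arr[i]
        if sum_ >= threshold:
            break
    return arr.size - i

arr = np.array([0.4, 0.1, 0.3, 0.2])
result = min_sum_thres_inplace(arr)
print(result)  # 2
print(arr)     # [0.1 0.2 0.3 0.4] -- the input array has been reordered
```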
Alternatively, one could JIT only the portion of code after the sorting:
@nb.jit
def _min_sum_thres_nb(arr, threshold=0.5):
    sum_ = 0.0
    for i in range(arr.size):
        sum_ += arr[i]
        if sum_ >= threshold:
            break
    return i + 1


def min_sum_thres_nb(arr, threshold=0.5):
    return _min_sum_thres_nb(np.sort(arr)[::-1], threshold)
The difference between the two will be minimal for larger inputs. For smaller ones, min_sum_thres_nb() will be dominated by the comparatively slow extra function call.
Because of the pitfalls of benchmarking functions which modify their input, min_sum_thres_nb_inplace() is omitted from the benchmarks, with the understanding that for very small inputs it is as fast as min_sum_threshold_nbn() and for larger ones it has essentially the same performance as min_sum_thres_nb().
Alternatively, one could use a vectorized approach like in @yatu's answer:
def min_sum_threshold_np_sum(arr, threshold=0.5):
    return np.sum(np.cumsum(np.sort(arr)[::-1]) < threshold) + 1
or, better, use np.searchsorted() which avoids creating an unnecessary temporary array with the comparison:
def min_sum_threshold_np_ss(arr, threshold=0.5):
    return np.searchsorted(np.cumsum(np.sort(arr)[::-1]), threshold) + 1
or, assuming that sorting the entire array is unnecessarily expensive, use np.partition() to extract only the m largest values (with non-negative values summing to 1, the m = int(n * threshold) + 1 largest elements have mean at least 1 / n, so their sum is at least m / n > threshold):

def min_sum_threshold_np_part(arr, threshold=0.5):
    n = arr.size
    m = min(int(n * threshold) + 1, n)
    part_arr = np.partition(arr, n - m)[n - m:]
    return np.searchsorted(np.cumsum(np.sort(part_arr)[::-1]), threshold) + 1
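The key fact used here is that np.partition() places the m largest values in the last m slots (in arbitrary order), which is all the cumulative sum needs; a small sketch with made-up values:

```python
import numpy as np

arr = np.array([0.05, 0.30, 0.10, 0.25, 0.15, 0.15])  # sums to 1.0
n = arr.size
m = int(n * 0.5) + 1  # with threshold=0.5: the 4 largest values are guaranteed to suffice

# The last m slots of the partitioned array hold the m largest values.
part_arr = np.partition(arr, n - m)[n - m:]
print(np.sort(part_arr))  # the 4 largest values: 0.15, 0.15, 0.25, 0.3
```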
An even more sophisticated approach using recursion and median pivoting is:

def min_sum_thres_rec(arr, threshold=0.5, cutoff=64):
    n = arr.size
    if n <= cutoff:
        return np.searchsorted(np.cumsum(np.sort(arr)[::-1]), threshold) + 1
    else:
        m = n // 2
        partitioned = np.partition(arr, m)
        low = partitioned[:m]
        high = partitioned[m:]
        sum_high = np.sum(high)
        if sum_high >= threshold:
            return min_sum_thres_rec(high, threshold)
        else:
            return min_sum_thres_rec(low, threshold - sum_high) + high.size
(the last three are adapted from @hilberts_drinking_problem's answer)
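A quick consistency check of the recursive version against the plain sorted baseline on random inputs (the function is repeated so the snippet runs standalone; a fixed seed is assumed for reproducibility):

```python
import numpy as np

def min_sum_thres_rec(arr, threshold=0.5, cutoff=64):
    # Median pivoting: either the answer lies entirely in the upper half,
    # or all of the upper half is taken and the rest comes from the lower half.
    n = arr.size
    if n <= cutoff:
        return np.searchsorted(np.cumsum(np.sort(arr)[::-1]), threshold) + 1
    m = n // 2
    partitioned = np.partition(arr, m)
    low, high = partitioned[:m], partitioned[m:]
    sum_high = np.sum(high)
    if sum_high >= threshold:
        return min_sum_thres_rec(high, threshold)
    return min_sum_thres_rec(low, threshold - sum_high) + high.size

rng = np.random.RandomState(42)
for _ in range(10):
    arr = rng.randint(1, 10000, 1000)
    arr = arr / np.sum(arr)
    baseline = np.searchsorted(np.cumsum(np.sort(arr)[::-1]), 0.5) + 1
    assert min_sum_thres_rec(arr) == baseline
print("recursive version agrees with the sorted baseline")
```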
Benchmarking these with input generated as follows:

def gen_input(n, a=0, b=10000):
    arr = np.random.randint(a, b, n)
    arr = arr / np.sum(arr)
    return arr
gives the following:

These indicate that for small enough inputs the numba approach is the fastest, but as soon as the input exceeds ~600 elements for the naive approach or ~900 for the optimized one, the NumPy approaches which use np.partition(), while being less memory efficient, are faster.
Eventually, beyond ~4000 elements, min_sum_thres_rec() becomes faster than all the other proposed methods.
It may be possible to write a faster numba-based implementation of this method.
Note that the optimized numba approach is strictly faster than the naive NumPy approaches that were tested.
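For reference, a rough sketch of how such timings could be collected with timeit (only one NumPy-based candidate is shown; the exact numbers depend on the machine):

```python
import timeit
import numpy as np

def gen_input(n, a=0, b=10000):
    arr = np.random.randint(a, b, n)
    return arr / np.sum(arr)

def min_sum_threshold_np_ss(arr, threshold=0.5):
    return np.searchsorted(np.cumsum(np.sort(arr)[::-1]), threshold) + 1

# The other functions from the answer can be added to this dict the same way.
funcs = {'min_sum_threshold_np_ss': min_sum_threshold_np_ss}

for n in (100, 1000, 10000):
    arr = gen_input(n)
    for name, func in funcs.items():
        t = timeit.timeit(lambda: func(arr), number=100)
        print(f'n={n:6d}  {name}  {t / 100 * 1e6:9.2f} us/call')
```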