Optimal algorithm for returning top k values from an array of length N

Question

I have an array of n floats, and I wish to return the top k (in my case n ~ 100, k ~ 10)

Is there a known optimal solution path for this problem?

Could someone provide a C algorithm?

EDIT: actually there are two problems here: sorted and unsorted. I am interested in unsorted, which should be faster!

Take a look at the partial quicksort algorithm discussed towards the end of "Efficient selection and partial sorting based on quicksort". — NPE
– NPE, Commented Feb 10, 2011 at 11:41
The corresponding non broken link web.archive.org/web/20160110151823/http://blogs.msdn.com/b/… — Nicolas Filotto
– Nicolas Filotto, Commented Dec 4, 2020 at 19:35

score 54 · Accepted Answer · 2011-02-11 07:41:33Z

54

Method 1

Since k is small, you can use the tournament method to find the kth largest. This method is described in Knuth's Art of Programming, Volume 3, Page 212.

First create a tournament on n-k+2 elements. Something like a knockout tennis tournament. First you split into pairs and compare the members of the pairs (as if those two played a match and one lost). Then the winners, you split in to pairs again and so on, till you have a winner. You can view it as a tree, with the winner at the top.

This takes n-k+1 compares exactly.

Now the winner of these n-k+2 cannot be your kth largest element. Consider its path P up the tournament.

Of the remaining k-2 now pick one, and follow that up the path P which will give you a new largest. Basically you sort of redo the tournament with the previous winner being replaced by one of the k-2 elements. Let P be the path of the new winner. Now pick another from the k-3 and follow the new path up and so on.

At the end after you exhaust the k-2, replace the largest with -infinity and the largest of the tournament will be the kth largest. The elements you have thrown away are the top k-1 elements.

This takes at most n - k + (k-1) [log (n-k+2)] comparisons to find the top k. It uses O(n) memory though.

In terms of number of compares this should likely beat any selection algorithms.

Method 2

As an alternative, you can maintain a min-heap of k elements.

First insert k elements. Then for each element of array, if it is less than the min element of the heap, throw it away. Otherwise, delete-min of the heap and insert the element from the array.

At the end, the heap will contain the top k elements. This will take O(n log k) compares.

Of course, if n is small, just sorting the array should be good enough. The code will be simpler too.

edited Feb 11, 2011 at 7:41

answered Feb 10, 2011 at 18:30

Aryabhatta

Sign up to request clarification or add additional context in comments.

3 Comments

Aryabhatta Over a year ago

@Ohmu: There is some code here: blogs.sun.com/malkit/entry/finding_kth_minimum_partial_ordering but it is probably not exactly as this answer describes... It has some figures though :-)

seh Over a year ago

@Aryabhatta, that link is dead today. Here's the replacement: blogs.oracle.com/malkit/entry/….

Nicolas Filotto Over a year ago

The second link is also dead so let's get it from the web archive web.archive.org/web/20151002100306/https://blogs.oracle.com/…

IVlad · Accepted Answer · 2011-02-10 11:42:02Z

39

You can do this in O(n) by using a selection algorithm. Find the kth largest element with the partition algorithm, then all the elements after it will be larger than it, and those are your top k.

If you need those top k in sorted order, you can then sort them in O(k log k).

answered Feb 10, 2011 at 11:42

IVlad

43.6k13 gold badges115 silver badges184 bronze badges

2 Comments

aviggiano Over a year ago

Please note this algorithm is O(2n), which is O(n) in the end but very slow for many real-world applications.

NickLamp Over a year ago

2n steps is still much faster compared the nlogn or nlogk steps, unless you have an exceedingly small n or k. With log base 2, k would have to be 4 or less for this algorithm to be less efficient than an nlogk solution.

Phil Willoughby · Accepted Answer · 2011-02-10 12:07:54Z

Short answer: no.

Longer answer: yes, several mutually incompatible optimal solutions are known. It depends on n, k and what properties of the array you can guarantee.

If you know nothing about the array the lower bound of complexity is obviously O(n), because all elements of the source array must be examined to see if they fit in the top 10. If you know anything about the source array which allows elements to be safely skipped you should use that knowledge.

Similarly the upper complexity bound is O(n.log(n)) because you could always choose to find the answer by sorting the array (O(n.log(n)) and returning the first 10 items (O(1)).

A linear search comparing each item to the tenth-highest found so far and inserting it at the appropriate place in the list of highest-found-so-far items if needed has similar complexity for the average and best-case scenarios and has a worst-case of O(kn) which is significantly better than O(n-squared). For the sizes you estimate I expect this method to perform well.

If n were much larger (~10000) and k were increased in the same ratio it would probably be worthwhile implementing the quickselect algorithm. Quickselect performs better the more elements you want. If, however, k were not increased in scale with n you should stick with linear searching. Quickselect & friends modify the original array, so are less suitable if you cannot do this in place because you need a bunch more storage and a lot of copying that the algorithm complexity does not include.

If n is huge (~1e20) you would want to find the k largest from each of a number of partitions of the input array and then find the k-largest from the aggregate of those results, so that you were not trying to analyse more data than you can fit in memory at a time, and to allow the operation to be parallelised efficiently.

piyush121 · Accepted Answer · 2016-02-29 22:54:09Z

Following is a heap based elegant solution in Java with complexity O(nlogK). It's not the most efficient but i think it's easy enough to understand. You can change Integer to Float if you want a float based solution

import java.util.Arrays;
import java.util.PriorityQueue;

public class FindKLargest {

public static void find(int[] A, int k) {

    PriorityQueue<Integer> pq = new PriorityQueue<>(k);// Min heap because the element has to be greater
                                                        // than the smallest element in the heap in order
                                                        // to be qualified to be a member of top k elements.
    for (int i = 0; i < A.length; i++) {
        if (i < k) // add until heap is filled with k elements.
            pq.add(A[i]);
        else if (pq.peek() < A[i]) { // check if it's bigger than the
                                        // smallest element in the heap.
            pq.poll();
            pq.add(A[i]);
        }
    }
    int[] topK = new int[pq.size()];
    int index = 0;
    while (index != k)
        topK[index++] = pq.poll();
    System.out.println(Arrays.toString(topK));
}

public static void main(String[] args) {
    int[] arr = { 1, -2, -3, -4, -5 };
    find(arr, 4);
}

}

Magnus Robert Carl Wootton · Accepted Answer · 2015-05-10 19:10:33Z

if you got a fancy gpu, I can tell you how to compute the top huge k of huge n instances all at the same time, so spread em out on a texture, per instance, and add blend onto a texture with their "height" as the position along the texture.

But note you have to guess an acceptable range or know it, or you wont spread to your max detail you could have had.

you clone positions. (you should get a 2, if there is 2 on it, 10 if there is 10 on it.) across all instances. (just say its all on an 8192x8192 texture, 64x64 of these "height" boxes.) and you also skip slots with 0 counts.

then do a mipped add hierarchy, except you do it like a binary tree, you only treat like its 1 dimension, so take the 2 previous numbers and add them together, and keep doing it for every binary mip.

then we use these mips (which have collected counts) to discover the approximate location of the k, using all mips in the process, do this on a final thread, youll take huge chunks out of it, then slowly use the more detailed mips to find the per pixel value, that k sits at.

it makes more sense to do this, if it were all instanced again, then its a thread per threshold discovery. (just say you were running an ANN 128x128 times at once, (translational invarience anyone?) then it makes perfect sense.

and achieve the threshold height for that count, but its approximate... so you get an approximate k. for n lists.

You can do a little more work to get the exact k, but in a similarity match, but if you can get away with it being approximate, like if it were getting the top ~k activations, then dont worry about it.

But mag! why is it important to have the approximate top k? I think its important for a cheaper similarity match!!!

Anup Mondal · Accepted Answer · 2025-02-01 16:20:17Z

0

Create a maxheap and return the top K elements!

public static List<Integer> findTopN(int[] nums, int n) {
        
        if (nums == null || nums.length == 0 || n <= 0) {
            return Collections.emptyList();
        }

        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());

        // Insert all elements into the max heap
        for (int num : nums) {
            maxHeap.add(num);
        }

        // Extract the top N elements
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < n && !maxHeap.isEmpty(); i++) {
            result.add(maxHeap.poll());
        }

        return result;
    }

edited Feb 1 at 16:20

answered Jan 30 at 5:44

Anup Mondal

411 silver badge8 bronze badges

2 Comments

Jeremy Caney Jan 31 at 0:30

Thank you for your interest in contributing to the Stack Overflow community. This question already has existing answers. It would be useful to explain how your approach is different, under what circumstances your approach might be preferred, and/or why you think the previous answers aren’t sufficient. Can you kindly edit your answer to offer an explanation?

Anup Mondal Feb 1 at 16:18

My code is short and simple, its creating a maxheap and returning the top K elements.

Collectives™ on Stack Overflow

Optimal algorithm for returning top k values from an array of length N

6 Answers 6

3 Comments

2 Comments

Comments

Comments

1 Comment

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

3 Comments

2 Comments

Comments

Comments

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related