
I would like to select the top n values from a dataset, but ignore elements based on what I have already selected. That is, given a set of points (x, y), I would like to select the points with the top 100 values of x (the x values are all distinct), but never select a point whose y equals the y of any already-selected point. I would like to make sure that the highest values of x are prioritized.

Is there any existing algorithm for this, or at least similar ones? I have a huge amount of data and would like to do this as efficiently as possible. Memory is not as much of a concern.

  • how huge is the data? Commented Aug 13, 2019 at 21:02
  • Do you have a sample file please? Any indication of Operating System and toolset? Commented Aug 13, 2019 at 21:51
  • This is conceptually an easy algorithm to implement. Please post your attempt and I'll write up a class to implement it for you. The current answers explain what to do, though! Commented Aug 13, 2019 at 22:40

2 Answers


You can do this in O(n log k) time, where n is the number of values in the dataset and k is the number of top values you'd like to keep. A small Python sketch follows the steps below.

  1. Store the values you wish to exclude in a hash table.
  2. Make an empty min-heap.
  3. Iterate over all of the values; for each value:
    1. If it is in the hash table, skip it.
    2. Otherwise, if the heap contains fewer than k values, add it to the heap.
    3. Otherwise (the heap already holds k values), if the value you're looking at is greater than the smallest member of the min-heap, pop that smallest value and push the new one.
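
To make the steps concrete, here is a minimal Python sketch of this approach, assuming the set of y-values to exclude is known up front and the input is an iterable of (x, y) tuples; the function name top_k_excluding is just for illustration.

    import heapq

    def top_k_excluding(points, k, excluded_ys):
        """Return the k points with the largest x, skipping any point whose
        y is in excluded_ys. Runs in O(n log k) time with O(k) extra space."""
        excluded = set(excluded_ys)   # step 1: hash table of values to exclude
        heap = []                     # step 2: empty min-heap, keyed on x

        for x, y in points:           # step 3: single pass over the data
            if y in excluded:         # 3.1: skip excluded values
                continue
            if len(heap) < k:         # 3.2: heap not full yet, just add
                heapq.heappush(heap, (x, y))
            elif x > heap[0][0]:      # 3.3: beats the current smallest, replace it
                heapq.heapreplace(heap, (x, y))

        # The heap now holds the k largest surviving points; return them in descending x order.
        return sorted(heap, reverse=True)

    # Example: keep the top 2 points by x, excluding any point with y == 'b'.
    points = [(10, 'a'), (7, 'b'), (12, 'c'), (3, 'a'), (9, 'd')]
    print(top_k_excluding(points, 2, {'b'}))   # [(12, 'c'), (10, 'a')]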

I will share my thoughts. Since the author has not specified the size of the data to be processed, I will assume it is too large to be handled by a single machine, and I will also assume that the author is familiar with Hadoop.
So I would suggest using MapReduce as follows:

  1. Mappers simply emit the (x, y) pairs.
  2. Combiners select the k pairs with the largest values of x (k = 100 in this case), maintaining a hash set of the y's seen so far to avoid duplicates, and then emit the k pairs they found.
  3. There should be only one reducer in this job, since it has to receive all pairs from the combiners and finalize the job by selecting the top k pairs one last time. The reducer's implementation is identical to the combiner's.

The number of combiners should be chosen with memory in mind, since whichever method is used to select the top k pairs from the incoming data (sorting, a heap, or anything else) the selection is done in memory, as is the hash set of unique y's (see the sketch below).
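
To illustrate, here is a rough Python sketch of the selection step that both the combiners and the single reducer would perform, assuming the pairs each one receives fit in memory as noted above; the Hadoop wiring itself (Streaming or the Java API) is omitted, and the name top_k_unique_y is just for illustration.

    def top_k_unique_y(pairs, k=100):
        """Keep the k pairs with the largest x while never keeping two pairs
        that share the same y. Pairs are visited in descending x order, so
        when several pairs compete for a y, the one with the highest x wins."""
        seen_ys = set()     # the hash set of unique y's mentioned above
        selected = []
        for x, y in sorted(pairs, reverse=True):   # largest x first
            if y in seen_ys:
                continue
            seen_ys.add(y)
            selected.append((x, y))
            if len(selected) == k:
                break
        return selected

    # Example: with k=3, the duplicate y 'a' with the smaller x is dropped.
    print(top_k_unique_y([(12, 'c'), (10, 'a'), (9, 'a'), (7, 'b')], k=3))
    # [(12, 'c'), (10, 'a'), (7, 'b')]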

