
I would like to select the top n values from a dataset, but ignore elements based on what I have already selected. That is, given a set of points (x, y), I would like to select the points with the top 100 values of x (the x values are all distinct), but never select a point whose y equals the y of any already-selected point. I would like to make sure that the highest values of x are prioritized.

Is there any existing algorithm for this, or at least similar ones? I have a huge amount of data and would like to do this as efficiently as possible. Memory is not as much of a concern.

  • how huge is the data? Commented Aug 13, 2019 at 21:02
  • Do you have a sample file please? Any indication of Operating System and toolset? Commented Aug 13, 2019 at 21:51
  • This is conceptually an easy algorithm to implement. Please post your attempt and I'll write up a class to implement it for you. The current answers explain what to do, though! Commented Aug 13, 2019 at 22:40

2 Answers


You can do this in O(n log k) time, where n is the number of values in the dataset and k is the number of top values you'd like to keep. A small Python sketch follows the steps below.

  1. Store the values you wish to exclude in a hash table.
  2. Make an empty min-heap.
  3. Iterate over all of the values; for each value:
    1. If it is in the hash table, skip it.
    2. Otherwise, if the heap contains fewer than k values, add it to the heap.
    3. Otherwise (the heap already holds k values), if the value you're looking at is greater than the smallest member of the min-heap, pop that smallest value and push the new one.
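
To make the steps concrete, here is a minimal Python sketch of this approach, assuming the set of y-values to exclude is known up front and the input is an iterable of (x, y) tuples; the function name top_k_excluding is just for illustration.

    import heapq

    def top_k_excluding(points, k, excluded_ys):
        """Return the k points with the largest x, skipping any point whose
        y is in excluded_ys. Runs in O(n log k) time with O(k) extra space."""
        excluded = set(excluded_ys)   # step 1: hash table of values to exclude
        heap = []                     # step 2: empty min-heap, keyed on x

        for x, y in points:           # step 3: single pass over the data
            if y in excluded:         # 3.1: skip excluded values
                continue
            if len(heap) < k:         # 3.2: heap not full yet, just add
                heapq.heappush(heap, (x, y))
            elif x > heap[0][0]:      # 3.3: beats the current smallest, replace it
                heapq.heapreplace(heap, (x, y))

        # The heap now holds the k largest surviving points; return them in descending x order.
        return sorted(heap, reverse=True)

    # Example: keep the top 2 points by x, excluding any point with y == 'b'.
    points = [(10, 'a'), (7, 'b'), (12, 'c'), (3, 'a'), (9, 'd')]
    print(top_k_excluding(points, 2, {'b'}))   # [(12, 'c'), (10, 'a')]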

I will share my thoughts. Since the author has not specified the size of the data to be processed, I will assume it is too large to be handled by a single machine, and I will also assume that the author is familiar with Hadoop.
So I would suggest using MapReduce as follows:

  1. Mappers simply emit the (x, y) pairs.
  2. Combiners select the k pairs with the largest values of x (k = 100 in this case), maintaining a hash set of the y's seen so far to avoid duplicates, and then emit the k pairs they found.
  3. There should be only one reducer in this job, since it has to receive all pairs from the combiners and finalize the job by selecting the top k pairs one last time. The reducer's implementation is identical to the combiner's.

The number of combiners should be chosen with memory in mind, since whichever method is used to select the top k pairs from the incoming data (sorting, a heap, or anything else) the selection is done in memory, as is the hash set of unique y's (see the sketch below).
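
To illustrate, here is a rough Python sketch of the selection step that both the combiners and the single reducer would perform, assuming the pairs each one receives fit in memory as noted above; the Hadoop wiring itself (Streaming or the Java API) is omitted, and the name top_k_unique_y is just for illustration.

    def top_k_unique_y(pairs, k=100):
        """Keep the k pairs with the largest x while never keeping two pairs
        that share the same y. Pairs are visited in descending x order, so
        when several pairs compete for a y, the one with the highest x wins."""
        seen_ys = set()     # the hash set of unique y's mentioned above
        selected = []
        for x, y in sorted(pairs, reverse=True):   # largest x first
            if y in seen_ys:
                continue
            seen_ys.add(y)
            selected.append((x, y))
            if len(selected) == k:
                break
        return selected

    # Example: with k=3, the duplicate y 'a' with the smaller x is dropped.
    print(top_k_unique_y([(12, 'c'), (10, 'a'), (9, 'a'), (7, 'b')], k=3))
    # [(12, 'c'), (10, 'a'), (7, 'b')]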

