
I have a hashtable that may contain around 1-5 million records. I need to iterate over it to select some of the entries and then sort the selected entries in some order. I was planning to use a linked list to maintain the list of pointers to the hashtable entries that I have to sort, but with a linked list the only good sorting option I came across is merge sort. Given that the list may contain around 5 million records, should merge sort be used?

I am not restricted to a linked list for maintaining the list of pointers. I could also use an array so that I can use heap sort, but deciding on the size of that array would be challenging, considering that this whole operation is pretty frequent and different instances of it can run in parallel. Also, the number of entries filtered out of the hashtable for sorting can vary from 1 to almost all records in the hashtable. Can someone suggest what approach would best fit here?

4 Comments

  • Have you considered a terasort on a Hadoop cluster? Commented May 7, 2014 at 6:53
  • What's the size of each record? Your approach of a linked list of pointers (this DS is called a LinkedHashSet) sorted with mergesort sounds perfectly fine to me; is it taking too much time in your experiments? Commented May 7, 2014 at 6:55
  • Each element in the linked list would contain just 2 pointers: one to an entry in the hashtable and the other to the next node. So on a 64-bit system, each node would be 16 bytes. I haven't tested the time taken, but I was concerned whether too many 16-byte nodes on stack memory would cause a problem. (AFAIK merge sort uses recursion and hence would use the stack only.) Commented May 7, 2014 at 7:04
  • A merge sort is an excellent option. With a linked list the sort needs no extra space, and the algorithm is very efficient. Sorting 5 million items with merge sort would be trivial on modern hardware. How often is "pretty frequent"? Commented May 7, 2014 at 10:12
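On the stack concern raised above: a recursive merge sort on a linked list only recurses about log2(n) levels deep (roughly 23 levels for 5 million nodes), and the nodes themselves stay on the heap; they are only relinked, never copied. A minimal sketch of such a sort in C, assuming a hypothetical struct entry for the hashtable entries and a caller-supplied comparison function (both names are illustrative, not from the question):

    #include <stddef.h>

    struct entry;                       /* opaque: whatever your hashtable stores */

    /* One list node: a pointer into the hashtable plus the next link. */
    struct node {
        struct entry *item;
        struct node  *next;
    };

    /* Split the list into two halves with the slow/fast pointer trick. */
    static struct node *split(struct node *head)
    {
        struct node *slow = head, *fast = head->next;
        while (fast && fast->next) {
            slow = slow->next;
            fast = fast->next->next;
        }
        struct node *second = slow->next;
        slow->next = NULL;
        return second;
    }

    /* Merge two sorted lists; cmp() returns <0, 0 or >0 like strcmp(). */
    static struct node *merge(struct node *a, struct node *b,
                              int (*cmp)(const struct entry *, const struct entry *))
    {
        struct node dummy = { 0 }, *tail = &dummy;
        while (a && b) {
            if (cmp(a->item, b->item) <= 0) { tail->next = a; a = a->next; }
            else                            { tail->next = b; b = b->next; }
            tail = tail->next;
        }
        tail->next = a ? a : b;
        return dummy.next;
    }

    /* Recursive merge sort; recursion depth is O(log n), not O(n). */
    static struct node *merge_sort(struct node *head,
                                   int (*cmp)(const struct entry *, const struct entry *))
    {
        if (head == NULL || head->next == NULL)
            return head;
        struct node *second = split(head);  /* head now holds the first half */
        return merge(merge_sort(head, cmp), merge_sort(second, cmp), cmp);
    }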

1 Answer


Try the simplest approach first:

  1. Implement a typical dynamic array, using realloc() for growth, perhaps doubling the capacity each time it grows. Growing to a million elements then takes only about 20 re-allocations.
  2. Sort the array with qsort().

Then profile that, and see where it hurts. If you're not memory-sensitive, increase the initial allocation for the array.
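A minimal sketch of that approach, assuming a hypothetical struct entry with an integer key field to sort by (swap in your real entry type and ordering):

    #include <stdlib.h>

    /* Hypothetical hashtable entry; in practice, whatever your table stores. */
    struct entry {
        int key;
        /* ... other fields ... */
    };

    /* Dynamic array of pointers into the hashtable, grown by doubling. */
    struct ptr_array {
        struct entry **items;
        size_t count;
        size_t capacity;
    };

    /* Append one pointer, doubling the capacity with realloc() when full. */
    static int ptr_array_push(struct ptr_array *a, struct entry *e)
    {
        if (a->count == a->capacity) {
            size_t new_cap = a->capacity ? a->capacity * 2 : 1024;
            struct entry **tmp = realloc(a->items, new_cap * sizeof *tmp);
            if (tmp == NULL)
                return -1;              /* out of memory */
            a->items = tmp;
            a->capacity = new_cap;
        }
        a->items[a->count++] = e;
        return 0;
    }

    /* qsort() comparator: ascending by key, comparing the pointed-to entries. */
    static int cmp_entries(const void *pa, const void *pb)
    {
        const struct entry *a = *(struct entry *const *)pa;
        const struct entry *b = *(struct entry *const *)pb;
        return (a->key > b->key) - (a->key < b->key);
    }

After filtering the desired entries into the array (calling ptr_array_push() for each match while iterating the hashtable), sort with qsort(arr.items, arr.count, sizeof *arr.items, cmp_entries) and free arr.items when done.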


3 Comments

  • Given a pointer size of 4 bytes, a max array size of 5 million, and a modern machine (i.e. several GB of DRAM), it's not unreasonable to simply start with an allocation of 20MB for the array. It will be faster than thrashing the memory with realloc.
  • But considering that this operation might run in parallel, I will need around 10-12 arrays of 20MB size. Will that be a good option?
  • As long as the arrays fit into physical memory (i.e. don't get swapped to disk by the virtual memory manager), this approach works well. 12 arrays at 20MB is about 0.25GB, so that shouldn't be a problem unless you are already running near the limits of memory.
