
I have a hashtable that may contain around 1-5 million records. I need to iterate over it to select some of the entries and then sort the selected entries in some order. I was planning to use a linked list to maintain the list of pointers to the hashtable entries that I have to sort, but with a linked list the only good sorting option I came across is merge sort. Given that the list may contain around 5 million records, should merge sort be used?

I am not restricted to a linked list for maintaining the list of pointers. I could also use an array so that I can use heap sort, but deciding on the size of that array would be challenging, considering that this whole operation is pretty frequent and different instances of it can run in parallel. Also, the number of entries filtered out of the hashtable for sorting can vary from 1 to almost all records in the hashtable. Can someone suggest what approach would best fit here?

4 Comments

  • Have you considered a terasort on a Hadoop cluster? Commented May 7, 2014 at 6:53
  • What's the size of each record? Your approach of a linked list of pointers (this DS is called a LinkedHashSet) sorted with mergesort sounds perfectly fine to me; is it taking too much time in your experiments? Commented May 7, 2014 at 6:55
  • Each element in the linked list would contain just 2 pointers: one to an entry in the hashtable and the other to the next node. So on a 64-bit system, each node would be 16 bytes. I haven't tested the time taken, but I was concerned whether too many 16-byte nodes on stack memory would cause a problem. (AFAIK merge sort uses recursion and hence would use the stack only.) Commented May 7, 2014 at 7:04
  • A merge sort is an excellent option. With a linked list the sort needs no extra space, and the algorithm is very efficient. Sorting 5 million items with merge sort would be trivial on modern hardware. How often is "pretty frequent"? Commented May 7, 2014 at 10:12
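On the stack concern raised above: a recursive merge sort on a linked list only recurses about log2(n) levels deep (roughly 23 levels for 5 million nodes), and the nodes themselves stay on the heap; they are only relinked, never copied. A minimal sketch of such a sort in C, assuming a hypothetical struct entry for the hashtable entries and a caller-supplied comparison function (both names are illustrative, not from the question):

    #include <stddef.h>

    struct entry;                       /* opaque: whatever your hashtable stores */

    /* One list node: a pointer into the hashtable plus the next link. */
    struct node {
        struct entry *item;
        struct node  *next;
    };

    /* Split the list into two halves with the slow/fast pointer trick. */
    static struct node *split(struct node *head)
    {
        struct node *slow = head, *fast = head->next;
        while (fast && fast->next) {
            slow = slow->next;
            fast = fast->next->next;
        }
        struct node *second = slow->next;
        slow->next = NULL;
        return second;
    }

    /* Merge two sorted lists; cmp() returns <0, 0 or >0 like strcmp(). */
    static struct node *merge(struct node *a, struct node *b,
                              int (*cmp)(const struct entry *, const struct entry *))
    {
        struct node dummy = { 0 }, *tail = &dummy;
        while (a && b) {
            if (cmp(a->item, b->item) <= 0) { tail->next = a; a = a->next; }
            else                            { tail->next = b; b = b->next; }
            tail = tail->next;
        }
        tail->next = a ? a : b;
        return dummy.next;
    }

    /* Recursive merge sort; recursion depth is O(log n), not O(n). */
    static struct node *merge_sort(struct node *head,
                                   int (*cmp)(const struct entry *, const struct entry *))
    {
        if (head == NULL || head->next == NULL)
            return head;
        struct node *second = split(head);  /* head now holds the first half */
        return merge(merge_sort(head, cmp), merge_sort(second, cmp), cmp);
    }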

1 Answer


Try the simplest approach first:

  1. Implement a typical dynamic array, using realloc() for growth, perhaps doubling the capacity each time it grows. Growing to a million elements then takes only about 20 re-allocations.
  2. Sort the array with qsort().

Then profile that, and see where it hurts. If you're not memory-sensitive, increase the initial allocation for the array.
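A minimal sketch of that approach, assuming a hypothetical struct entry with an integer key field to sort by (swap in your real entry type and ordering):

    #include <stdlib.h>

    /* Hypothetical hashtable entry; in practice, whatever your table stores. */
    struct entry {
        int key;
        /* ... other fields ... */
    };

    /* Dynamic array of pointers into the hashtable, grown by doubling. */
    struct ptr_array {
        struct entry **items;
        size_t count;
        size_t capacity;
    };

    /* Append one pointer, doubling the capacity with realloc() when full. */
    static int ptr_array_push(struct ptr_array *a, struct entry *e)
    {
        if (a->count == a->capacity) {
            size_t new_cap = a->capacity ? a->capacity * 2 : 1024;
            struct entry **tmp = realloc(a->items, new_cap * sizeof *tmp);
            if (tmp == NULL)
                return -1;              /* out of memory */
            a->items = tmp;
            a->capacity = new_cap;
        }
        a->items[a->count++] = e;
        return 0;
    }

    /* qsort() comparator: ascending by key, comparing the pointed-to entries. */
    static int cmp_entries(const void *pa, const void *pb)
    {
        const struct entry *a = *(struct entry *const *)pa;
        const struct entry *b = *(struct entry *const *)pb;
        return (a->key > b->key) - (a->key < b->key);
    }

After filtering the desired entries into the array (calling ptr_array_push() for each match while iterating the hashtable), sort with qsort(arr.items, arr.count, sizeof *arr.items, cmp_entries) and free arr.items when done.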


3 Comments

  • Given a pointer size of 4 bytes, a max array size of 5 million, and a modern machine (i.e. several GB of DRAM), it's not unreasonable to simply start with an allocation of 20MB for the array. It will be faster than thrashing the memory with realloc.
  • But considering that this operation might run in parallel, I will need around 10-12 arrays of 20MB size. Will that be a good option?
  • As long as the arrays fit into physical memory (i.e. don't get swapped to disk by the virtual memory manager), this approach works well. 12 arrays at 20MB is about 0.25GB, so that shouldn't be a problem unless you are already running near the limits of memory.
