For educational purposes, I wrote a selection algorithm based on a Merge sort. I would like to improve performance.
public IEnumerable<T> MergeSort<T>(List<T> list, int left, int right, Comparer<T> comparer)
{
if (left == right)
{
yield return list[left];
yield break;
}
//divide
int mid = (left + right) / 2;
var firstEnumerable = MergeSort(list, left, mid, comparer);
var secondEnumerable = MergeSort(list, mid + 1, right, comparer);
//merge
using (var firstEnumerator = firstEnumerable.GetEnumerator())
using (var secondEnumerator = secondEnumerable.GetEnumerator())
{
bool first = firstEnumerator.MoveNext();
bool second = secondEnumerator.MoveNext();
while (first && second)
{
if (comparer.Compare(firstEnumerator.Current, secondEnumerator.Current) < 0)
{
yield return firstEnumerator.Current;
first = firstEnumerator.MoveNext();
}
else
{
yield return secondEnumerator.Current;
second = secondEnumerator.MoveNext();
}
}
while (first)
{
yield return firstEnumerator.Current;
first = firstEnumerator.MoveNext();
}
while (second)
{
yield return secondEnumerator.Current;
second = secondEnumerator.MoveNext();
}
}
}
What it does : it recursively divide the list into smaller sequences (until the sequence has only one element). Then, it repeatedly merge sequences to produce new sorted ones until there is only 1 sequence remaining.
The main idea is to use IEnumerable<T> so there is no need to allocate arrays to merge results AND I can sort the list lazily and stop when I want. Example :
var list = ... // 1.000.000 elements
MergeSort(list, 0, list.length - 1, comparer).Take(50);
The actual performance to sort 1M integers and return the first 50 ones is 600 ms why I found to be slower than expected. Returning only the first element give a similar performance.
My main concern is the recursive calls between Enumerators/IEnumerables. I have tried to wrote the same logic using a stack (to fully avoid recursion) but I don't know how to implement it.
I have also tried to isolate the merge code part (the code inside the two usings statements) into a separate method but it run considerably slower (about 1 sec). I don't know why.
I could easily parallelise the algorithm or use another selection algorithm (like quick select) but this is outside the scope of this question.
int mid = (left + right) / 2;is a common fault in divide-and-conquer algorithms, which leads to arithmetic overflow if you have more items to sort thanINT_MAX/2. Useint mid = left + (right - left) / 2;instead. \$\endgroup\$