Selection algorithm using merge sort and IEnumerable

Question

For educational purposes, I wrote a selection algorithm based on merge sort. I would like to improve its performance.

public IEnumerable<T> MergeSort<T>(List<T> list, int left, int right, Comparer<T> comparer)
{
    if (left == right)
    {
        yield return list[left];
        yield break;
    }

    //divide
    int mid = (left + right) / 2;
    var firstEnumerable = MergeSort(list, left, mid, comparer);
    var secondEnumerable = MergeSort(list, mid + 1, right, comparer);

    //merge
    using (var firstEnumerator = firstEnumerable.GetEnumerator())
    using (var secondEnumerator = secondEnumerable.GetEnumerator())
    {
        bool first = firstEnumerator.MoveNext();
        bool second = secondEnumerator.MoveNext();

        while (first && second)
        {
            if (comparer.Compare(firstEnumerator.Current, secondEnumerator.Current) < 0)
            {
                yield return firstEnumerator.Current;
                first = firstEnumerator.MoveNext();
            }
            else
            {
                yield return secondEnumerator.Current;
                second = secondEnumerator.MoveNext();
            }
        }

        while (first)
        {
            yield return firstEnumerator.Current;
            first = firstEnumerator.MoveNext();
        }

        while (second)
        {
            yield return secondEnumerator.Current;
            second = secondEnumerator.MoveNext();
        }
    }
}

What it does: it recursively divides the list into smaller sequences (until each sequence has only one element). Then it repeatedly merges sequences to produce new sorted ones until only one sequence remains.

The main idea is to use IEnumerable<T> so that there is no need to allocate arrays for the merge results, and so that I can sort the list lazily and stop whenever I want. Example:

var list = ... // 1.000.000 elements 
MergeSort(list, 0, list.Count - 1, comparer).Take(50);

Sorting 1M integers and returning the first 50 currently takes 600 ms, which I find slower than expected. Returning only the first element gives similar performance.

My main concern is the recursive calls between enumerators/IEnumerables. I have tried to write the same logic using a stack (to avoid recursion entirely), but I don't know how to implement it.

I have also tried to isolate the merge part (the code inside the two using statements) into a separate method, but it runs considerably slower (about 1 second). I don't know why.

I could easily parallelise the algorithm or use another selection algorithm (like quick select) but this is outside the scope of this question.

Code style looks fine. By the time it knows the first element, it has done almost all the work.
– paparazzo, Commented Jul 18, 2017 at 16:59
int mid = (left + right) / 2; is a common fault in divide-and-conquer algorithms, which leads to arithmetic overflow if you have more items to sort than INT_MAX/2. Use int mid = left + (right - left) / 2; instead.
– CiaPan, Commented Jul 19, 2017 at 14:42
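
To illustrate the overflow described in the comment above, here is a minimal sketch. The class and method names (MidpointSketch, NaiveMid, SafeMid) are hypothetical and not part of the original code, and the sketch assumes 0 <= left <= right.

```csharp
public static class MidpointSketch
{
    // Form used in the question: left + right can exceed int.MaxValue.
    // unchecked makes the wrap-around explicit regardless of compiler settings.
    public static int NaiveMid(int left, int right)
    {
        return unchecked((left + right) / 2);
    }

    // Form suggested in the comment: right - left cannot overflow
    // when 0 <= left <= right, so the result stays within [left, right].
    public static int SafeMid(int left, int right)
    {
        return left + (right - left) / 2;
    }
}
```

With left = 1500000000 and right = 2000000000, NaiveMid returns a negative number because the sum wraps around, while SafeMid returns 1750000000. Indices that large only occur with extremely big lists, so this matters mostly as a defensive habit.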

tigrou · Accepted Answer · 2017-07-19 16:31:08Z

I was able to get a performance increase (500 ms on average instead of 600 ms) by splitting the code into two methods: one that returns a sequence with a single element, and one that merges two IEnumerables. I think this is faster because each iterator method is simpler, so the code the compiler generates for the yield return statements is simpler too (AFAIK it is done using a state machine).

public IEnumerable<T> MergeSort<T>(List<T> list, int left, int right, Comparer<T> comparer)
{
    if (left == right)
    {
        return SingleValue(list[left]);
    }

    int mid = left + (right - left) / 2; // avoids overflow of left + right on very large lists
    var firstEnumerable = MergeSort(list, left, mid, comparer);
    var secondEnumerable = MergeSort(list, mid + 1, right, comparer);
    return Merge(firstEnumerable, secondEnumerable, comparer);
}

public static IEnumerable<T> SingleValue<T>(T value)
{
    yield return value;
} 

public static IEnumerable<T> Merge<T>(IEnumerable<T> firstEnumerable, IEnumerable<T> secondEnumerable, Comparer<T> comparer)
{
    using (var firstEnumerator = firstEnumerable.GetEnumerator())
    using (var secondEnumerator = secondEnumerable.GetEnumerator())
    { 
         //same as before
    }
}

Performance can be improved further by checking the size of the range to sort inside the MergeSort method. Below a certain threshold, another sort can be used (e.g. InsertionSort or SelectionSort):

public IEnumerable<T> MergeSort<T>(List<T> list, int left, int right, Comparer<T> comparer)
{
    if (right - left <= threshold)
    {
        return SelectionSort(list, left, right, comparer);
    }

    //...
}

I think SelectionSort is a good candidate because it can be implemented lazily: it can return the smallest number very early without having to sort the whole range (e.g. using yield return). The partial merge sort now takes about 15 ms to get the 50 smallest numbers out of 1M integers.
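
The answer does not show SelectionSort itself, so here is a minimal sketch of what a lazy version could look like. It keeps the same signature as the call above, but the body is an assumption rather than the author's actual implementation. In particular, it assumes that reordering the list[left..right] range in place is acceptable. The wrapper class name LazySortSketch is hypothetical.

```csharp
using System.Collections.Generic;

public static class LazySortSketch
{
    // Lazily yields list[left..right] in ascending order.
    // Assumption: the caller allows this range of the list to be reordered in place.
    public static IEnumerable<T> SelectionSort<T>(List<T> list, int left, int right, Comparer<T> comparer)
    {
        for (int i = left; i <= right; i++)
        {
            // Find the smallest remaining element in list[i..right].
            int min = i;
            for (int j = i + 1; j <= right; j++)
            {
                if (comparer.Compare(list[j], list[min]) < 0)
                {
                    min = j;
                }
            }

            // Swap it into position i, then hand it to the caller.
            T smallest = list[min];
            list[min] = list[i];
            list[i] = smallest;

            // Nothing past this point runs until the caller asks for the next element.
            yield return smallest;
        }
    }
}
```

For a list containing 5, 3, 8, 1, 4, calling SelectionSort(list, 0, list.Count - 1, Comparer<int>.Default).Take(2) yields 1 and 3 after only two passes over the range. Since each leaf of the merge tree covers a disjoint range, the in-place swaps in one leaf should not interfere with the others.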
