
I'm using the code below to get the subsets of a list that sum to a target value. However, it only uses 1 CPU core. My PC has 64 cores, so I want to use 100% of the CPU to speed up the process. Could you please help me update the code to use Parallel.ForEach (multiple threads)?

using System;
using System.Collections.Generic;
using System.Linq;

// Recursively enumerates every subset of `numbers` whose elements sum to `target`.
IEnumerable<List<int>> subset_sum(IEnumerable<int> numbers, int target,
    IEnumerable<int> partial = null, int partial_sum = 0)
{
    partial ??= Enumerable.Empty<int>();
    if (partial_sum == target) yield return partial.ToList();
    if (partial_sum >= target) yield break; // prune (assumes non-negative numbers)
    foreach (var (n, i) in numbers.Select((n, i) => (n, i)))
    {
        var remaining = numbers.Skip(i + 1); // the elements after the current one
        foreach (var subset in subset_sum(
            remaining, target, partial.Append(n), partial_sum + n))
                yield return subset;
    }
}

var numbers = new List<int> { 3, 9, 8, 4, 5, 7, 10 };
var target = 15;
foreach (var subset in subset_sum(numbers, target))
    Console.WriteLine($"[{string.Join(", ", subset)}] = {subset.Sum()}");

This seems like a very inefficient way to go about this problem. Every time Skip() is called it does an internal loop over the skipped elements. Fix the single-threaded version first and then look into parallelizing it. Commented Apr 20, 2022 at 14:03
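
For illustration, here is a minimal index-based sketch of what this comment suggests (hypothetical code, not the commenter's; it assumes the same using directives as in the question). Passing a start index instead of calling Skip() avoids re-walking the prefix of the sequence on every recursive call:

IEnumerable<List<int>> subset_sum_indexed(IReadOnlyList<int> numbers, int target,
    int start = 0, IEnumerable<int> partial = null, int partial_sum = 0)
{
    partial ??= Enumerable.Empty<int>();
    if (partial_sum == target) yield return partial.ToList();
    if (partial_sum >= target) yield break;
    // Iterate by index; the recursive call starts at i + 1 instead of using Skip(i + 1).
    for (int i = start; i < numbers.Count; i++)
        foreach (var subset in subset_sum_indexed(
            numbers, target, i + 1, partial.Append(numbers[i]), partial_sum + numbers[i]))
            yield return subset;
}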

2 Answers

This post is not an answer to the question of how to parallelize the subset-finding algorithm. It just shows that a faster single-threaded implementation is possible. The implementation below is not recursive. It uses a Stack<int> (the pending stack) as the mechanism for finding all the combinations of the source sequence. It is about 7-8 times faster than the recursive implementation on my PC.

IEnumerable<IList<int>> FindSubsets(IEnumerable<int> source, int target)
{
    int[] sourceArray = source.ToArray();
    if (target == 0) yield return Array.Empty<int>(); // the empty subset

    List<int> subset = new(sourceArray.Length); // indices of the current subset
    int subsetSum = 0; // running sum of the current subset
    Stack<int> pending = new(sourceArray.Length); // start indices not yet explored
    pending.Push(0);

    while (pending.TryPop(out var index))
    {
        // Backtrack: shrink the current subset to the depth of the popped index.
        while (subset.Count > pending.Count)
        {
            subsetSum -= sourceArray[subset[subset.Count - 1]];
            subset.RemoveAt(subset.Count - 1);
        }
        // Extend the current subset with each element from index onwards.
        for (; index < sourceArray.Length; index++)
        {
            subset.Add(index);
            subsetSum += sourceArray[index];
            if (subsetSum == target)
                yield return subset.Select(i => sourceArray[i]).ToArray();
            // Schedule the branch where the element just added is excluded.
            if (index + 1 < sourceArray.Length) pending.Push(index + 1);
        }
    }
}
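
For completeness, a small hypothetical usage sketch with the numbers from the question (same output format as the original):

var numbers = new List<int> { 3, 9, 8, 4, 5, 7, 10 };
foreach (var subset in FindSubsets(numbers, 15))
    Console.WriteLine($"[{string.Join(", ", subset)}] = {subset.Sum()}");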

Parallelizing recursive algorithms is not easy. It's not obvious how to partition the workload when the function that does the work can create more work recursively. In this particular case it is possible to partition the work by splitting the numbers into two parts and using the combinations of one part as the partitions: for each partition, search the other part for combinations whose sum equals the target minus the sum of the partition. Below is an implementation of this idea. PLINQ is used for the parallelization:

// Requires using System.Collections.Concurrent for the Partitioner class,
// along with System, System.Collections.Generic and System.Linq.
static IEnumerable<IList<int>> FindSubsets(IEnumerable<int> source, int target)
{
    const int leftSize = 8; // 2^8 = 256 partitions
    int[] sourceArray = source.ToArray();
    int[] left = sourceArray.Take(leftSize).ToArray();
    int[] right = sourceArray.Skip(left.Length).ToArray(); // the remaining elements
    return Partitioner
        // NoBuffering: hand out the left-part combinations to the workers one by one.
        .Create(Combinations(left), EnumerablePartitionerOptions.NoBuffering)
        .AsParallel()
        .WithDegreeOfParallelism(Environment.ProcessorCount)
        .WithMergeOptions(ParallelMergeOptions.NotBuffered)
        .SelectMany(leftPart =>
        {
            // For each left-part combination, search the right part for
            // combinations that complete the sum to the target.
            int leftPartSum = leftPart.Sum();
            return FindSubsets_Internal(right, target - leftPartSum)
                .Select(rightPart => leftPart.Concat(rightPart).ToArray());
        });

    // Lazily enumerates every combination (subset) of the remaining elements,
    // starting with the empty one.
    static IEnumerable<int[]> Combinations(IEnumerable<int> remaining,
        IEnumerable<int> partial = null)
    {
        partial ??= Enumerable.Empty<int>();
        yield return partial.ToArray();
        foreach (var (n, i) in remaining.Select((n, i) => (n, i)))
        {
            var innerRemaining = remaining.Skip(i + 1);
            var inner = Combinations(innerRemaining, partial.Append(n));
            foreach (var subset in inner) yield return subset;
        }
    }

    // The same recursive search as in the question, applied to the right part only.
    static IEnumerable<int[]> FindSubsets_Internal(IEnumerable<int> remaining,
        int target, IEnumerable<int> partial = null, int partialSum = 0)
    {
        partial ??= Enumerable.Empty<int>();
        if (partialSum == target) yield return partial.ToArray();
        if (partialSum >= target) yield break;
        foreach (var (n, i) in remaining.Select((n, i) => (n, i)))
        {
            var innerRemaining = remaining.Skip(i + 1);
            var inner = FindSubsets_Internal(
                innerRemaining, target, partial.Append(n), partialSum + n);
            foreach (var subset in inner) yield return subset;
        }
    }
}

This implementation is not much faster than the original single-threaded method, because the recursive algorithm is quite memory-hungry and inefficient. Comparing the value of GC.GetTotalAllocatedBytes(true) before and after the execution reveals that the memory allocation is massive: around 700 MB per second on my PC. So the bottleneck is not the CPU but the bandwidth of the memory bus: the CPU cores spend most of their time waiting for data to be stored to or fetched from main memory, instead of calculating. At least that is my theory. For comparison, parallelizing the non-recursive algorithm that I posted in my other answer yields a reasonable performance improvement (around 2.4x on my quad-core PC).
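
For anyone who wants to reproduce the measurement, here is a minimal sketch of the GC.GetTotalAllocatedBytes comparison described above (the input collection and target are placeholders, not the actual benchmark inputs):

// Hypothetical measurement harness. GetTotalAllocatedBytes(precise: true) is
// exact but expensive, so it is called only once before and once after the work.
long before = GC.GetTotalAllocatedBytes(precise: true);
int count = FindSubsets(Enumerable.Range(1, 25), 100).Count(); // drain the query
long after = GC.GetTotalAllocatedBytes(precise: true);
Console.WriteLine($"Found {count:#,0} subsets, allocated {(after - before) / 1_000_000:#,0} MB");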

5 Comments

Thank you for your answer @Theodor. So if my PC has 64 cores, the condition should be if (wasScheduled && depth + i <= 64). Is that correct?
@ThanhNguyenViet No, the depth + i <= 5 condition is supposed to be near optimal for roughly any number of cores. The idea is to give each thread big chunks of work to do; we don't want to funnel everything through the queue. Ideally, for 64 cores, we would like to partition the work into exactly 64 equally sized chunks. This is impossible in this case, so it might be good enough to partition it into, say, 10,000 variably sized chunks. But if we partition it into, say, 10,000,000 feather-weight chunks, the synchronization overhead will kill all the benefits of parallelism.
@ThanhNguyenViet I tried to gather more information about how exactly the above implementation partitions the work, but it's far from a trivial task. Augmenting the method to gather meaningful statistics might be harder than writing it in the first place. Honestly I don't think it's worth the effort. Even with perfect parallelization, this algorithm would only let you get results for a collection with 5-6 more numbers, since each added number doubles the number of possible combinations. AFAIK it's an NP-hard problem.
I used your code and found it only uses about 70-80% CPU instead of 100%; I don't know how to make it use 100% CPU. Still, I'm happy with your code: it's faster than mine (which used only about 30-40% CPU). Thank you for sharing it.
@ThanhNguyenViet I updated my answer with a different implementation that follows a more straightforward logic.
