0

Recently I've played with parallel loops. I started with simple tasks as it is filling in a huge array.

However, the creation time was half of second when the code was NOT parallel and 6.03s (sic!) when the code WAS parallel.

How come?

I thought there is no simpler task to show benefits from parallelism as dividing huge task to smaller one, as I did.

Could some one explain?

12GB RAM, i7 extreme 980 (6 cores + 6 virtual) 3.06G

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace ParallelLoop
{
    class Program
    {

        static void Main(string[] args)
        {

            int Min = 0;
            int Max = 10;
            int ArrSize = 150000000;

            Stopwatch sw2 = new Stopwatch();
            Stopwatch sw3 = new Stopwatch();


            int[] test2 = new int[ArrSize];
            int[] test3 = new int[ArrSize];

            Random randNum = new Random();

            sw2.Start();
            for (int i = 0; i < test2.Length; i++)
            {
                test2[i] = i;
                //test2[i] = randNum.Next(Min, Max);
            }
            sw2.Stop();

            Console.ReadKey();
            Console.WriteLine("Elapsed={0}", sw2.Elapsed);

            sw3.Start();

            Parallel.For(0, test3.Length, (j) =>
                {
                    test3[j] = j;
                    //test3[j] = randNum.Next(Min, Max);
                }
                );

            sw3.Stop();

            Console.WriteLine("Elapsed={0}", sw3.Elapsed);
            Console.ReadKey();

        }
    }
}
3
  • 4
    Just guessing here, but setting and array slot to an integer is so fast that the cost of using threads for this may be more than just setting all slots in a single loop. Setting up and switching threads is very, very expensive so if a task is really simple, it's not worth splitting it up. If a single task is complex and you have many identical tasks, then the cost of using threads is usually dwarfed by the cost of the tasks. Commented Mar 13, 2015 at 20:29
  • 2
    Try something more complex, it could be that the compiler is using a heavy handed optimization that it only detects in the former. For instance turning your loads into vectorized loads which can be a lot faster. Commented Mar 13, 2015 at 20:31
  • 1
    Parallelism should only be used when you're doing actual work. Here, your problem is simple enough that a single core can easily max out your memory bandwidth. When this is the case, adding parallelism will just slow it down. Commented Mar 13, 2015 at 21:27

4 Answers 4

2

While the other answers do have valid point, they don't provide correct solution. You can use threads to improve the performance, but you must do it in the right way. In your case, you simply divide the whole array into N blocks (where N is amount of cores you have) and have each thread work in it's own block, without touching any other's block. This way, they don't have to worry about blocking each other.

Also note of warning. Random is not thread-save, so you should make sure that each thread has it's own instance. This will reduce randomness, but it is only way to use it with parallelism.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace ParallelLoop
{
    class Program
    {

        static void Main(string[] args)
        {

            int Min = 0;
            int Max = 10;
            int ArrSize = 150000000;

            Stopwatch sw2 = new Stopwatch();
            Stopwatch sw3 = new Stopwatch();
            Stopwatch sw4 = new Stopwatch();

            int[] test2 = new int[ArrSize];
            int[] test3 = new int[ArrSize];
            int[] test4 = new int[ArrSize];

            Random randNum = new Random();

            sw2.Start();
            for (int i = 0; i < test2.Length; i++)
            {
                test2[i] = i;
                //test2[i] = randNum.Next(Min, Max);
            }
            sw2.Stop();

            //Console.ReadKey();
            Console.WriteLine("Elapsed={0}", sw2.Elapsed);

            sw3.Start();

            Parallel.For(0, test3.Length, (j) =>
            {
                test3[j] = j;
                //test3[j] = randNum.Next(Min, Max);
            }
                );

            sw3.Stop();

            Console.WriteLine("Elapsed={0}", sw3.Elapsed);

            sw4.Start();

            int numberOfCores = 4;

            int itemsPerCore = ArrSize / numberOfCores;

            for (int i = 0; i < numberOfCores; i++)
            {
                int x = i; // for lambda closure
                var thread = new Thread(new ThreadStart(() =>
                {
                    int from = itemsPerCore * x;
                    int to = itemsPerCore * (x + 1);
                    for (int j = from; j < to; j++)
                    {
                        test4[j] = j;
                        //test4[j] = randNum.Next(Min, Max);                        
                    }
                }));

                thread.Start();
            }

            sw4.Stop();

            Console.WriteLine("Elapsed={0}", sw4.Elapsed);

            Console.ReadKey();
        }
    }
}
Sign up to request clarification or add additional context in comments.

Comments

1

I have decided to make my comment an answer. I haven't actually run the sample code included in the question but your task inside the Parallel loop is very simple: setting an array slot to an integer value is about the simplest thing the CPU can do and it does it very, very fast.

Compared to this, the cost of creating and switching threads to split up your initialization loop is huge: a thread switch can be tens of thousands of CPU cycles and the more threads you have, the more switching must happen to keep them running.

So in the case of your example, the thread switching code likely eats up any possible gains you get from splitting up your otherwise very long loop. If you tried doing something more complex inside your loop, using a Parallel loop would gain you more because the cost of the thread switch (which is still huge) would be dwarfed by the cose of a single loop iteration.

Joe Duffy has several articles that mention the cost of a context switch - here's one worth reading - he mentions that the cost is somewhere between 4,000+ to 10,000+ CPU cycles to perform a context switch.

Comments

1

As xxbbcc has pointed out, it is probably the context switching that is taking longer than the setting of a simple array value. You could simulate a long running operation by sleeping the thread to get a better idea of the performance gain:

[TestMethod]
public void One()
{
    int Min = 0;
    int Max = 10;
    int ArrSize = 1500;

    Stopwatch sw2 = new Stopwatch();
    Stopwatch sw3 = new Stopwatch();


    int[] test2 = new int[ArrSize];
    int[] test3 = new int[ArrSize];

    Random randNum = new Random();

    sw2.Start();
    for (int i = 0; i < test2.Length; i++)
    {
        test2[i] = i;
        Thread.Sleep(10);
        //test2[i] = randNum.Next(Min, Max);
    }
    sw2.Stop();

    Console.WriteLine("Elapsed={0}", sw2.Elapsed);

    sw3.Start();

    Parallel.For(0, test3.Length, (j) =>
    {
        test3[j] = j;
        Thread.Sleep(10);
        //test3[j] = randNum.Next(Min, Max);
    }
        );

    sw3.Stop();

    Console.WriteLine("Elapsed={0}", sw3.Elapsed);
}

Which yields the output:

Elapsed=00:00:16.4813668
Elapsed=00:00:00.7327932

Comments

1

I do not get the same drastic difference as you do (on an i7 920, nominally 2.66 GHz, 6 GB RAM - hence the smaller array size in the following code) between a simple loop and using simple parallelism.

As Euphoric pointed out, you need to partition the work - there is an overload of Parallel.ForEach which takes a RangePartitioner to do that for you, and in my testing it improved the speed somewhat:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            int ArrSize = 100000000;

            Stopwatch sw2 = new Stopwatch();

            int[] test2 = new int[ArrSize];
            int[] test3 = new int[ArrSize];
            int[] test4 = new int[ArrSize];

            Random randNum = new Random();

            sw2.Start();

            for (int i = 0; i < test2.Length; i++)
            {
                test2[i] = i;
            }

            sw2.Stop();

            Console.WriteLine("Linear elapsed:          {0}", sw2.Elapsed);

            sw2.Restart();

            Parallel.For(0, test3.Length, (j) =>
            {
                test3[j] = j;
            }
                );

            sw2.Stop();

            Console.WriteLine("Simple parallel elapsed: {0}", sw2.Elapsed);

            sw2.Restart();

            var rangePartitioner = Partitioner.Create(0, test4.Length);
            Parallel.ForEach(rangePartitioner, (range, loopState) =>
            {
                for (int j = range.Item1; j < range.Item2; j++)
                {
                    test4[j] = j;
                }
            });

            sw2.Stop();

            Console.WriteLine("Partitioned elapsed:     {0}", sw2.Elapsed);

            Console.ReadLine();

        }
    }
}

Sample results:

Linear elapsed: 00:00:00.2312487
Simple parallel elapsed: 00:00:00.3735587
Partitioned elapsed: 00:00:00.1239631

I compiled for x64 and ran in release mode, not debug, as that is what counts in the end.

You also need to consider the processor's cache. There is an interesting article on that at Cache-Friendly Code: Solving Manycore's Need for Faster Data Access.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.