I've been following the tutorial series "Ray Tracing in One Weekend", which seems to be the canonical introduction to ray tracing.
I've been trying to speed up the code with OpenMP, but the results have been disappointing, and judging from this GitHub discussion, others have achieved much better speedups.
Here's what some of the users in that discussion report:

- "It's also worth noting that the changes are very local: write into a buffer instead of the std::cout line; a new method to write the buffer to a file; a single OpenMP line above the for loop over the rows of the output image."
- "I was able to achieve significant speedup (the need for which became painfully apparent in the final scene of the second book) by using #pragma omp parallel for before the multi-sample loop."
- "With OpenMP, you can achieve this with two lines of OpenMP annotations and not a single change to the code itself."
Using these strategies, I was unable to get any significant speedup.
I made a copy of the latest version of the GitHub repo (v4.0.1) and worked on the "In One Weekend" section.
I added the annotation #pragma omp parallel for reduction(+:pixel_color) above the sample loop (whose header is for (int sample = 0; sample < samples_per_pixel; sample++)), along with #pragma omp declare reduction(+ : vec3 : omp_out += omp_in) initializer(omp_priv(0,0,0)) to define the reduction over vec3.
I timed only how long it took to complete cam.render(world), using std::chrono::steady_clock. Rendering the default scene, this gave a speedup of only 1.74x, which feels suspiciously low given that I'm using 8 cores (verified with omp_get_num_procs()).
I then reverted to commit 2e5cc2e (for no other reason than that it was released on Dec 9, 2020, the same day another GitHub user posted about achieving significant speedup with OpenMP). I modified the code to write into a 2D vector image instead of streaming to stdout, and to write image to a file afterwards. I timed only how long it took to populate the colors in image. The modified code in scene.h looks like this:
std::vector<std::vector<color>> image(image_height, std::vector<color>(image_width));

omp_set_num_threads(8);
#pragma omp parallel for
for (int j = image_height-1; j >= 0; --j) {
    for (int i = 0; i < image_width; ++i) {
        color pixel_color(0,0,0);
        for (int s = 0; s < samples_per_pixel; ++s) {
            auto u = (i + random_double()) / (image_width-1);
            auto v = (j + random_double()) / (image_height-1);
            ray r = cam.get_ray(u, v);
            pixel_color += ray_color(r, max_depth);
        }
        image[j][i] = pixel_color * pixel_samples_scale;
    }
}
This gave a speedup of only 2.34x, which is again lower than I'd expect with 8 cores.
I have been compiling with these C++ flags: -O3 -Wall -std=c++17 -m64 -I. -fopenmp. All header files are protected with #ifndef include guards, so there's no need for #pragma once at the top. I've also experimented with schedule(dynamic) (which seems reasonable for ray tracing, since rows vary in cost), but that only made the speedup lower.
More context around this question can be found in the same GitHub discussion, which was created recently. I believe a better speedup should be easy to achieve; I'm just not sure why my pragmas don't provide it.
Thanks for any input and let me know if I can provide more details.
Regarding omp_set_num_threads: the OMP_NUM_THREADS environment variable provides the same functionality without the need to recompile whenever you want to change the thread count. Given the significant work done in each iteration of the outer loop, the overhead of schedule(dynamic) is easily amortized and helps balance the work between the threads.