I've been following the tutorial series "Ray Tracing in One Weekend", which seems to be the canonical introduction to ray tracing.
I've been trying to speed up the code with OpenMP, but the results have been disappointing, and judging from this GitHub discussion, others have achieved much better speedups.
Here's what some of the users in that discussion report:

- "It's also worth noting that the changes are very local: write into a buffer instead of the std::cout line; a new method to write the buffer to a file; a single OpenMP line above the for loop over the rows of the output image."
- "I was able to achieve significant speedup (the need for which became painfully apparent in the final scene of the second book) by using #pragma omp parallel for before the multi-sample loop."
- "With OpenMP, you can achieve this with two lines of OpenMP annotations and not a single change to the code itself."
Using these strategies, I was unable to get any significant speedup.
I made a copy of the latest version of the GitHub repo (v4.0.1) and worked on the "In One Weekend" section.
I added the annotation #pragma omp parallel for reduction(+:pixel_color) above the sample loop (whose header is for (int sample = 0; sample < samples_per_pixel; sample++)), along with #pragma omp declare reduction(+ : vec3 : omp_out += omp_in) initializer(omp_priv(0,0,0)) to define the reduction over vec3.
I timed only how long it took to complete cam.render(world), using std::chrono::steady_clock. Rendering the default scene, this gave a speedup of only 1.74x, which feels suspiciously low given that I'm using 8 cores (verified with omp_get_num_procs()).
I then reverted to commit 2e5cc2e (for no other reason than that it was released on Dec 9, 2020, the same day another GitHub user posted about achieving significant speedup with OpenMP). I modified the code to write into a 2D vector image instead of streaming to stdout, and to write image to a file afterwards. I timed only how long it took to populate the colors in image. The modified code in scene.h looks like this:
std::vector<std::vector<color>> image(image_height, std::vector<color>(image_width));

omp_set_num_threads(8);
#pragma omp parallel for
for (int j = image_height-1; j >= 0; --j) {
    for (int i = 0; i < image_width; ++i) {
        color pixel_color(0,0,0);
        for (int s = 0; s < samples_per_pixel; ++s) {
            auto u = (i + random_double()) / (image_width-1);
            auto v = (j + random_double()) / (image_height-1);
            ray r = cam.get_ray(u, v);
            pixel_color += ray_color(r, max_depth);
        }
        image[j][i] = pixel_color * pixel_samples_scale;
    }
}
This gave a speedup of only 2.34x, which is again lower than I'd expect with 8 cores.
I have been compiling with these C++ flags: -O3 -Wall -std=c++17 -m64 -I. -fopenmp. All header files are protected with #ifndef include guards, so there's no need for #pragma once at the top. I've also experimented with schedule(dynamic) (which seems reasonable for ray tracing, since rows vary in cost), but that only made the speedup lower.
More context around this question can be found in the same GitHub discussion, which was created recently. I believe a better speedup should be easy to achieve; I'm just not sure why my pragmas don't provide it.
Thanks for any input and let me know if I can provide more details.
Regarding omp_set_num_threads: the OMP_NUM_THREADS environment variable provides the same functionality without the need to recompile whenever you want to change the thread count. Given the significant work done in each iteration of the outer loop, the overhead of schedule(dynamic) is easily amortized and helps balance the work between the threads.