Step 1 – Let’s just get this shit working
Raytracing itself is theoretically a simple thing. You generate a ray, throw it into the screen, and get some output color. Stripping away the need to handle things like camera settings, FOV, etc. removes a lot of the complexity, so the basic algorithm for generating an image comes down to a few nested loops.
// For each pixel in x/y
for (uint32_t j = 0; j < YSize; ++j)
{
	for (uint32_t i = 0; i < XSize; ++i)
	{
		Color PixelColor = Color();
		//  Run a trace for each sample, then average the summed color
		for (uint32_t s = 0; s < SampleCount; ++s)
		{
			// Randomize the vector a bit with a random +/- 0.5f generator
			float U = float(i + distribution(generator)) / XSizef;
			float V = float(j + distribution(generator)) / YSizef;
			Ray TraceRay(Origin, BottomLeft + U * HorizSize + V * VertSize);
			PixelColor += GetColorForRay(Shapes, TraceRay);
		}
		PixelColor /= float(SampleCount);
	}
}
Running this gets me the sample image above with a general runtime around 2800ms. It’s functional but clearly wouldn’t scale well to an increase in resolution or sample count, so it was time to start looking at getting it hooked up to worker threads.
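One way to take wall-clock timings like the one above is with std::chrono::steady_clock. This is a minimal sketch, not the project's actual timing code; TimeMs is a hypothetical helper name, and the callable passed in stands for whatever render loop is being measured.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the original project): runs Work once and
// returns the elapsed wall-clock time in whole milliseconds.
template <typename Func>
int64_t TimeMs(Func&& Work)
{
	// steady_clock is monotonic, so the difference is never negative even if
	// the system clock gets adjusted mid-run.
	const auto Start = std::chrono::steady_clock::now();
	Work();
	const auto End = std::chrono::steady_clock::now();
	return std::chrono::duration_cast<std::chrono::milliseconds>(End - Start).count();
}
```

Wrapping the whole pixel loop in a lambda and passing it to TimeMs gives a single number per run, which is enough to compare passes against each other.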
For my first threading pass, I figured I’d just throw something in there to get it working and go from there. I wasn’t expecting much, but figured I’d at least see a bit of improvement as I increased the thread count. For this first pass, I also ran the test in a loop with different thread counts just to see how it scaled. The naive pass spawns a worker thread for each pixel and runs the anti-aliasing samples within that worker thread. After spawning a batch of up to ThreadCount workers, it waits on all of them to complete before starting the next batch. It gives me something like this:
uint32_t MaxThreadCount = std::thread::hardware_concurrency();
{
	for (uint32_t ThreadCount = 4; ThreadCount <= MaxThreadCount; ThreadCount += 4)
	{
		for (uint32_t j = 0; j < YSize; ++j)
		{
			for (uint32_t i = 0; i < XSize; i += ThreadCount)
			{
				std::vector<std::thread> Threads;
				for (uint32_t t = 0; t < ThreadCount && (i + t) < XSize; ++t)
				{
					Threads.push_back(std::thread(&SphereTest_NaiveThreading::BatchThread, this, i + t, j, SampleCount, XSizef, YSizef, Shapes, OutputImage));
				}
				std::for_each(Threads.begin(), Threads.end(), std::mem_fn(&std::thread::join));
			}
		}
		std::cout << "Sphere - Total Time Taken (ms):" << diff << " for thread count:" << ThreadCount << std::endl;
	}
}
void SphereTest_NaiveThreading::BatchThread(uint32_t i, uint32_t j, uint32_t SampleCount, float XSizef, float YSizef, std::vector<Shape*> Shapes, class Image* OutputImage)
{
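	// Note: a default-constructed engine always starts from the same seed,
	// so every worker (and therefore every pixel) sees the same jitter sequence.
	std::default_random_engine generator;
	std::uniform_real_distribution<float> distribution(-0.5f, 0.5f);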
	Color PixelColor = Color();
	for (uint32_t s = 0; s < SampleCount; ++s)
	{
		float U = float(i + distribution(generator)) / XSizef;
		float V = float(j + distribution(generator)) / YSizef;
		Ray TraceRay(Origin, BottomLeft + U * HorizSize + V * VertSize);
		PixelColor += GetColorForRay(Shapes, TraceRay);
	}
	PixelColor /= float(SampleCount);
}
Slap it together, add some timing checks to it, and run it. The images look fine, so I take a look at the timings, and this was the result:
Sphere - Total Time Taken (ms):24010 for thread count:4
Sphere - Total Time Taken (ms):21393 for thread count:8
Sphere - Total Time Taken (ms):20014 for thread count:12
Sphere - Total Time Taken (ms):19894 for thread count:16
Sphere - Total Time Taken (ms):19437 for thread count:20
Sphere - Total Time Taken (ms):19452 for thread count:24
Sphere - Total Time Taken (ms):19105 for thread count:28
Sphere - Total Time Taken (ms):19051 for thread count:32
Oh.
Admittedly I expected that the cost of spinning up the threads was going to be non-trivial, but I wasn’t expecting that. The naive threading at 32 threads was almost 7x slower than the unthreaded pass (19051ms against roughly 2800ms). Taking a look at even the basic performance metrics coming from Visual Studio, the batch thread function accounted for less than 10% of the total execution time of the program for the 32-thread test; the rest went to everything around it. That’s pretty fucking bad. So, there are a few obvious problems for me to look at, and it gives me somewhere to start:
- My unit of work for the threads is too small. That needs to be larger. For now that can just be brute force making it larger. In future tests with reflection/refraction, this would want to be a bit smarter since each work unit wouldn’t have a fixed ray count, but it gives me somewhere to start.
- Doing the join on the whole worker thread batch could introduce thread waits in spots where the work size of a thread is unknown, since every thread in the batch has to wait for the slowest one. In the future, I want to change this to some sort of message pump where completed threads can trigger a new worker thread so we always have close to max threads working on casts. However, with the work size being fixed right now, this is a lower priority for investigation. I am curious about how much of a gap there is on the joins, and this is where having a working copy of AMD uProf would help me out a lot, but it’s also not a current focus so I’m not worried about it for now.
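A rough sketch of where both of those points could lead: make a whole row the unit of work, and have a fixed set of long-lived threads pull the next row index from a shared atomic counter, so nobody waits on a batch join. This is an assumption about one possible direction, not the implementation used later in this series. RenderRows, PixelFunc, and ShadePixel are hypothetical names, and ShadePixel stands in for the per-pixel sample loop (it has to be safe to call from several threads at once).

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-pixel sample loop; returns one value per pixel.
using PixelFunc = std::function<float(uint32_t X, uint32_t Y)>;

// Renders XSize * YSize pixels with ThreadCount long-lived workers.
// Each worker claims whole rows until none are left.
std::vector<float> RenderRows(uint32_t XSize, uint32_t YSize, uint32_t ThreadCount, const PixelFunc& ShadePixel)
{
	std::vector<float> Output(std::size_t(XSize) * YSize, 0.0f);
	std::atomic<uint32_t> NextRow{0};

	auto Worker = [&]()
	{
		for (;;)
		{
			// Claim the next unprocessed row; each index is handed out exactly once.
			const uint32_t j = NextRow.fetch_add(1, std::memory_order_relaxed);
			if (j >= YSize)
			{
				break;
			}
			// Rows never overlap, so no locking is needed on Output.
			for (uint32_t i = 0; i < XSize; ++i)
			{
				Output[std::size_t(j) * XSize + i] = ShadePixel(i, j);
			}
		}
	};

	// Always run at least one worker, even if the caller passes 0.
	if (ThreadCount == 0)
	{
		ThreadCount = 1;
	}
	std::vector<std::thread> Threads;
	Threads.reserve(ThreadCount);
	for (uint32_t t = 0; t < ThreadCount; ++t)
	{
		Threads.emplace_back(Worker);
	}
	// Joining also makes every worker's writes visible to the caller.
	for (std::thread& T : Threads)
	{
		T.join();
	}
	return Output;
}
```

Threads are created once per image instead of once per pixel, and a thread that finishes a cheap row immediately grabs another one. Note that std::thread::hardware_concurrency() is allowed to return 0 when the value can't be determined, which is why the sketch clamps the thread count to at least 1.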