Programmer Ramblings – Continued Screwing with std::thread

Making the threads work for me

So for this test, I basically went with this setup (there’s a quick sketch of the pattern right after the list):

  • Create N threads at startup that live for the lifetime of the test run.
  • Each thread requests a new unit of work as soon as it finishes its current one, until the program reports there’s nothing left, at which point the thread shuts down.
  • Once all of the threads have shut down, the test can exit.
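
Boiled down, the whole setup is just a shared counter that every worker pulls from until it runs dry. Here’s a minimal, self-contained sketch of that shape - to be clear, RowCount, ProcessRow, RequestRow and friends are placeholder names for this sketch only, not names from the actual test code further down:

#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Placeholder work state for the sketch - in the real test these live on the test class
static const uint32_t RowCount = 256;
static uint32_t NextRow = 0;
static std::mutex RowCounterMutex;

static void ProcessRow(uint32_t Row)
{
	// The real work unit would trace rays for every pixel in this row
	(void)Row;
}

// Hands out the next row index, or returns false once every row has been claimed
static bool RequestRow(uint32_t& Row)
{
	std::lock_guard<std::mutex> Lock(RowCounterMutex);
	if (NextRow >= RowCount)
	{
		return false;
	}
	Row = NextRow++;
	return true;
}

static void WorkerThread()
{
	uint32_t Row = 0;
	while (RequestRow(Row))
	{
		ProcessRow(Row);
	}
}

int main()
{
	uint32_t ThreadCount = std::thread::hardware_concurrency();
	if (ThreadCount == 0)
	{
		ThreadCount = 4;	// hardware_concurrency is allowed to return 0, so fall back to something sane
	}

	std::vector<std::thread> Threads;
	for (uint32_t t = 0; t < ThreadCount; ++t)
	{
		Threads.emplace_back(WorkerThread);
	}
	for (std::thread& Worker : Threads)
	{
		Worker.join();	// Only exit once every worker has shut down
	}
	return 0;
}

The real test code further down does exactly this, just with the counter and the raytracing work folded into the test class.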

For a quick starting reference, here are the timings of the last iteration that ended yesterday’s ramblings and the timings of this new iteration side by side:

Sphere Threading Iteration 2 - Total Time Taken (ms):804 for thread count:4
Sphere Threading Iteration 2 - Total Time Taken (ms):433 for thread count:8
Sphere Threading Iteration 2 - Total Time Taken (ms):298 for thread count:12
Sphere Threading Iteration 2 - Total Time Taken (ms):237 for thread count:16
Sphere Threading Iteration 2 - Total Time Taken (ms):192 for thread count:20
Sphere Threading Iteration 2 - Total Time Taken (ms):166 for thread count:24
Sphere Threading Iteration 2 - Total Time Taken (ms):154 for thread count:28
Sphere Threading Iteration 2 - Total Time Taken (ms):145 for thread count:32

Sphere Threading Iteration 3 - Total Time Taken (ms):717 for thread count:4
Sphere Threading Iteration 3 - Total Time Taken (ms):372 for thread count:8
Sphere Threading Iteration 3 - Total Time Taken (ms):265 for thread count:12
Sphere Threading Iteration 3 - Total Time Taken (ms):210 for thread count:16
Sphere Threading Iteration 3 - Total Time Taken (ms):173 for thread count:20
Sphere Threading Iteration 3 - Total Time Taken (ms):150 for thread count:24
Sphere Threading Iteration 3 - Total Time Taken (ms):136 for thread count:28
Sphere Threading Iteration 3 - Total Time Taken (ms):127 for thread count:32

The good news is that this scaled in the same near-linear fashion as before and actually came in faster than iteration 2 at every thread count. For this specific test, each thread requests a unit of work that is one complete row of the output image at a time, which matches the first iteration I messed around with yesterday. In that first iteration, going to 32 threads showed a distinct drop in performance, and I speculated it was down to the cost of the join wait plus variation in when the threads finished. Now that I have VTune working, I’ve confirmed that, and also confirmed why this 3rd iteration works better in practice than the 2nd iteration’s fixed-percentage work units. I’ll cover that in a bit more detail on the next page.

Here’s the unit test I wrote to test out this idea:

void SphereTest_ThreadIter3::DoTest()
{
	uint32_t MaxThreadCount = std::thread::hardware_concurrency();
	{
		// Do multiple loops to test performance at different thread counts
		for (uint32_t ThreadCount = 4; ThreadCount <= MaxThreadCount; ThreadCount += 4)
		{
			CurrentRow = 0;
			// Create the image - I stripped out the code that writes out the image, deletes the pointer, etc. to keep this listing clean
			if (Image* OutputImage = new Image(FileName.str()))
			{
				// Create the threads and set them loose.  Wait for them after the loop until they all complete
				std::vector<std::thread> Threads;
				for (uint32_t t = 0; t < ThreadCount; ++t)
				{
					Threads.push_back(std::thread(&SphereTest_ThreadIter3::BatchThread, this, OutputImage));
				}
				std::for_each(Threads.begin(), Threads.end(), std::mem_fn(&std::thread::join));
			}
		}
	}
}

void SphereTest_ThreadIter3::BatchThread(class Image* OutputImage)
{
	// Each thread gets its own generator and distribution, so there's no shared RNG state to worry about
	// (though they all start from the same default seed)
	std::default_random_engine generator;
	std::uniform_real_distribution<float> distribution(-0.5f, 0.5f);

	uint32_t Row = 0;
	// In this test, a work unit is a full row of the output image.  The class data keeps track of which row
	// should be the next request, hands it up to the thread, then the data is crunched.  The rest is a standard
	// raytracing pass, spitting out n rays per-pixel of the row to get color data back.
	while (RequestWork(Row))
	{
		for (uint32_t i = 0; i < XSize; ++i)
		{
			Color PixelColor = Color();
			for (uint32_t s = 0; s < SampleCount; ++s)
			{
				float U = float(i + distribution(generator)) / XSizef;
				float V = float(Row + distribution(generator)) / YSizef;

				Ray TraceRay(Origin, BottomLeft + U * HorizSize + V * VertSize);
				PixelColor += GetColorForRay(Shapes, TraceRay);
			}
			PixelColor /= float(SampleCount);

			OutputImage->SetPixel(i, Row, PixelColor);
		}
	}		
}

// Simple function to hand out the next work unit.  RowMutex is a std::mutex class member (declared alongside
// CurrentRow in the test class), so every thread locks the same mutex - a mutex created locally in here would
// protect nothing.  The bounds check sits under the lock as well, so two threads can't both grab the last row.
// If this returns false, the while loop in the worker thread function exits and the thread completes.
bool SphereTest_ThreadIter3::RequestWork(uint32_t& Row)
{
	std::lock_guard<std::mutex> Lock(RowMutex);

	if (CurrentRow >= YSize)
	{
		return false;
	}

	Row = CurrentRow;
	CurrentRow++;

	return true;
}

Being perfectly honest, the mutex in the work request function is about as cheap as synchronization gets here, since each thread only touches it once per row; at most it’s costing a few ms of total execution time. Some form of synchronization does need to stay, though, because every thread is bumping CurrentRow - the lighter-weight option would be to make the counter a std::atomic. The important thing here is that this pattern will easily adjust to non-fixed-size work units once bounce rays start making each row’s total ray count an unknown.
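
For reference, here’s a rough sketch of what that atomic version of RequestWork could look like. This is just a sketch under the assumption that CurrentRow gets changed to a std::atomic<uint32_t> member - it is not what the test above actually does:

#include <atomic>
#include <cstdint>

// Sketch only - assumes CurrentRow has been changed to a std::atomic<uint32_t> class member,
// which is NOT what the test above does.  YSize is the same image-height member as before.
bool SphereTest_ThreadIter3::RequestWork(uint32_t& Row)
{
	// fetch_add hands a unique index back to each caller with no lock involved.  The counter happily
	// runs past YSize at the end, but that's harmless - the extra values just make each thread
	// return false once and shut down.
	uint32_t ClaimedRow = CurrentRow.fetch_add(1);
	if (ClaimedRow >= YSize)
	{
		return false;
	}

	Row = ClaimedRow;
	return true;
}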