This is a follow-up to yesterday’s ramblings about my initial work screwing around with std::thread.
Talking on Twitter, Quintosh asked why I didn’t go with some sort of thread pool. Frankly, it’s because I’m lazy. However, it did give me an idea. The data that I’m working on is fairly parallelizable without memory collisions. Rather than creating a thread pool where I hand work to threads, it would be a lot less work for a simple program to let the threads directly request work themselves (there’s a sketch of the idea right after this list). This gave me a few things:
A unit of work could be much smaller, so cores running slower because of other system processes don’t become bottlenecks.
Once I implement other things that scale the program non-linearly (ex: reflection/refraction), I don’t have to worry about intelligently breaking up the work into equal sizes.
The only member accessed across many threads is the piece of data controlling the work request. This keeps my actual locks to a minimum.
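Here’s a minimal sketch of that idea (hypothetical names, not my actual code): each worker thread claims the next unit of work from a shared counter instead of waiting to be handed a task. I’m using a lock-free std::atomic for the counter here, but a mutex-guarded counter behaves the same way.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for resolving all the rays for one pixel.
void TracePixel(int PixelIndex) { /* cast rays, write the result */ }

void TraceImage(int TotalPixels)
{
    // The only state shared across threads: the next unit of work to claim.
    std::atomic<int> NextPixel{0};

    auto Worker = [&]()
    {
        // Each thread pulls the next available pixel until the work runs out.
        for (int Pixel = NextPixel++; Pixel < TotalPixels; Pixel = NextPixel++)
        {
            TracePixel(Pixel);
        }
    };

    // One worker per hardware thread; nobody hands out tasks.
    const unsigned int ThreadCount = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> Threads;
    for (unsigned int i = 0; i < ThreadCount; ++i)
    {
        Threads.emplace_back(Worker);
    }
    for (std::thread& Thread : Threads)
    {
        Thread.join();
    }
}
```

Because the unit of work is a single pixel, a core that gets stalled by other system processes simply claims fewer pixels rather than holding up a pre-assigned chunk.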
Much to my surprise, I also managed to get Intel VTune working completely. This helped confirm some of my assumptions from yesterday’s article, so I’ll cover that in some detail later on.
Being a primarily Unreal-focused developer, I don’t really spend that much time in standard C++. Ya, technically I work a lot in C++, but C++ using the STL and related things is very different from the custom containers and macro-heavy nature of working in Unreal. Part of the fallout of that is that I generally miss new features of the language for quite a while. I didn’t get into working in C++11 and newer until I was off of UE3, and even moving into UE4 I don’t get exposed to things like the STL or the standard implementation of threads. It’s one of those things where, ya, I’ve used threads and I get the concepts behind them, but creating a worker thread to offload a specific task is much different from architecting code to properly and efficiently support a threadable workload. That’s where my screwing around here comes into play.
For this screwing around, I decided to thread an implementation of a raytracer. It’s a workload that’s inherently parallelizable: you’ve got a bunch of rays going out that can independently resolve themselves. Ya, you may need a ray to spawn further rays, but that can live within the worker thread as it chews through the work. From a naive implementation standpoint, each pixel could be its own thread running in parallel, and that’s basically where I started.
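For contrast with the work-request approach above, a sketch of the naive fixed-split version (again hypothetical names, and an assumption about how the split is done; a sketch rather than my actual starting code): carve the image into equal contiguous chunks up front, one per thread. This is the version that falls apart when one core runs slow, because nobody else can pick up its pre-assigned chunk.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical stand-in for resolving all the rays for one pixel.
void TracePixel(int PixelIndex) { /* cast rays, write the result */ }

void TraceImageFixedChunks(int TotalPixels, int ThreadCount)
{
    // Naive split: equal-sized contiguous chunks decided ahead of time.
    const int ChunkSize = (TotalPixels + ThreadCount - 1) / ThreadCount;

    std::vector<std::thread> Threads;
    for (int t = 0; t < ThreadCount; ++t)
    {
        const int Begin = t * ChunkSize;
        const int End = std::min(Begin + ChunkSize, TotalPixels);
        // Each thread owns its chunk for the whole run, slow core or not.
        Threads.emplace_back([Begin, End]()
        {
            for (int Pixel = Begin; Pixel < End; ++Pixel)
            {
                TracePixel(Pixel);
            }
        });
    }
    for (std::thread& Thread : Threads)
    {
        Thread.join();
    }
}
```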
Some notes:
For the purposes of this, I started with a sample implementation from Ray Tracing in One Weekend by Peter Shirley. This series of books is a supremely fantastic quick look at the basic concepts behind ray tracing, and it gave me a quick path to a point where I could investigate threading.
For my CPU, I’m running this on an AMD 3950x (16 core, 32 thread) at stock speeds. I’m not doing anything to minimize background processes, but it shouldn’t be a huge issue for where I’m at.
I’m currently using Visual Studio 2019’s built-in performance profiler. I don’t particularly like it compared to other tools, but my profiler of choice on my current hardware (AMD uProf) currently has a bug on some installs of the May 2020 version of Windows 10 that prevents profile captures. The VS profiler is basic, but it gives me enough information for the basics I’m starting with.
This is running in release with default optimizations purely out of laziness.
I’ll post some code samples throughout. These won’t generally compile because I’m stripping out unnecessary stuff (ex: you don’t need to care about me setting image dimensions and writing the result to disk).
For the purposes of my current testing, this is the output image. It’s two spheres where each pixel’s color represents the surface normal hit by a ray. The image is 800×400, and each pixel fires 100 slightly randomized rays to give an anti-aliased result. In the current basic pass, I’m not doing any bounced rays on collisions, so the final image is the result of 800 × 400 × 100 = 32 million ray casts. In some future tests, I’ll adapt the rest of the book to the multithreaded version, adding reflection/refraction support and increasing the workload through that process.
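As a rough illustration of what each of those pixels is doing (illustrative names loosely following the book’s structure, with std::rand standing in for a real RNG; a sketch, not my actual code):

```cpp
#include <cstdlib>

// Simple RGB triple standing in for the book's vec3.
struct Color { float R, G, B; };

// Hypothetical stand-in: cast one ray through normalized image coordinates
// (U, V) and return the hit surface normal mapped to a color.
Color CastRay(float U, float V) { return Color{U, V, 0.5f}; }

Color ShadePixel(int X, int Y, int Width, int Height, int SamplesPerPixel)
{
    Color Accum{0.f, 0.f, 0.f};
    for (int Sample = 0; Sample < SamplesPerPixel; ++Sample)
    {
        // Jitter each ray slightly within the pixel for anti-aliasing.
        // (A real threaded version wants a per-thread RNG, not rand().)
        const float U = (X + std::rand() / float(RAND_MAX)) / Width;
        const float V = (Y + std::rand() / float(RAND_MAX)) / Height;
        const Color Sampled = CastRay(U, V);
        Accum.R += Sampled.R;
        Accum.G += Sampled.G;
        Accum.B += Sampled.B;
    }
    // Average the samples; at 800x400 with 100 samples per pixel, this
    // loop is where the 32 million casts come from.
    Accum.R /= SamplesPerPixel;
    Accum.G /= SamplesPerPixel;
    Accum.B /= SamplesPerPixel;
    return Accum;
}
```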
I worked on Maneater from approximately September 2018 to November 2019, when I left Tripwire Interactive. A lot of what I focused on during my time on Maneater was CPU-based performance gains on the UE4 game thread, so I wasn’t generally involved in the other typical performance bottlenecks (particle cost, GPU time, etc.), but there was enough work on the game thread to more than fill my time while I was on the project. Among other things, two of the systems I was most heavily involved in were a refactor of the spawn system and the handling of runtime performance optimizations. While I can’t really speak to what performance looks like in the shipping game, I can speak to some of the choices I made along the way, and how those two systems work in tandem to achieve performance gains.
Where the hell do I start?
That really was the first decision to be made. The game I started working on when brought in-house to Tripwire was much different from what Maneater became, but at its core the big problem was that the original game was made for high-end PCs. It had a ton of classic spawn points, spawned everything pretty much at once, and didn’t do any performance reduction on the actors in the world. I knew one of our goals at the time was to see if we could get it running on Switch, and I absolutely knew it needed to launch on Xbox One and PS4, all three of which are distinctly not high-end PCs. With that in mind, I had at least a rough approach planned out:
Do all my optimization work on Switch. Any frame time gained there would apply to the other platforms. If I had the game thread at ~26-28ms on Switch, the other platforms sure as hell would run fine.
Fix spawning first. I needed to control how many things were in the scene at one time as a first order of business.
This also came with refactoring the design tools. Hand placing spawn points didn’t scale well for level designers, and at runtime iterating over hundreds of hand-placed actors didn’t make sense. This needed to be replaced with something else.
With spawning under greater control, start working on more granular framerate improvements:
Once I could run at a stable framerate with spawning under control, start optimizing the runtime aspects of the game.
As the framerate improved, start spawning more things to fill the now free frame time to give higher AI density in the world.
Rinse and repeat.
This formed the basis of my plan to get the game running well. While it wasn’t the only set of systems I was involved in, it was definitely the one I spent the most time on throughout my time on the project.