{"id":1072,"date":"2020-07-21T00:56:42","date_gmt":"2020-07-21T04:56:42","guid":{"rendered":"https:\/\/www.blog.dwgames.net\/?p=1072"},"modified":"2020-07-21T00:58:07","modified_gmt":"2020-07-21T04:58:07","slug":"programmer-ramblings-screwing-around-with-stdthread","status":"publish","type":"post","link":"https:\/\/www.blog.dwgames.net\/?p=1072","title":{"rendered":"Programmer Ramblings &#8211; Screwing Around With std::thread"},"content":{"rendered":"\n<p>Being a primarily Unreal-focused developer, I don&#8217;t really spend that much time in standard C++.  Ya, technically I work a lot in C++, but C++ using STL and related things is <em>very<\/em> different from the custom containers and macro-heavy nature of working in Unreal.  Part of the fallout of that is that I generally miss new features of the language for quite a while.  I didn&#8217;t get into working in C++11 and newer until I was off of working in UE3, and even moving into UE4 I don&#8217;t get exposed to things like STL or the the standard implementation of threads.  It&#8217;s one of those things where, ya I&#8217;ve used threads and I get the concepts behind it, but creating a worker thread to offload a specific task is much different than architecting code to properly and efficiently support a threadable workload.  That&#8217;s where my screwing around here comes into play.<\/p>\n\n\n\n<p>For this screwing around, I decided to thread an implementation of a raytracer.  It&#8217;s a workload that is inherently parallelizable.  You&#8217;ve got a bunch of rays going out that can independently resolve themselves.  Ya, you may need to have a ray spawn further rays, but that can live within the worker thread as it chews through the work.  From a naive implementation standpoint, each pixel could be its own thread and run in parallel, and that&#8217;s basically where I started.<\/p>\n\n\n\n<p>Some notes:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>For the purposes of this, I started with a sample implementation from <em><a href=\"https:\/\/smile.amazon.com\/Ray-Tracing-Weekend-Minibooks-Book-ebook\/dp\/B01B5AODD8\/ref=sr_1_1?dchild=1&amp;keywords=ray+tracing+in+a+weekend&amp;qid=1595296096&amp;sr=8-1\">Ray Tracing in One Weekend<\/a><\/em> by Peter Shirley.  This series of books is a supremely fantastic quick look at basic concepts behind ray tracing, and gave me a quick place to get to a point where I could investigate threading.<\/li><li>For my CPU, I&#8217;m running this on an AMD 3950x (16 core, 32 thread) at stock speeds.  I&#8217;m not doing anything to minimize background processes, but it shouldn&#8217;t be a huge issue for where I&#8217;m at.<\/li><li>I&#8217;m currently using Visual Studio 2019&#8217;s built-in performance profiler.  I don&#8217;t particularly like it compared to other tools, but my profiler of choice on my current hardware (<em><a href=\"https:\/\/developer.amd.com\/amd-uprof\/\">AMD uProf<\/a><\/em>) currently has a bug on some installs of the May 2020 version of Windows 10 that prevents profile captures.   The VS profiler is basic, but gets me enough information for the basics that I&#8217;m starting at.<\/li><li>This is running in release with default optimizations purely out of laziness.<\/li><li>I&#8217;ll post some code samples around.  These won&#8217;t generally compile because I&#8217;m stripping out unnecessary stuff from the samples (ex: you don&#8217;t need to care about me setting image dimensions and writing it to disk).<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"400\" src=\"https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/SphereTestOutput.png\" alt=\"\" class=\"wp-image-1073\" srcset=\"https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/SphereTestOutput.png 800w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/SphereTestOutput-300x150.png 300w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/SphereTestOutput-768x384.png 768w\" sizes=\"auto, (max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><\/figure>\n\n\n\n<p>For the purposes of my current testing, this is the output image.  It&#8217;s two spheres where each pixel color represents the surface normal hit by a ray.  The image is 800&#215;400 resolution and each pixel does 100 slightly randomized rays to give an anti-aliased result.  In the basic current pass, I&#8217;m not doing any bounced rays on collisions.  The final image is therefore the result of 32 million ray casts.  In some future tests, I&#8217;ll be adapting the rest of the book to the multithreaded version and support reflection\/refraction and increasing the workload through that process.<\/p>\n\n\n\n<!--nextpage-->\n\n\n\n<h2 class=\"wp-block-heading\">Step 1 &#8211; Let&#8217;s just get this shit working<\/h2>\n\n\n\n<p>Raytracing itself is theoretically a simple thing.  You generate a ray, throw it into the screen, and get some output color.  Stripping away the need to handle things like camera settings, FOV, etc reduces a lot of the complexity for generating a basic image.  The basic algorithm for generating that image is pretty simple.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\/\/ For each pixel in x\/y\nfor (uint32_t j = 0; j &lt; YSize; ++j)\n{\n\tfor (uint32_t i = 0; i &lt; XSize; ++i)\n\t{\n\t\tColor PixelColor = Color();\n\n\t\t\/\/  Run a trace for each sample, then average the summed color\n\t\tfor (uint32_t s = 0; s &lt; SampleCount; ++s)\n\t\t{\n\t\t\t\/\/ Randomize the vector a bit with a random +\/- 0.5f generator\n\t\t\tfloat U = float(i + distribution(generator)) \/ XSizef;\n\t\t\tfloat V = float(j + distribution(generator)) \/ YSizef;\n\n\t\t\tRay TraceRay(Origin, BottomLeft + U * HorizSize + V * VertSize);\n\t\t\tPixelColor += GetColorForRay(Shapes, TraceRay);\n\t\t}\n\t\tPixelColor \/= float(SampleCount);\n\t}\n}<\/pre>\n\n\n\n<p>Running this gets me the sample image above with a general runtime around 2800ms.  It&#8217;s functional but clearly wouldn&#8217;t scale well to an increase in resolution or sample count, so it was time to start looking at getting it hooked up to worker threads.<\/p>\n\n\n\n<p>For my first threading pass, I figured I&#8217;d just throw something in there to get it working and go from there.  I wasn&#8217;t expecting much improvement, but figured I&#8217;d at least see a bit of improvement as I increased thread count.  For the purposes of this first pass test, I also run the test in a loop with different thread counts just to see how it scales.  For the naive pass, I have it spawning a worker thread for each pixel, and running the anti-aliasing samples within the worker thread.  After it spawns the workers, it waits on all of them to complete before starting the next batch.  It gives me something like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">uint32_t MaxThreadCount = std::thread::hardware_concurrency();\n{\n\tfor (uint32_t ThreadCount = 4; ThreadCount &lt;= MaxThreadCount; ThreadCount += 4)\n\t{\n\t\tfor (uint32_t j = 0; j &lt; YSize; ++j)\n\t\t{\n\t\t\tfor (uint32_t i = 0; i &lt; XSize; i += ThreadCount)\n\t\t\t{\n\t\t\t\tstd::vector&lt;std::thread> Threads;\n\t\t\t\tfor (uint32_t t = 0; t &lt; ThreadCount &amp;&amp; (i + t) &lt; XSize; ++t)\n\t\t\t\t{\n\t\t\t\t\tThreads.push_back(std::thread(&amp;SphereTest_NaiveThreading::BatchThread, this, i + t, j, SampleCount, XSizef, YSizef, Shapes, OutputImage));\n\t\t\t\t}\n\t\t\t\tstd::for_each(Threads.begin(), Threads.end(), std::mem_fn(&amp;std::thread::join));\n\t\t\t}\n\t\t}\n\t\t\tstd::cout &lt;&lt; \"Sphere - Total Time Taken (ms):\" &lt;&lt; diff &lt;&lt; \" for thread count:\" &lt;&lt; ThreadCount &lt;&lt; std::endl;\n\t}\n}\t<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">void SphereTest_NaiveThreading::BatchThread(uint32_t i, uint32_t j, uint32_t SampleCount, float XSizef, float YSizef, std::vector&lt;Shape*> Shapes, class Image* OutputImage)\n{\n\tstd::default_random_engine generator;\n\tstd::uniform_real_distribution&lt;float> distribution(-0.5f, 0.5f);\n\n\tColor PixelColor = Color();\n\tfor (uint32_t s = 0; s &lt; SampleCount; ++s)\n\t{\n\t\tfloat U = float(i + distribution(generator)) \/ XSizef;\n\t\tfloat V = float(j + distribution(generator)) \/ YSizef;\n\n\t\tRay TraceRay(Origin, BottomLeft + U * HorizSize + V * VertSize);\n\t\tPixelColor += GetColorForRay(Shapes, TraceRay);\n\t}\n\tPixelColor \/= float(SampleCount);\n}<\/pre>\n\n\n\n<p>Slap it together, add some timing checks to it, and run it.  The images look fine, so I take a look at the timings, and this was the result:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Sphere - Total Time Taken (ms):24010 for thread count:4\nSphere - Total Time Taken (ms):21393 for thread count:8\nSphere - Total Time Taken (ms):20014 for thread count:12\nSphere - Total Time Taken (ms):19894 for thread count:16\nSphere - Total Time Taken (ms):19437 for thread count:20\nSphere - Total Time Taken (ms):19452 for thread count:24\nSphere - Total Time Taken (ms):19105 for thread count:28\nSphere - Total Time Taken (ms):19051 for thread count:32<\/pre>\n\n\n\n<p>Oh.<\/p>\n\n\n\n<p>Admittedly I expected that the cost of spinning up the threads was going to be non-trivial, but I wasn&#8217;t expecting <em>that<\/em>.  The naive threading at 32 threads was almost 7x <em>slower<\/em> than the unthreaded pass.  Taking a look at even the basic performance metrics coming from Visual Studio, the batch thread was less than 10% of the total execution time of the program for the 32-thread test.  That&#8217;s pretty fucking bad.  So, there&#8217;s a few obvious problems for me to look at, and it gives me somewhere to start:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>My unit of work for the threads is too small.  That needs to be larger.  For now that can just be brute force making it larger.  In future tests with reflection\/refraction, this would want to be a bit smarter since each work unit wouldn&#8217;t have a fixed ray count, but it gives me somewhere to start.<\/li><li>Doing the join on the worker thread batch could present thread waits in spots where the work size of the thread is an unknown.  In the future, I want to change this to some sort of message pump where completed threads can trigger a new worker thread so we always have close to max threads working on casts.  However, with the work size being fixed right now, this is a lower priority for investigation.  I <em>am<\/em> curious about how much of a gap there is on the joins, and this is where having a working copy of AMD uProf would help me out a lot, but it&#8217;s also not a current focus so I&#8217;m not worried about it for now.<\/li><\/ul>\n\n\n\n<!--nextpage-->\n\n\n\n<h2 class=\"wp-block-heading\">Step 2 &#8211; Increasing the size of a work unit<\/h2>\n\n\n\n<p>Luckily, this step is pretty easy.  Pump more shit into a thread, let it run, see what happens.  It only took a few minutes to make sure I was barking up the correct tree here.  My first-pass on increasing the work size was simply to make each thread run a <em>full row<\/em> on the output image, rather than a single pixel.  Now each thread would run about 80,000 rays instead of 100.  Simple change, large result.  The code ends up being like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">uint32_t MaxThreadCount = std::thread::hardware_concurrency();\n{\n\tfor (uint32_t ThreadCount = 4; ThreadCount &lt;= MaxThreadCount; ThreadCount += 4)\n\t{\n\t\tfor (uint32_t j = 0; j &lt; YSize; j += ThreadCount)\n\t\t{\n\t\t\tstd::vector&lt;std::thread> Threads;\n\t\t\tfor (uint32_t t = 0; t &lt; ThreadCount &amp;&amp; (j + t) &lt; YSize; ++t)\n\t\t\t{\n\t\t\t\tThreads.push_back(std::thread(&amp;SphereTest_ThreadIter1::BatchThread, this, j + t, SampleCount, XSize, XSizef, YSizef, Shapes, OutputImage));\n\t\t\t}\n\t\t\tstd::for_each(Threads.begin(), Threads.end(), std::mem_fn(&amp;std::thread::join));\n\t\t}\n\t}\n}<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">void SphereTest_ThreadIter1::BatchThread(uint32_t j, uint32_t SampleCount, uint32_t XSize, float XSizef, float YSizef, std::vector&lt;Shape*> Shapes, class Image* OutputImage)\n{\n\tstd::default_random_engine generator;\n\tstd::uniform_real_distribution&lt;float> distribution(-0.5f, 0.5f);\n\n\tfor (uint32_t i = 0; i &lt; XSize; ++i)\n\t{\n\t\tColor PixelColor = Color();\n\t\tfor (uint32_t s = 0; s &lt; SampleCount; ++s)\n\t\t{\n\t\t\tfloat U = float(i + distribution(generator)) \/ XSizef;\n\t\t\tfloat V = float(j + distribution(generator)) \/ YSizef;\n\n\t\t\tRay TraceRay(Origin, BottomLeft + U * HorizSize + V * VertSize);\n\t\t\tPixelColor += GetColorForRay(Shapes, TraceRay);\n\t\t}\n\t\tPixelColor \/= float(SampleCount);\n\t}\n}<\/pre>\n\n\n\n<p>This gives me the following timings:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Sphere Threading Iteration 1 - Total Time Taken (ms):829 for thread count:4\nSphere Threading Iteration 1 - Total Time Taken (ms):429 for thread count:8\nSphere Threading Iteration 1 - Total Time Taken (ms):323 for thread count:12\nSphere Threading Iteration 1 - Total Time Taken (ms):256 for thread count:16\nSphere Threading Iteration 1 - Total Time Taken (ms):211 for thread count:20\nSphere Threading Iteration 1 - Total Time Taken (ms):186 for thread count:24\nSphere Threading Iteration 1 - Total Time Taken (ms):174 for thread count:28\nSphere Threading Iteration 1 - Total Time Taken (ms):239 for thread count:32<\/pre>\n\n\n\n<p>Simple change, <strong>huge<\/strong> results, and a confirmation that I&#8217;m barking up the right tree.  In this case, the worker threads were now ~80% of the total execution time of the program.  It shows the improvements as thread count increases, but starts to peter out a bit as thread count increases.  The jump in cost at the highest thread count is probably presenting a point where the cost of the join and variations in thread completion is causing bottlenecks.<\/p>\n\n\n\n<p>Out of curiosity, I did another quick test where I split each worker thread into some percentage of the image, rather than per-row.  This gave me the following timings:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Sphere Threading Iteration 2 - Total Time Taken (ms):822 for thread count:4\nSphere Threading Iteration 2 - Total Time Taken (ms):432 for thread count:8\nSphere Threading Iteration 2 - Total Time Taken (ms):321 for thread count:12\nSphere Threading Iteration 2 - Total Time Taken (ms):232 for thread count:16\nSphere Threading Iteration 2 - Total Time Taken (ms):196 for thread count:20\nSphere Threading Iteration 2 - Total Time Taken (ms):168 for thread count:24\nSphere Threading Iteration 2 - Total Time Taken (ms):158 for thread count:28\nSphere Threading Iteration 2 - Total Time Taken (ms):146 for thread count:32<\/pre>\n\n\n\n<p>This one is a little more expected.  Since I&#8217;m only running N number of total threads for the life of execution, the timing reduction is linear.  This has none of the thread concurrency problems waiting on the worker thread batch join, so the cost of creating threads is eliminated.  This is especially apparent on the high thread count end.  If I were to adapt a message pump to the smaller full-row batch,  I suspect I&#8217;d see a similarly linear drop in the total execution time of the program with some additional overhead for spawning the threads.<\/p>\n\n\n\n<!--nextpage-->\n\n\n\n<h2 class=\"wp-block-heading\">Next Steps and Thoughts<\/h2>\n\n\n\n<p>For a quick first pass on all this, I&#8217;m pretty happy with the results.  At the very least I&#8217;ve gotten a quick look at setting up threads using the C++ standards, rather than the amalgamation of Boost-based nonsense or Unreal Engine-based threads I&#8217;ve used in the past.  I&#8217;ve gotten a rough look at some quick changes to drastically improve the cost of generating the image via some quick changes to the threading setup I was using.  All told, I&#8217;ve only spent about 6-8 hours on the total set of code in place so far, and that includes things like the core raytracing class framework, a rough unit testing framework for running these tests, then of course the individual threading tests in place.<\/p>\n\n\n\n<p>So, for the next steps:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>I need to get some sort of actual profiler working on my machine.  AMD is allegedly fixing uProf for Win 10 in version 3.3, but who knows when that&#8217;s going to come out.  I&#8217;ve had mixed success with Intel VTune on AMD in the past, but I&#8217;ll be looking at that a bit more if uProf updates don&#8217;t pan out.  The basic VS profiling just isn&#8217;t cutting it.<\/li><li>My next plan as far as increasing the complexity of the actual work load is to add reflection and refraction into the threaded work.  That&#8217;ll do a couple things.  For one, it&#8217;s just more complex work that takes longer in isolation.  The second is that it increases the work load in a non-fixed fashion.  The ray count of a block heading to the background is fixed, but once the rays start interacting with shapes you get non-linear growth in ray count.<\/li><li>That presents the next thing, which is to add a message pump so I can spawn threads independent of the cluster of worker threads all finishing.  I can spawn a bunch in a batch, then as individual work units finish I can spawn new threads instead of waiting.<\/li><li>It&#8217;s probably equally likely that just spawning threads as a percentage of the total image may be as quick as doing the message pump, and has a much lower complexity, which is nice.  I&#8217;ll probably go with this path first, then assuming I have a profiler playing nice with me, check if there&#8217;s any real reason for me to thread things further.<\/li><li>Long term I want to add model loading, but that&#8217;s way off in the future.  For now I at least want to add shapes other than spheres just for variety<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Being a primarily Unreal-focused developer, I don&#8217;t really spend that much time in standard C++. Ya, technically I work a lot in C++, but C++ using STL and related things is very different from the custom containers and macro-heavy nature of working in Unreal. Part of the fallout of that is that I generally miss &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.blog.dwgames.net\/?p=1072\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Programmer Ramblings &#8211; Screwing Around With std::thread&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[207],"tags":[208],"class_list":["post-1072","post","type-post","status-publish","format-standard","hentry","category-programming","tag-programming"],"_links":{"self":[{"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/posts\/1072","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1072"}],"version-history":[{"count":10,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/posts\/1072\/revisions"}],"predecessor-version":[{"id":1084,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/posts\/1072\/revisions\/1084"}],"wp:attachment":[{"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1072"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1072"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1072"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}