{"id":1085,"date":"2020-07-22T00:18:57","date_gmt":"2020-07-22T04:18:57","guid":{"rendered":"https:\/\/www.blog.dwgames.net\/?p=1085"},"modified":"2020-07-22T00:19:31","modified_gmt":"2020-07-22T04:19:31","slug":"programmer-ramblings-continued-screwing-with-stdthread","status":"publish","type":"post","link":"https:\/\/www.blog.dwgames.net\/?p=1085","title":{"rendered":"Programmer Ramblings &#8211; Continued Screwing with std::thread"},"content":{"rendered":"\n<p>This is a followup to <a href=\"https:\/\/www.blog.dwgames.net\/?p=1072\">yesterday&#8217;s ramblings<\/a> about initial work screwing with std::thread.<\/p>\n\n\n\n<p>In talking with <a href=\"https:\/\/twitter.com\/qulntosh\">Quintosh<\/a> on Twitter, he asked why I didn&#8217;t go with some sort of thread pool.  Frankly, it&#8217;s because I&#8217;m lazy.  However, it did give me an idea.  The data that I&#8217;m working on is fairly parallelizable without memory collisions.  Rather than creating a thread pool where I hand work <em>to<\/em> threads, it would be a lot lower amount of work for a simple program if I let the threads directly request work themselves.  This gave me a few things:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>A unit of work could be much smaller, so cores that are running slower because of other system processes don&#8217;t present blocks.<\/li><li>Once I implement other things that scale the program non-linearly (ex: reflection\/refraction), I don&#8217;t have to worry about intelligently breaking up the work into equal sizes.<\/li><li>The only member that is accessed across many threads is the piece of data controlling the work request.  This keeps my actual locks to a minimum.<\/li><\/ul>\n\n\n\n<p>Much to my surprise, I also managed to get Intel VTune working completely.  
This helped confirm some of my assumptions from yesterday&#8217;s article, so I&#8217;ll cover that in some detail later on.<\/p>\n\n\n\n<!--nextpage-->\n\n\n\n<h2 class=\"wp-block-heading\">Making the threads work for me<\/h2>\n\n\n\n<p>So for this test, I basically went with this setup:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Create N threads at startup to live for the lifetime of the program.<\/li><li>Each thread individually requests new work as it completes its current task, until the program tells it there&#8217;s nothing left and the thread shuts down.<\/li><li>Once all threads have shut down, I can exit.<\/li><\/ul>\n\n\n\n<p>For a quick starting reference, here are the timings of the last iteration that ended yesterday&#8217;s ramblings and the timings of this new iteration side by side:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Sphere Threading Iteration 2 - Total Time Taken (ms):804 for thread count:4\nSphere Threading Iteration 2 - Total Time Taken (ms):433 for thread count:8\nSphere Threading Iteration 2 - Total Time Taken (ms):298 for thread count:12\nSphere Threading Iteration 2 - Total Time Taken (ms):237 for thread count:16\nSphere Threading Iteration 2 - Total Time Taken (ms):192 for thread count:20\nSphere Threading Iteration 2 - Total Time Taken (ms):166 for thread count:24\nSphere Threading Iteration 2 - Total Time Taken (ms):154 for thread count:28\nSphere Threading Iteration 2 - Total Time Taken (ms):145 for thread count:32\n\nSphere Threading Iteration 3 - Total Time Taken (ms):717 for thread count:4\nSphere Threading Iteration 3 - Total Time Taken (ms):372 for thread count:8\nSphere Threading Iteration 3 - Total Time Taken (ms):265 for thread count:12\nSphere Threading Iteration 3 - Total Time Taken (ms):210 for thread count:16\nSphere Threading Iteration 3 - Total Time Taken (ms):173 for thread count:20\nSphere Threading Iteration 3 - Total Time Taken (ms):150 for thread count:24\nSphere Threading Iteration 3 - Total Time Taken (ms):136 for thread count:28\nSphere Threading Iteration 3 - Total Time Taken (ms):127 for thread count:32<\/pre>\n\n\n\n<p>The good news: this scaled linearly in the same fashion and even showed some performance improvements at the end.  For this specific test, I set it up so that each thread can request a unit of work comprising one complete row of my output image at a time.  This matches the first iteration that I messed around with yesterday.  In that case, going to 32 threads resulted in me starting to see a distinct drop in performance, which I speculated was because of the potential cost of the join wait and variations in when threads ended.  With VTune working, I&#8217;ve confirmed that for sure, as well as why this 3rd iteration worked better in practice than the 2nd iteration with its fixed-percentage work units.  
I&#8217;ll cover that in a bit more detail on the next page.<\/p>\n\n\n\n<p>Here&#8217;s the unit test I wrote to test out this idea:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">void SphereTest_ThreadIter3::DoTest()\n{\n\tuint32_t MaxThreadCount = std::thread::hardware_concurrency();\n\t{\n\t\t\/\/ Do multiple loops to test performance at different thread counts\n\t\tfor (uint32_t ThreadCount = 4; ThreadCount &lt;= MaxThreadCount; ThreadCount += 4)\n\t\t{\n\t\t\tCurrentRow = 0;\n\t\t\t\/\/ Create the image - I stripped out the code that writes out the image, deletes the pointer, etc. to keep this readable\n\t\t\tif (Image* OutputImage = new Image(FileName.str()))\n\t\t\t{\n\t\t\t\t\/\/ Create the threads and set them loose.  Wait for them after the loop until they all complete\n\t\t\t\tstd::vector&lt;std::thread> Threads;\n\t\t\t\tfor (uint32_t t = 0; t &lt; ThreadCount; ++t)\n\t\t\t\t{\n\t\t\t\t\tThreads.push_back(std::thread(&amp;SphereTest_ThreadIter3::BatchThread, this, OutputImage));\n\t\t\t\t}\n\t\t\t\tstd::for_each(Threads.begin(), Threads.end(), std::mem_fn(&amp;std::thread::join));\n\t\t\t}\n\t\t}\n\t}\n}\n\nvoid SphereTest_ThreadIter3::BatchThread(class Image* OutputImage)\n{\n\t\/\/ Seed per-thread so every thread doesn't produce the identical jitter sequence\n\tstd::default_random_engine generator(std::random_device{}());\n\tstd::uniform_real_distribution&lt;float> distribution(-0.5f, 0.5f);\n\n\tuint32_t Row = 0;\n\t\/\/ In this test, a work unit is a full row of the output image.  The class data keeps track of which row is the next to hand out, passes it up to the thread, then the data is crunched.  The rest is a standard raytracing pass, spitting out n rays per pixel of the row to get color data back.\n\twhile (RequestWork(Row))\n\t{\n\t\tfor (uint32_t i = 0; i &lt; XSize; ++i)\n\t\t{\n\t\t\tColor PixelColor = Color();\n\t\t\tfor (uint32_t s = 0; s &lt; SampleCount; ++s)\n\t\t\t{\n\t\t\t\tfloat U = float(i + distribution(generator)) \/ XSizef;\n\t\t\t\tfloat V = float(Row + distribution(generator)) \/ YSizef;\n\n\t\t\t\tRay TraceRay(Origin, BottomLeft + U * HorizSize + V * VertSize);\n\t\t\t\tPixelColor += GetColorForRay(Shapes, TraceRay);\n\t\t\t}\n\t\t\tPixelColor \/= float(SampleCount);\n\n\t\t\tOutputImage->SetPixel(i, Row, PixelColor);\n\t\t}\n\t}\n}\n\n\/\/ Simple function to hand out the next work unit.  WorkMutex is a std::mutex class member - a mutex declared locally in this function would be a brand new object on every call and wouldn't synchronize anything.  If this returns false, the while loop in the worker thread function exits and the thread completes.\nbool SphereTest_ThreadIter3::RequestWork(uint32_t&amp; Row)\n{\n\tstd::lock_guard&lt;std::mutex> Lock(WorkMutex);\n\tif (CurrentRow >= YSize)\n\t{\n\t\treturn false;\n\t}\n\n\tRow = CurrentRow;\n\tCurrentRow++;\n\n\treturn true;\n}<\/pre>\n\n\n\n<p>Being perfectly honest, the mutex in the work request function is the one lock I genuinely do need &#8211; without it, two threads could race on the increment and end up claiming the same row.  A std::atomic counter would do the same job with a bit less overhead, and either way the cost is a few ms of total execution time at most.  However, the important thing here is that this pattern will easily adjust to me doing non-fixed size work units where bounce rays start to make each row&#8217;s total ray count an unknown.<\/p>\n\n\n\n<!--nextpage-->\n\n\n\n<h2 class=\"wp-block-heading\">Data from VTune \/ Final Thoughts for Tonight<\/h2>\n\n\n\n<p>Another of my future goals from yesterday was to get <em>some<\/em> profiler working.  I&#8217;ve used VTune in the past on Intel machines to a high level of success, but I&#8217;d had trouble in some cases with AMD hardware.  
It&#8217;s been a while since I&#8217;ve tried that, but luckily it installed and worked essentially flawlessly right off the bat.  This gave me some good information that confirmed some of my assumptions from yesterday.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"636\" height=\"893\" src=\"https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image.png\" alt=\"\" class=\"wp-image-1087\" srcset=\"https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image.png 636w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-214x300.png 214w\" sizes=\"auto, (max-width: 636px) 100vw, 636px\" \/><\/figure>\n\n\n\n<p>This first picture is a piece of the thread graph from yesterday&#8217;s iteration one.  What it&#8217;s showing is that I can get a lot of threads running in parallel, but it confirms that I&#8217;ve got some problems:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>The threads never run enough work at once to really stress the machine, so I&#8217;m wasting a lot of cycles.<\/li><li>The threads spend a (relatively) high amount of time just waiting on other threads to complete.<\/li><li>This leaves a big gap before the next block of threads can get started.<\/li><\/ul>\n\n\n\n<p>And yes, the gaps are exacerbated by the presence of the profiler, but that just further proved my assumption that I don&#8217;t want to introduce places where I&#8217;m simply waiting on stuff.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2046\" height=\"963\" src=\"https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-1.png\" alt=\"\" class=\"wp-image-1088\" srcset=\"https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-1.png 2046w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-1-300x141.png 300w, 
https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-1-1024x482.png 1024w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-1-768x361.png 768w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-1-1536x723.png 1536w\" sizes=\"auto, (max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><\/figure>\n\n\n\n<p>This picture is the thread graph from the second iteration test yesterday.  It shows a fairly nice use of the entire CPU as the threads come online and start operating on their work units.  However, the threads end at hugely varying times based on whether they crunch through their work or get stuck waiting on something else on their core.  It&#8217;s a distinct improvement, but it would get worse once I add reflection\/refraction and make the work units non-fixed in size.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2251\" height=\"909\" src=\"https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-2.png\" alt=\"\" class=\"wp-image-1089\" srcset=\"https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-2.png 2251w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-2-300x121.png 300w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-2-1024x414.png 1024w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-2-768x310.png 768w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-2-1536x620.png 1536w, https:\/\/www.blog.dwgames.net\/wp-content\/uploads\/2020\/07\/image-2-2048x827.png 2048w\" sizes=\"auto, (max-width: 767px) 89vw, (max-width: 1000px) 54vw, (max-width: 1071px) 543px, 580px\" \/><\/figure>\n\n\n\n<p>This final image is the iteration from tonight.  What we&#8217;ve now got is the threads all there, but working a lot more efficiently with the resources that they&#8217;re given. 
 Threads spinning up later simply do less work.  Threads that get blocked by other system resource needs can slow down.  However, overall we&#8217;ve got a much more consistent period of running at higher total CPU utilization.  Importantly, the threads also all manage to finish significantly closer together, minimizing my time spent waiting on them to complete at the end.<\/p>\n\n\n\n<p>This particular variation is what I&#8217;ll be using when I move forward with my next set of features in the ray tracer, and at that point I also plan on comparing the performance of a row as a work unit vs. a pixel as a work unit to see if this new pattern is more conducive to that smaller work unit.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is a followup to yesterday&#8217;s ramblings about initial work screwing with std::thread. In talking with Quintosh on Twitter, he asked why I didn&#8217;t go with some sort of thread pool. Frankly, it&#8217;s because I&#8217;m lazy. However, it did give me an idea. The data that I&#8217;m working on is fairly parallelizable without memory collisions. 
&hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.blog.dwgames.net\/?p=1085\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Programmer Ramblings &#8211; Continued Screwing with std::thread&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[207],"tags":[208],"class_list":["post-1085","post","type-post","status-publish","format-standard","hentry","category-programming","tag-programming"],"_links":{"self":[{"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/posts\/1085","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1085"}],"version-history":[{"count":5,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/posts\/1085\/revisions"}],"predecessor-version":[{"id":1093,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=\/wp\/v2\/posts\/1085\/revisions\/1093"}],"wp:attachment":[{"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1085"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1085"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blog.dwgames.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1085"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}