From 4437cb497915f067caf3eb1f62978c0185eefb82 Mon Sep 17 00:00:00 2001 From: nick black Date: Thu, 26 Sep 2024 12:03:17 -0400 Subject: [PATCH] Chapter 13 edits (#79) * 12-2: order changes * 13-5: put all markers in backticks * 13-5: en dashes * 13-4: learnings -> lessons * 13-4 article * reorder a few things in the epilog --- ...-0 Optimizing Multithreaded Applications.md | 2 +- .../13-2 Performance scaling and overhead.md | 2 +- .../13-3 Core Count Scaling Case Study.md | 18 +++++++++--------- .../13-4 Task Scheduling.md | 2 +- .../13-6 Cache Coherence Issues.md | 2 +- chapters/15-Epilog/15-0 Epilog.md | 10 +++++----- 6 files changed, 18 insertions(+), 18 deletions(-) diff --git a/chapters/13-Optimizing-Multithreaded-Applications/13-0 Optimizing Multithreaded Applications.md b/chapters/13-Optimizing-Multithreaded-Applications/13-0 Optimizing Multithreaded Applications.md index 6d5c279e5c..5eafedbf51 100644 --- a/chapters/13-Optimizing-Multithreaded-Applications/13-0 Optimizing Multithreaded Applications.md +++ b/chapters/13-Optimizing-Multithreaded-Applications/13-0 Optimizing Multithreaded Applications.md @@ -2,7 +2,7 @@ Modern CPUs are getting more and more cores each year. As of 2024, you can buy a server processor which will have more than 200 cores! And even a mid-range laptop with 16 execution threads is a pretty usual setup nowadays. Since there is so much processing power in every CPU, effective utilization of all the hardware threads becomes more challenging. Preparing software to scale well with a growing amount of CPU cores is very important for the future success of your application. -There is a difference in how server and client products exploit parallelism. Most server platforms are designed to process requests from a large number of customers. Those requests are usually independent of each other, so the server can process them in parallel. If there is enough load on the system, applications themselves could even be single-threaded, and still platform utilization will be high. The situation changes drastically, if you start using your server for HPC or AI computations; then you need all the computing power you have. On the other hand, client platforms, such as laptops and desktops, have all the resources to serve a single user. In this case, an application has to make use of all the available cores to provide the best user experience. +There is a difference in how server and client products exploit parallelism. Most server platforms are designed to process requests from a large number of customers. Those requests are usually independent of each other, so the server can process them in parallel. If there is enough load on the system, applications themselves could even be single-threaded, and still platform utilization will be high. The situation changes drastically if you start using your server for HPC or AI computations; then you need all the computing power you have. On the other hand, client platforms, such as laptops and desktops, have all the resources to serve a single user. In this case, an application has to make use of all the available cores to provide the best user experience. Multithreaded applications have their specifics. Certain assumptions of single-threaded execution become invalid when we start dealing with multiple threads. For example, we can no longer identify hotspots by looking at a single thread since each thread might have its hotspot. 
In a popular [producer-consumer](https://en.wikipedia.org/wiki/Producer–consumer_problem)[^5] design, the producer thread may sleep most of the time. Profiling such a thread won't shed light on the reason why a multithreaded application is performing poorly. We must concentrate on the critical path that has an impact on the overall performance of the application. diff --git a/chapters/13-Optimizing-Multithreaded-Applications/13-2 Performance scaling and overhead.md b/chapters/13-Optimizing-Multithreaded-Applications/13-2 Performance scaling and overhead.md index c93cc36a0d..22a88c819b 100644 --- a/chapters/13-Optimizing-Multithreaded-Applications/13-2 Performance scaling and overhead.md +++ b/chapters/13-Optimizing-Multithreaded-Applications/13-2 Performance scaling and overhead.md @@ -11,7 +11,7 @@ This effect is widely known as [Amdahl's law](https://en.wikipedia.org/wiki/Amda Amdahl's Law and Universal Scalability Law. -In reality, further adding computing nodes to the system may yield retrograde speed up. We will see examples of it in the next section. This effect is explained by Neil Gunther as [Universal Scalability Law](http://www.perfdynamics.com/Manifesto/USLscalability.html#tth_sEc1)[^8] (USL), which is an extension of Amdahl's law. USL describes communication between computing nodes (threads) as yet another gating factor against performance. As the system is scaled up, overheads start to hinder the gains. Beyond a critical point, the capability of the system starts to decrease (see Figure @fig:MT_USL). USL is widely used for modeling the capacity and scalability of the systems. +In reality, further adding computing nodes to the system may yield retrograde speed up. We will see examples of it in the next section. This effect is explained by Neil Gunther as the [Universal Scalability Law](http://www.perfdynamics.com/Manifesto/USLscalability.html#tth_sEc1)[^8] (USL), which is an extension of Amdahl's law. USL describes communication between computing nodes (threads) as yet another factor gating performance. As the system is scaled up, overheads start to neutralize the gains. Beyond a critical point, the capability of the system starts to decrease (see Figure @fig:MT_USL). USL is widely used for modeling the capacity and scalability of the systems. The slowdowns described by USL are driven by several factors. First, as the number of computing nodes increases, they start to compete for resources (contention). This results in additional time being spent on synchronizing those accesses. Another issue occurs with resources that are shared between many workers. We need to maintain a consistent state of the shared resources between many workers (coherence). For example, when multiple workers frequently change a globally visible object, those changes need to be broadcast to all nodes that use that object. Suddenly, usual operations start getting more time to finish due to the additional need to maintain coherence. Optimizing multithreaded applications not only involves all the techniques described in this book so far but also involves detecting and mitigating the aforementioned effects of contention and coherence. 
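To make the two curves in Figure @fig:MT_USL concrete, here is one common way to write them down; this is a sketch using the textbook formulations, and the symbol names vary between sources. Amdahl's Law gives the speedup on $N$ nodes for a program whose parallelizable fraction is $p$, while USL adds a contention coefficient $\sigma$ and a coherence coefficient $\kappa$:

$$ Speedup_{Amdahl}(N) = \frac{1}{(1 - p) + p/N} $$

$$ Speedup_{USL}(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)} $$

With $\sigma = \kappa = 0$, USL predicts ideal linear scaling; a nonzero $\sigma$ caps the achievable speedup much like the serial fraction in Amdahl's Law; and a nonzero $\kappa$ adds a quadratically growing coherence penalty, which is what produces the retrograde part of the curve beyond the critical point.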
diff --git a/chapters/13-Optimizing-Multithreaded-Applications/13-3 Core Count Scaling Case Study.md b/chapters/13-Optimizing-Multithreaded-Applications/13-3 Core Count Scaling Case Study.md index 335130a7a2..6d2171ae85 100644 --- a/chapters/13-Optimizing-Multithreaded-Applications/13-3 Core Count Scaling Case Study.md +++ b/chapters/13-Optimizing-Multithreaded-Applications/13-3 Core Count Scaling Case Study.md @@ -5,11 +5,11 @@ Thread count scaling is perhaps the most valuable analysis you can perform on a 1. Blender 3.4 - an open-source 3D creation and modeling software project. This test is of Blender's Cycles performance with the BMW27 blend file. URL: [https://download.blender.org/release](https://download.blender.org/release). Command line: `./blender -b bmw27_cpu.blend -noaudio --enable-autoexec -o output.test -x 1 -F JPEG -f 1 -t N`, where `N` is the number of threads. Clang 17 self-build - this test uses clang 17 to build the clang 17 compiler from sources. URL: [https://www.llvm.org](https://www.llvm.org). Command line: `ninja -jN clang`, where `N` is the number of threads. 3. Zstandard v1.5.5, a fast lossless compression algorithm. URL: [https://github.com/facebook/zstd](https://github.com/facebook/zstd). A dataset used for compression: [http://wanos.co/assets/silesia.tar](http://wanos.co/assets/silesia.tar). Command line: `./zstd -TN -3 -f -- silesia.tar`, where `N` is the number of compression worker threads. 4. CloverLeaf 2018 - a Lagrangian-Eulerian hydrodynamics benchmark. All hardware threads are used. This test uses the input file `clover_bm.in` (Problem 5). URL: [http://uk-mac.github.io/CloverLeaf](http://uk-mac.github.io/CloverLeaf). Command line: `export OMP_NUM_THREADS=N; ./clover_leaf`, where `N` is the number of threads. -5. CPython 3.12, a reference implementation of the Python programming language. URL: [https://github.com/python/cpython](https://github.com/python/cpython). We run a simple multithreaded binary search script written in Python, which searches `10'000` random numbers (needles) in a sorted list of `1'000'000` elements (haystack). Command line: `./python3 binary_search.py N`, where `N` is the number of threads. Needles are divided equally between threads. +5. CPython 3.12, a reference implementation of the Python programming language. URL: [https://github.com/python/cpython](https://github.com/python/cpython). We run a simple multithreaded binary search script written in Python, which searches `10,000` random numbers (needles) in a sorted list of `1,000,000` elements (haystack). Command line: `./python3 binary_search.py N`, where `N` is the number of threads. Needles are divided equally between threads. The benchmarks were executed on a machine with the configuration shown below: -* 12th Gen Alderlake Intel(R) Core(TM) i7-1260P CPU @ 2.10GHz (4.70GHz Turbo), 4P+8E cores, 18MB L3-cache. +* 12th Gen Alder Lake Intel® Core™ i7-1260P CPU @ 2.10GHz (4.70GHz Turbo), 4P+8E cores, 18MB L3-cache. * 16 GB RAM, DDR4 @ 2400 MT/s. * Clang 15 compiler with the following options: `-O3 -march=core-avx2`. * 256GB NVMe PCIe M.2 SSD. @@ -37,7 +37,7 @@ There is another aspect of scaling degradation that we will talk about when disc ### Clang {.unlisted .unnumbered} -While Blender uses multithreading to exploit parallelism, concurrency in C++ compilation is usually achieved with multiprocessing. Clang 17 has more than `2'500` translation units, and to compile each of them, a new process is spawned. 
Similar to Blender, we classify Clang compilation as massively parallel, yet they scale differently. We recommend you revisit [@sec:PerfMetricsCaseStudy] for an overview of Clang compiler performance bottlenecks. In short, it has a large codebase, flat profile, many small functions, and "branchy" code. Its performance is affected by Dcache, Icache, and TLB misses, and branch mispredictions. Clang's thread count scaling is affected by the same scaling issues as Blender: P-cores are more effective than E-cores, and P-core SMT scaling is about `1.1x`. However, there is more. Notice that scaling stops at around 10 threads, and starts to degrade. Let's understand why that happens. +While Blender uses multithreading to exploit parallelism, concurrency in C++ compilation is usually achieved with multiprocessing. Clang 17 has more than `2,500` translation units, and to compile each of them, a new process is spawned. Similar to Blender, we classify Clang compilation as massively parallel, yet they scale differently. We recommend you revisit [@sec:PerfMetricsCaseStudy] for an overview of Clang compiler performance bottlenecks. In short, it has a large codebase, flat profile, many small functions, and "branchy" code. Its performance is affected by D-Cache, I-Cache, and TLB misses, and branch mispredictions. Clang's thread count scaling is affected by the same scaling issues as Blender: P-cores are more effective than E-cores, and P-core SMT scaling is about `1.1x`. However, there is more. Notice that scaling stops at around 10 threads, and starts to degrade. Let's understand why that happens. The problem is related to the frequency throttling. When multiple cores are utilized simultaneously, the processor generates more heat due to the increased workload on each core. To prevent overheating and maintain stability, CPUs often throttle down their clock speeds depending on how many cores are in use. Additionally, boosting all cores to their maximum turbo frequency simultaneously would require significantly more power, which might exceed the power delivery capabilities of the CPU. Our system doesn't possess an advanced liquid cooling solution and only has a single processor fan. That's why it cannot sustain high frequencies when many cores are utilized. @@ -73,19 +73,19 @@ On the image, we have the main thread at the bottom (TID 913273), and eight work On the worker thread timeline (top 8 rows) we have the following markers: -* 'job0' - 'job25' bars indicate the start and end of a job. -* 'ww' (short for "worker wait") bars indicate a period when a worker thread is waiting for a new job. +* `job0`--`job25` bars indicate the start and end of a job. +* `ww` (short for "worker wait") bars indicate a period when a worker thread is waiting for a new job. * Notches below job periods indicate that a thread has just finished compressing a portion of the input block and is signaling to the main thread that the data is available to be partially flushed. On the main thread timeline (row 9, TID 913273) we have the following markers: -* 'p0' - 'p25' boxes indicate periods of preparing a new job. It starts when the main thread starts filling up the input buffer until it is full (but this new job is not necessarily posted on the worker queue immediately). -* 'fw' (short for "flush wait") markers indicate a period when the main thread waits for the produced data to start flushing it. During this time, the main thread is blocked. +* `p0`--`p25` boxes indicate periods of preparing a new job. 
It starts when the main thread starts filling up the input buffer until it is full (but this new job is not necessarily posted on the worker queue immediately). +* `fw` (short for "flush wait") markers indicate a period when the main thread waits for the produced data to start flushing it. During this time, the main thread is blocked. With a quick glance at the image, we can tell that there are many `ww` periods when worker threads are waiting. This negatively affects the performance of Zstandard compression. Let's progress through the timeline and try to understand what's going on. 1. First, when worker threads are created, there is no work to do, so they are waiting for the main thread to post a new job. -2. Then the main thread starts to fill up the input buffers for the worker threads. It has prepared jobs 0 to 7 (see bars `p0` - `p7`), which were picked up by worker threads immediately. Notice, that the main thread also prepared `job8` (`p8`), but it hasn't posted it in the worker queue yet. This is because all workers are still busy. +2. Then the main thread starts to fill up the input buffers for the worker threads. It has prepared jobs 0 to 7 (see bars `p0`--`p7`), which were picked up by worker threads immediately. Notice, that the main thread also prepared `job8` (`p8`), but it hasn't posted it in the worker queue yet. This is because all workers are still busy. 3. After the main thread has finished `p8`, it flushed the data already produced by `job0`. Notice, that by this time, `job0` has already delivered five portions of compressed data (first five notches below the `job0` bar). Now, the main thread enters its first `fw` period and starts to wait for more data from `job0`. 4. At the timestamp `45ms`, one more chunk of compressed data is produced by `job0`, and the main thread briefly wakes up to flush it, see \circled{1}. After that, it goes to sleep again. 5. `Job3` is the first to finish, but there is a couple of milliseconds delay before TID 913309 picks up the new job, see \circled{2}. This happens because `job8` was not posted in the queue by the main thread. Luckily, the new portion of compressed data comes from `job0`, so the main thread wakes up, flushes it, and notices that there are idle worker threads. So, it posts `job8` to the worker queue and starts preparing the next job (`p9`). @@ -93,7 +93,7 @@ With a quick glance at the image, we can tell that there are many `ww` periods w 7. We should have expected that the main thread would start preparing `job11` immediately after `job10` was posted in the queue as it did before. But it didn't. This happens because there are no available input buffers. We will discuss it in more detail shortly. 8. Only when `job0` finishes, the main thread was able to acquire a new input buffer and start preparing `job11` (see \circled{5}). -As we just said, the reason for the 20-40ms delays between jobs is the lack of input buffers, which are required to start preparing a new job. Zstd maintains a single memory pool, which allocates space for both input and output buffers. This memory pool is prone to fragmentation issues, as it has to provide contiguous blocks of memory. When a worker finishes a job, the output buffer is waiting to be flushed, but it still occupies memory. And to start working on another job, it will require another pair of buffers. +As we just said, the reason for the 20--40ms delays between jobs is the lack of input buffers, which are required to start preparing a new job. 
Zstd maintains a single memory pool, which allocates space for both input and output buffers. This memory pool is prone to fragmentation issues, as it has to provide contiguous blocks of memory. When a worker finishes a job, the output buffer is waiting to be flushed, but it still occupies memory. And to start working on another job, it will require another pair of buffers. Limiting the capacity of the memory pool is a design decision to reduce memory consumption. In the worst case, there could be many "run-away" buffers, left by workers that have completed their jobs very fast, and move on to process the next job; meanwhile, the flush queue is still blocked by one slow job. In such a scenario, the memory consumption would be very high, which is undesirable. However, the downside here is increased wait time between the jobs. diff --git a/chapters/13-Optimizing-Multithreaded-Applications/13-4 Task Scheduling.md b/chapters/13-Optimizing-Multithreaded-Applications/13-4 Task Scheduling.md index 0ead34a17b..15c237c779 100644 --- a/chapters/13-Optimizing-Multithreaded-Applications/13-4 Task Scheduling.md +++ b/chapters/13-Optimizing-Multithreaded-Applications/13-4 Task Scheduling.md @@ -44,7 +44,7 @@ Our second piece of advice is to avoid static partitioning on systems with asymm In the final example, we switch to using dynamic partitioning. With dynamic partitioning, chunks are distributed to threads dynamically. Each thread processes a chunk of elements, then requests another chunk, until no chunks remain to be distributed. Figure @fig:OmpDynamic shows the result of using dynamic partitioning by dividing the array into 16 chunks. With this scheme, each task becomes more granular, which enables OpenMP runtime to balance the work even when P-cores run two times faster than E-cores. However, notice that there is still some idle time on E-cores. -Performance can be slightly improved if we partition the work into 128 chunks instead of 16. But don't make the jobs too small, otherwise it will result in increased management overhead. The result summary of our experiments is shown in Table [@tbl:TaskSchedulingResults]. Partitioning the work into 128 chunks turns out to be the sweet spot for our example. Even though our example is very simple, learnings from it can be applied to production-grade multithreaded software. +Performance can be slightly improved if we partition the work into 128 chunks instead of 16. But don't make the jobs too small, otherwise it will result in increased management overhead. The result summary of our experiments is shown in Table [@tbl:TaskSchedulingResults]. Partitioning the work into 128 chunks turns out to be the sweet spot for our example. Even though our example is very simple, lessons from it can be applied to production-grade multithreaded software. ------------------------------------------------------------------------------------------------ Affinity Static Dynamic, Dynamic, Dynamic, Dynamic, diff --git a/chapters/13-Optimizing-Multithreaded-Applications/13-6 Cache Coherence Issues.md b/chapters/13-Optimizing-Multithreaded-Applications/13-6 Cache Coherence Issues.md index 564a53db34..d732d4bcc3 100644 --- a/chapters/13-Optimizing-Multithreaded-Applications/13-6 Cache Coherence Issues.md +++ b/chapters/13-Optimizing-Multithreaded-Applications/13-6 Cache Coherence Issues.md @@ -31,7 +31,7 @@ unsigned int sum; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -First of all, true sharing implies data races that can be tricky to detect. 
Fortunately, there are tools that can help identify such issues. [Thread sanitizer](https://clang.llvm.org/docs/ThreadSanitizer.html)[^30] from Clang and [helgrind](https://www.valgrind.org/docs/manual/hg-manual.html)[^31] are among such tools. To prevent data race in [@lst:TrueSharing], you should declare the `sum` variable as `std::atomic sum`. +First of all, true sharing implies data races that can be tricky to detect. Fortunately, there are tools that can help identify such issues. [Thread sanitizer](https://clang.llvm.org/docs/ThreadSanitizer.html)[^30] from Clang and [helgrind](https://www.valgrind.org/docs/manual/hg-manual.html)[^31] are among such tools. To prevent the data race in [@lst:TrueSharing], you should declare the `sum` variable as `std::atomic sum`. Using C++ atomics can help to solve data races when true sharing happens. However, it effectively serializes accesses to the atomic variable, which may hurt performance. Another way of solving true sharing issues is by using Thread Local Storage (TLS). TLS is the method by which each thread in a given multithreaded process can allocate memory to store thread-specific data. By doing so, threads modify their local copies instead of contending for a globally available memory location. The example in [@lst:TrueSharing] can be fixed by declaring `sum` with a TLS class specifier: `thread_local unsigned int sum` (since C++11). The main thread should then incorporate results from all the local copies of each worker thread. diff --git a/chapters/15-Epilog/15-0 Epilog.md b/chapters/15-Epilog/15-0 Epilog.md index 6da884c082..0c79f0eb8d 100644 --- a/chapters/15-Epilog/15-0 Epilog.md +++ b/chapters/15-Epilog/15-0 Epilog.md @@ -5,23 +5,23 @@ Thanks for reading through the whole book. I hope you enjoyed it and found it useful. I would be even happier if the book would help you solve a real-world problem. In such a case, I would consider it a success and proof that my efforts were not wasted. Before you continue with your endeavors, let me briefly highlight the essential points of the book and give you final recommendations: -* Single-threaded CPU performance is not increasing as rapidly as it used to a few decades ago. When it's no longer the case that each hardware generation provides a significant performance boost, developers should start optimizing the code of their software. * Modern software is massively inefficient. There are big optimization opportunities to reduce carbon emissions and make a better user experience. People hate using slow software, especially when their productivity goes down because of it. Not all fast software is world-class, but all world-class software is fast. Performance is _the_ killer feature. * For many years performance engineering was a nerdy niche. But now it is mainstream as software vendors have realized the impact that their poorly optimized software has on their bottom line. Performance tuning is more critical than it has been for the last 40 years. It will be one of the key drivers for performance gains in the near future. * The importance of low-level performance tuning should not be underestimated, even if it's just a 1% improvement. The cumulative effect of these small improvements is what makes the difference. * There is a famous quote by Donald Knuth: "Premature optimization is the root of all evil". But the opposite is often true as well. Postponed performance engineering work may be too late and cause as much evil as premature optimization. 
Do not neglect performance aspects when designing your future products. Save your project by integrating automated performance benchmarking into your CI/CD pipeline. *Measure early, measure often.* +* Single-threaded CPU performance is not increasing as rapidly as it used to a few decades ago. When it's no longer the case that each hardware generation provides a significant performance boost, developers should start optimizing the code of their software. * The performance of modern CPUs is not deterministic and depends on many factors. Meaningful performance analysis should account for noise and use statistical methods to analyze performance measurements. * Knowledge of the CPU microarchitecture is required to reach peak performance. However, your mental model can never be as accurate as the actual microarchitecture design of a CPU. So don't solely rely on your intuition when you make a specific change in your code. Predicting the performance of a particular piece of code is nearly impossible. *Always measure!* * Performance engineering is hard because there are no predetermined steps you should follow, no algorithm. Engineers need to tackle problems from different angles. Know performance analysis methods and tools (both hardware and software) that are available to you. I strongly suggest embracing the TMA methodology and Roofline performance model if they are available on your platform. They will help you to steer your work in the right direction. -* Nowadays, there is a variety of performance analysis tools available on each platform. Spend some time to learn how to use them, what they can do, and how to interpret the results they provide. -* When you identify the performance-limiting factor of your application, consider you are more than halfway to solving the problem. Based on my experience, the fix is often easier than finding the root cause of the problem. +* Nowadays, there are a variety of performance analysis tools available on each platform. Spend some time to learn how to use them, what they can do, and how to interpret the results they provide. +* When you identify the performance-limiting factor of your application, you are more than halfway to solving the problem. Based on my experience, the fix is often easier than finding the root cause of the problem. In Part 2 we covered some essential optimizations for every type of CPU performance bottleneck: how to optimize memory accesses and computations, how to get rid of branch mispredictions, how to improve machine code layout, and a few others. Use chapters from Part 2 as a reference to see what options are available when your application has one of these problems. * Multithreaded programs add one more dimension of complexity to performance tuning. They introduce new types of bottlenecks and require additional tools and methods to analyze and optimize. Examining how an application scales with the number of threads is an effective way to identify bottlenecks in multithreaded programs. -* Processors from different vendors are not created equal. They differ in terms of instruction set architecture (ISA) that they support and microarchitecture implementation. Reaching peak performance on a given platform requires utilizing the latest ISA extensions and tuning your code according to the specifics of a particular CPU microarchitecture. +* Processors from different vendors are not created equal. They differ in terms of instruction set architecture (ISA) supported and microarchitectural implementation. 
Reaching peak performance on a given platform requires utilizing the latest ISA extensions and tuning your code according to the specifics of a particular CPU microarchitecture. * The previous bullet point sometimes leads to unnecessary code complications. If the performance benefit of a modification is negligible, you should keep the code in its most simple and clean form. * Sometimes modifications that improve performance on one system slow down execution on another system. Make sure you test your changes on all the platforms that you care about. -I hope you now have a better understanding of low-level performance optimizations. I also hope it will help to better understand performance bottlenecks in your application. Of course, this book doesn't cover every possible scenario you may encounter in your daily job. My goal was to give you a starting point and to show you potential options and strategies for dealing with performance analysis and tuning on modern CPUs. I recognize that it is just the beginning and that there is much more to learn. I wish you experience the joy of discovering performance bottlenecks in your application and the satisfaction of fixing them. +I hope you now have a better understanding of low-level performance optimizations. I also hope you will better understand performance bottlenecks in your application. Of course, this book doesn't cover every possible scenario you may encounter in your daily job. My goal was to give you a starting point and to show you potential options and strategies for dealing with performance analysis and tuning on modern CPUs. I recognize that it is just the beginning and that there is much more to learn. I wish you the joy of discovering performance bottlenecks in your application and the satisfaction of fixing them. If you enjoyed reading this book, make sure to pass it on to your friends and colleagues. I would appreciate your help in spreading the word about the book by endorsing it on social media platforms.