diff --git a/chapters/0-Preface/0-2 Preface.md b/chapters/0-Preface/0-2 Preface.md
index 2379dcdf7a..be3b6ed811 100644
--- a/chapters/0-Preface/0-2 Preface.md
+++ b/chapters/0-Preface/0-2 Preface.md
@@ -30,7 +30,9 @@ I joined Intel in 2017, but even before that I never shied away from software op
 I sincerely hope that this book will help you learn low-level performance analysis, and, if you make your application faster as a result, I will consider my mission accomplished.
 
-You will find that I use "we" instead of "I" in many places in the book. This is because I received a lot of help from other people. The PDF version of this book and the "Performance Ninja" online course are available for free. This is my way to give back to the community. The full list of contributors can be found at the end of the book in the "Acknowledgements" section.
+You will find that I use "we" instead of "I" in many places in the book. This is because I received a lot of help from other people. The full list of contributors can be found at the end of the book in the "Acknowledgements" section.
+
+The PDF version of this book and the "Performance Ninja" online course are available for free. This is my way of giving back to the community.
 
 ## Target Audience {.unlisted .unnumbered}
 
@@ -40,7 +42,7 @@ This book will also be useful for any developer who wants to understand the perf
 Readers are expected to have a minimal background in C/C++ programming languages to understand the book's examples. The ability to read basic x86/ARM assembly is desired but is not a strict requirement. I also expect familiarity with basic concepts of computer architecture and operating systems like central processor, memory, process, thread, virtual and physical memory, context switch, etc. If any of the mentioned terms are new to you, I suggest studying this material first.
 
-I suggest you read the book chapter by chapter, starting from the beginning. If you consider yourself a beginner in performance analysis, I do not recommend skipping chapters. After you finish reading, this book can be used as a reference or a checklist for optimizing software applications. The second part of the book can be a source of ideas for code optimizations.
+I suggest you read the book chapter by chapter, starting from the beginning. If you consider yourself a beginner in performance analysis, I do not recommend skipping chapters. After you finish reading, you can use this book as a source of ideas whenever you face a performance issue and it's not immediately clear how to fix it. You can skim through the second part of the book to see which optimization techniques can be applied to your code.
 
 [TODO]: put a link to an errata webpage
diff --git a/chapters/1-Introduction/1-0 Introduction.md b/chapters/1-Introduction/1-0 Introduction.md
index 820d613eb2..fd90573a48 100644
--- a/chapters/1-Introduction/1-0 Introduction.md
+++ b/chapters/1-Introduction/1-0 Introduction.md
@@ -2,11 +2,15 @@
 They say, "Performance is king". It was true a decade ago, and it certainly is now. According to [@Domo2017], in 2017, the world has been creating 2.5 quintillion[^1] bytes of data every day, and as predicted in [@Statista2024], it will reach 400 quintillion bytes per day in 2024. In our increasingly data-centric world, the growth of information exchange fuels the need for both faster software and faster hardware.
 
-Software programmers have had an "easy ride" for decades, thanks to Moore’s law. It used to be the case that some software vendors preferred to wait for a new generation of hardware to speed up their software products and did not spend human resources on making improvements in the code. By looking at Figure @fig:50YearsProcessorTrend, we can see that single-threaded[^2] performance growth is slowing down.
+Software programmers have had an "easy ride" for decades, thanks to Moore’s law. It used to be the case that some software vendors preferred to wait for a new generation of hardware to speed up their software products and did not spend human resources on making improvements in their code. By looking at Figure @fig:50YearsProcessorTrend, we can see that single-threaded[^2] performance growth is slowing down. From 1990 to 2000, single-threaded performance grew by a factor of approximately 25 to 30, based on SPECint benchmarks. The increase in CPU frequency was the key factor driving performance growth.
 
 ![50 Years of Microprocessor Trend Data. *© Image by K. Rupp via karlrupp.net*. Original data up to the year 2010 was collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2021 by K. Rupp.](../../img/intro/50-years-processor-trend.png){#fig:50YearsProcessorTrend width=100%}
 
-The original interpretation of Moore's law is still standing, though, as transistor count in modern processors maintains its trajectory. For instance, the number of transistors in Apple chips grew from 16 billion in M1 to 20 billion in M2, to 25 billion in M3, to 28 billion in M4 in a span of roughly four years. The growth in transistor count enables manufacturers to add more cores to a processor. As of 2024, you can buy a high-end server processor that will have more than 100 logical cores on a single CPU socket. This is very impressive, unfortunately, it doesn't always translate into better performance. Very often, application performance doesn't scale with extra CPU cores.
+However, from 2000 to 2010, single-threaded CPU performance growth was more modest compared to the previous decade (approximately 4 to 5 times). Clock speed stagnated due to a combination of power consumption and heat dissipation challenges, limitations in voltage scaling (Dennard Scaling[^3]), and other fundamental problems. Despite slower clock speed improvements, architectural advancements continued, including better branch prediction, deeper pipelines, larger caches, and more efficient execution units.
+
+From 2010 to 2020, single-threaded performance grew only by about 2 to 3 times. During this period, CPU manufacturers began to focus more on multi-core processors and parallelism rather than solely increasing single-threaded performance.
+
+The original interpretation of Moore's law is still standing, as transistor count in modern processors maintains its trajectory. For instance, the number of transistors in Apple chips grew from 16 billion in M1 to 20 billion in M2, to 25 billion in M3, to 28 billion in M4 in a span of roughly four years. The growth in transistor count enables manufacturers to add more cores to a processor. As of 2024, you can buy a high-end server processor that has more than 100 logical cores on a single CPU socket. This is very impressive; unfortunately, it doesn't always translate into better performance. Very often, application performance doesn't scale with extra CPU cores.
 
 When it's no longer the case that each hardware generation provides a significant performance boost, we must start paying more attention to how fast our code runs. When seeking ways to improve performance, developers should not rely on hardware. Instead, they should start optimizing the code of their applications.
 
@@ -14,3 +18,4 @@
 
 [^1]: A quintillion is a thousand raised to the power of six (10^18^).
 [^2]: Single-threaded performance is the performance of a single hardware thread inside a CPU core when measured in isolation.
+[^3]: Dennard Scaling - [https://en.wikipedia.org/wiki/Dennard_scaling](https://en.wikipedia.org/wiki/Dennard_scaling)
\ No newline at end of file
diff --git a/chapters/1-Introduction/1-1 Why Software Is Slow.md b/chapters/1-Introduction/1-1 Why Software Is Slow.md
index 54e5a0b7e4..d18025e10e 100644
--- a/chapters/1-Introduction/1-1 Why Software Is Slow.md
+++ b/chapters/1-Introduction/1-1 Why Software Is Slow.md
@@ -1,6 +1,6 @@
 ## Why Software Is Slow?
 
-If all the software in the world would magically start utilizing all available hardware resources efficiently, then this book would not exist. We would not need any changes on the software side and would rely on what existing processors have to offer. But you already know that the reality is different, right? The reality is that modern software is *massively* inefficient. A regular server system in a public cloud, typically runs poorly optimized code, consuming more power than it could have consumed, which increases carbon emissions and contributes to other environmental issues. If we could make all software run two times faster, this would reduce the carbon footprint of computing by a factor of two.
+If all the software in the world magically started utilizing all available hardware resources efficiently, then this book would not exist. We would not need any changes on the software side and would rely on what existing processors have to offer. But you already know that the reality is different, right? The reality is that modern software is *massively* inefficient. A regular server system in a public cloud typically runs poorly optimized code, consuming more power than it needs to, which increases carbon emissions and contributes to other environmental issues. If we could make all software run two times faster, this would potentially reduce the carbon footprint of computing by a factor of two.
 
 The authors of the paper [@Leisersoneaam9744] provide an excellent example that illustrates the performance gap between "default" and highly optimized software. Table @tbl:PlentyOfRoom summarizes speedups from performance engineering a program that multiplies two 4096-by-4096 matrices. The end result of applying several optimizations is a program that runs over 60,000 times faster. The reason for providing this example is not to pick on Python or Java (which are great languages), but rather to break beliefs that software has "good enough" performance by default. The majority of programs are within rows 1-5. The potential for source-code-level improvements is significant.
 
@@ -30,7 +30,7 @@ Table: Speedups from performance engineering a program that multiplies two 4096-
 So, let's talk about what prevents systems from achieving optimal performance by default. Here are some of the most important factors:
 
 1. **CPU limitations**: it's so tempting to ask: "*Why doesn't hardware solve all our problems?*" Modern CPUs execute instructions at incredible speed and are getting better with every generation. But still, they cannot do much if instructions that are used to perform the job are not optimal or even redundant. Processors cannot magically transform suboptimal code into something that performs better. For example, if we implement a sorting routine using BubbleSort algorithm, a CPU will not make any attempts to recognize it and use better alternatives, for example, QuickSort. It will blindly execute whatever it was told to do.
-2. **Compiler limitations**: "*But isn't it what compilers are supposed to do? Why don't compilers solve all our problems?*" Indeed, compilers are amazingly smart nowadays, but can still generate suboptimal code. Compilers are great at eliminating redundant work, but when it comes to making more complex decisions like vectorization, etc., they may not generate the best possible code. Performance experts often can come up with a clever way to vectorize a loop, which would be extremely hard for a traditional compiler. When compilers have to make a decision whether to perform a code transformation or not, they rely on complex cost models and heuristics, which may not work for every possible scenario. For example, there is no binary "yes" or "no" answer to the question of whether a compiler should always inline a function into the place where it's called. It usually depends on many factors which a compiler should take into account. Additionally, compilers cannot perform optimizations unless they are certain it is safe to do so, and it does not affect the correctness of the resulting machine code. It may be very difficult for compiler developers to ensure that a particular optimization will generate correct code under all possible circumstances, so they often have to be conservative and refrain from doing some optimizations. Finally, compilers generally do not attempt "heroic" optimizations, like transforming data structures used by a program.
+2. **Compiler limitations**: "*But isn't that what compilers are supposed to do? Why don't compilers solve all our problems?*" Indeed, compilers are amazingly smart nowadays, but can still generate suboptimal code. Compilers are great at eliminating redundant work, but when it comes to making more complex decisions like vectorization, they may not generate the best possible code. Performance experts can often come up with a clever way to vectorize a loop, which would be extremely hard for a traditional compiler. When compilers have to decide whether to perform a code transformation or not, they rely on complex cost models and heuristics, which may not work for every possible scenario. For example, there is no binary "yes" or "no" answer to the question of whether a compiler should always inline a function into the place where it's called. It usually depends on many factors that a compiler should take into account. Additionally, compilers cannot perform optimizations unless they are certain that it is safe to do so and that it does not affect the correctness of the resulting machine code. It may be very difficult for a compiler to prove that an optimization will generate correct code under all possible circumstances, so they often have to be conservative and refrain from doing some optimizations. Finally, compilers generally do not attempt "heroic" optimizations, like transforming data structures used by a program.
 3. **Algorithmic complexity analysis limitations**: some developers are overly obsessed with algorithmic complexity analysis, which leads them to choose a popular algorithm with the optimal algorithmic complexity, even though it may not be the most efficient for a given problem. Considering two sorting algorithms, InsertionSort and QuickSort, the latter clearly wins in terms of Big O notation for the average case: InsertionSort is O(N^2^) while QuickSort is only O(N log N). Yet for relatively small sizes of `N` (up to 50 elements), InsertionSort outperforms QuickSort. Complexity analysis cannot account for all the branch prediction and caching effects of various algorithms, so people just encapsulate them in an implicit constant `C`, which sometimes can make a drastic impact on performance. Blindly trusting Big O notation without testing on the target workload could lead developers down an incorrect path. So, the best-known algorithm for a certain problem is not necessarily the most performant in practice for every possible input.
 
 In addition to the limitations described above, there are overheads created by programming paradigms. Coding practices that prioritize code clarity, readability, and maintainability, often come at the potential performance cost. Highly generalized and reusable code can introduce unnecessary copies, runtime checks, function calls, memory allocations, and so on. For instance, polymorphism in object-oriented programming is implemented using virtual functions, which introduce a performance overhead.[^1]
diff --git a/chapters/1-Introduction/1-2 Why Care about Performance.md b/chapters/1-Introduction/1-2 Why Care about Performance.md
index 87efd4715d..4f09cd3e68 100644
--- a/chapters/1-Introduction/1-2 Why Care about Performance.md
+++ b/chapters/1-Introduction/1-2 Why Care about Performance.md
@@ -1,6 +1,6 @@
 ## Why Care about Performance?
 
-In addition to the slowing trend of hardware single-threaded performance growth, there are a couple of other business reasons to care about performance. During the PC era,[^3] the costs of slow software were paid by the users, as inefficient software was running on user computers. With the advent of SaaS (software as a service) and cloud computing, the costs of slow software are put back on the software providers, not their users. If you're a SaaS company like Meta or Netflix,[^4] it doesn't matter if you run your service on-premise hardware or you use public cloud, you pay for the electricity your servers consume. Inefficient software cuts right into your margins and market evaluation. According to Synergy Research Group,[^5] worldwide spending on cloud services topped $100 billion in 2020, and according to Gartner,[^6] it will surpass $675 billion in 2024.
+In addition to the slowing trend of hardware single-threaded performance growth, there are a couple of other business reasons to care about performance. During the PC era,[^12] the costs of slow software were paid by the users, as inefficient software was running on user computers. With the advent of SaaS (software as a service) and cloud computing, the costs of slow software are put back on the software providers, not their users. If you're a SaaS company like Meta or Netflix,[^4] it doesn't matter whether you run your service on on-premise hardware or in a public cloud: you pay for the electricity your servers consume. Inefficient software cuts right into your margins and market valuation. According to Synergy Research Group,[^5] worldwide spending on cloud services topped $100 billion in 2020, and according to Gartner,[^6] it will surpass $675 billion in 2024.
 
 For many years performance engineering was a nerdy niche, but now it's becoming mainstream. Many companies have already realized the importance of performance engineering and are willing to pay well for this work.
 
@@ -12,13 +12,14 @@ The impact of small improvements is very relevant for large distributed applicat
 
 > "At such [Google] scale, understanding performance characteristics becomes critical – even small improvements in performance or utilization can translate into immense cost savings." [@GoogleProfiling]
 
-In addition to cloud costs, there is another factor at play: how people perceive slow software. For instance, Google reported that a 2% slower search caused [2% fewer searches](https://assets.en.oreilly.com/1/event/29/Keynote Presentation 2.pdf) per user.[^7] For Yahoo! 400 milliseconds faster page load caused [5-9% more traffic](https://www.slideshare.net/stoyan/dont-make-me-wait-or-building-highperformance-web-applications).[^8] In the game of big numbers, small improvements can make a significant impact. Such examples prove that the slower the service works, the fewer people will use it.
+In addition to cloud costs, there is another factor at play: how people perceive slow software. For instance, Google reported that a 2% slower search caused [2% fewer searches](https://assets.en.oreilly.com/1/event/29/Keynote Presentation 2.pdf) per user.[^3] For Yahoo!, a 400-millisecond faster page load caused [5-9% more traffic](https://www.slideshare.net/stoyan/dont-make-me-wait-or-building-highperformance-web-applications).[^8] In the game of big numbers, small improvements can make a significant impact. Such examples prove that the slower the service works, the fewer people will use it.
 
 Outside cloud services, there are many other performance-critical industries where performance engineering does not need to be justified, such as Artificial Intelligence (AI), High-Performance Computing (HPC), High-Frequency Trading (HFT), Game Development, etc. Moreover, performance is not only required in highly specialized areas, it is also relevant for general-purpose applications and services. Many tools that we use every day simply would not exist if they failed to meet their performance requirements. For example, Visual C++ [IntelliSense](https://docs.microsoft.com/en-us/visualstudio/ide/visual-cpp-intellisense)[^2] features that are integrated into Microsoft Visual Studio IDE have very tight performance constraints. For the IntelliSense autocomplete feature to work, they have to parse the entire source codebase in the order of milliseconds.[^9] Nobody will use a source code editor if it takes several seconds to suggest autocomplete options. Such a feature has to be very responsive and provide valid continuations as the user types new code.
 
 > "Not all fast software is world-class, but all world-class software is fast. Performance is _the_ killer feature." - Tobi Lutke, CEO of Shopify.
 
-I hope it goes without saying that people hate using slow software. Especially when their productivity goes down because of it. Table @tbl:WindowsResponsiveness shows that most people (including myself) consider a delay of 2 seconds or more to be a "long wait". In fact, I would probably switch to something else after 5 seconds of waiting.
+I hope it goes without saying that people hate using slow software. Especially when their productivity goes down because of it. Table @tbl:WindowsResponsiveness shows that most people consider a delay of 2 seconds or more to be a "long wait" and would switch to something else after 10 seconds of waiting (I think much sooner). If you want to keep users' attention, your application had better react quickly.
+
 Performance characteristics of an application can be a single factor for your customer to switch to a competitor's product. By putting emphasis on performance, you can give your product a competitive advantage. Sometimes fast tools find use in the areas they were not initially designed for. For example, nowadays, game engines like Unreal and Unity are used in architecture, 3d visualization, filmmaking, and other areas. Because game engines are so performant, they are a natural choice for applications that require 2d and 3d rendering, physics simulation, collision detection, sound, animation, etc.
 
@@ -44,15 +45,15 @@ Long-running User will probably switch away during operation 10 sec - 30 sec
 
 ------------------------------------------------------------------------------
-Table: Human-software interaction classes. Image from Microsoft Windows Blogs[^11]. {#tbl:WindowsResponsiveness}
+Table: Human-software interaction classes. *Source: Microsoft Windows Blogs*.[^11] {#tbl:WindowsResponsiveness}
 
 > “Fast tools don’t just allow users to accomplish tasks faster; they allow users to accomplish entirely new types of tasks, in entirely new ways.” - Nelson Elhage wrote in an [article](https://blog.nelhage.com/post/reflections-on-performance/)[^1]on his blog.
 
 Before starting performance-related work, make sure you have a strong reason to do so. Optimization just for optimization’s sake is useless if it doesn’t add value to your product.[^10] Mindful performance engineering starts with clearly defined performance goals, stating what you are trying to achieve and why you are doing it. Also, you should pick the metrics that you will use to measure whether you reach the goal or not.
 
-Now that we've talked about the value of performance engineering, let's uncover what it consists of. When you trying to improve the performance of a program, you need to find what to improve (performance analysis) and then improve it (tuning), which is very similar to a regular debugging activity. This is what we will discuss next.
+Now that we've talked about the value of performance engineering, let's uncover what it consists of. When you're trying to improve the performance of a program, you need to find what to improve (performance analysis) and then improve it (tuning), which is very similar to a regular debugging activity. This is what we will discuss next.
 
-[^3]: From the late 1990s to the late 2000s where personal computers dominated the market of computing devices.
+[^12]: From the late 1990s to the late 2000s, when personal computers dominated the market of computing devices.
 [^4]: In 2024, Meta uses mostly on-premise cloud, while Netflix uses AWS public cloud.
 [^5]: Worldwide spending on cloud services in 2020 - [https://www.srgresearch.com/articles/2020-the-year-that-cloud-service-revenues-finally-dwarfed-enterprise-spending-on-data-centers](https://www.srgresearch.com/articles/2020-the-year-that-cloud-service-revenues-finally-dwarfed-enterprise-spending-on-data-centers)
 [^6]: Worldwide spending on cloud services in 2024 - [https://www.gartner.com/en/newsroom/press-releases/2024-05-20-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-surpass-675-billion-in-2024](https://www.gartner.com/en/newsroom/press-releases/2024-05-20-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-surpass-675-billion-in-2024)
diff --git a/chapters/1-Introduction/1-4 What is performance analysis.md b/chapters/1-Introduction/1-4 What is performance analysis.md
index 20b621fb0f..03a831654d 100644
--- a/chapters/1-Introduction/1-4 What is performance analysis.md
+++ b/chapters/1-Introduction/1-4 What is performance analysis.md
@@ -6,6 +6,6 @@ Inexperienced developers sometimes make changes in their code and claim it *shou
 Many micro-optimization tricks that circulate around the world were valid in the past, but current compilers have already learned them. Additionally, some people tend to overuse legacy bit-twiddling tricks. One such example is using [XOR-based swap idiom](https://en.wikipedia.org/wiki/XOR_swap_algorithm),[^2] while in reality, simple `std::swap` produces faster code. Such accidental changes likely won’t improve the performance of an application. Finding the right place to fix should be a result of careful performance analysis, not intuition or guessing.
 
-There are many performance analysis methodologies, and not each of them will necessarily lead you to a discovery. With experience, you will develop your own strategy about when to use each approach. The methodologies presented in this book are based on collecting information about how a program executes. Any change that ends up being made in the source code of a program should be driven by analyzing and interpreting collected data. We will show you how to use these techniques to discover opportunities for performance improvements even in a large and unfamiliar codebase.
+Performance analysis is the process of collecting information about how a program executes and interpreting it to find optimization opportunities. Any change that ends up being made in the source code of a program should be driven by analyzing and interpreting collected data. We will show you how to use performance analysis techniques to discover optimization opportunities even in a large and unfamiliar codebase. There are many performance analysis methodologies; however, not every one of them will necessarily lead you to a discovery. With experience, you will develop your own strategy for when to use each approach.
 
 [^2]: XOR-based swap idiom - [https://en.wikipedia.org/wiki/XOR_swap_algorithm](https://en.wikipedia.org/wiki/XOR_swap_algorithm)
diff --git a/chapters/1-Introduction/1-5 What is performance tuning.md b/chapters/1-Introduction/1-5 What is performance tuning.md
index b520f0407d..f848f1194f 100644
--- a/chapters/1-Introduction/1-5 What is performance tuning.md
+++ b/chapters/1-Introduction/1-5 What is performance tuning.md
@@ -4,14 +4,12 @@ Locating a performance bottleneck is only half of an engineer’s job. The secon
 To take advantage of all the computing power of modern CPUs, you need to understand how they work. Or as performance engineers like to say, you need to have "mechanical sympathy". This term was borrowed from the car racing world. It means that a racing driver with a good understanding of how the car works has an edge over its competitors who don't. The same applies to performance engineering. It is not possible to know all the details of how a modern CPU operates, but you need to have a good mental model of it to squeeze the last bit of performance.
 
-This is what I mean by "low-level optimizations". This is a type of optimization that takes into account the details of the underlying hardware capabilities. It is different from "high-level optimizations" which are more about application-level logic, algorithms, and data structures.
-
-In the past, software developers had more mechanical sympathy as they often had to deal with nuances of the hardware implementation. During the PC era, developers usually were programming directly on top of the operating system, with possibly a few libraries in between. As the world moved to the cloud era, the software stack got deeper and more complex. The top layer of the stack (on which most developers work) has moved further away from the hardware. The negative side of such evolution is that developers of modern applications have less affinity to the actual hardware on which their software is running.
-
-Having a good understanding of the underlying hardware is a requirement for making low-level optimizations. As you will see in the book, the majority of low-level optimizations can be applied to a wide variety of modern processors.
+This is what I mean by "low-level optimizations". This is a type of optimization that takes into account the details of the underlying hardware capabilities. It is different from "high-level optimizations", which are more about application-level logic, algorithms, and data structures. As you will see in the book, the majority of low-level optimizations can be applied to a wide variety of modern processors. To successfully implement low-level optimizations, you need to have a good understanding of the underlying hardware.
 
 > "During the post-Moore era, it will become ever more important to make code run fast and, in particular, to tailor it to the hardware on which it runs." [@Leisersoneaam9744]
 
-There is a famous quote by Donald Knuth: "Premature optimization is the root of all evil". But the opposite is often true as well. Postponed performance engineering work may be too late and cause as much evil as premature optimization. For developers working with performance-critical projects, it is crucial to know how underlying hardware works. In such industries, it is a failure from the start when a program is being developed without a hardware focus. ClickHouse DB is an example of a successful software product that was built around a small but very efficient kernel. Performance characteristics of software must be a first-class citizen along with correctness and security starting from day 1. Poor performance can kill a product just as easily as security vulnerabilities.
+In the past, software developers had more mechanical sympathy as they often had to deal with nuances of the hardware implementation. During the PC era, developers were usually programming directly on top of the operating system, with possibly a few libraries in between. As the world moved to the cloud era, the software stack got deeper and more complex. The top layer of the stack (on which most developers work) has moved further away from the hardware. The negative side of such evolution is that developers of modern applications have less affinity to the actual hardware on which their software is running. This book will help you build a strong connection with modern processors.
+
+There is a famous quote by Donald Knuth: "Premature optimization is the root of all evil". But the opposite is often true as well. Postponed performance engineering may be too late and cause as much evil as premature optimization. For developers working on performance-critical projects, it is crucial to know how the underlying hardware works. In such industries, it is a failure from the start when a program is being developed without a hardware focus. ClickHouse DB is an example of a successful software product that was built around a small but very efficient core. Performance characteristics of software must be a first-class citizen along with correctness and security starting from day 1. Poor performance can kill a product just as easily as security vulnerabilities.
 
-Performance engineering is important and rewarding work, but it may be very time-consuming. In fact, performance optimization is a never-ending game. There will always be something to optimize. Inevitably, the developer will reach the point of diminishing returns at which further improvement comes at a very high engineering cost and likely will not be worth the effort. Knowing when to stop optimizing is a critical aspect of performance work.
+Performance engineering is important and rewarding work, but it may be very time-consuming. In fact, performance optimization is a never-ending game. There will always be something to optimize. Inevitably, a developer will reach the point of diminishing returns at which further improvement comes at a very high engineering cost and likely will not be worth the effort. Knowing when to stop optimizing is a critical aspect of performance work.
diff --git a/chapters/1-Introduction/1-6 What is in the book.md b/chapters/1-Introduction/1-6 What is in the book.md
index 75649e7837..f72f7586f8 100644
--- a/chapters/1-Introduction/1-6 What is in the book.md
+++ b/chapters/1-Introduction/1-6 What is in the book.md
@@ -11,7 +11,7 @@ This book is written to help developers better understand the performance of the
 Hopefully, by the end of this book, you will be able to answer those questions.
 
-The book is split into two parts: performance analysis and performance optimization. The first part (chapters 2-7) teaches you how to find performance problems, and the second part (chapters 8-13) teaches you how to fix them. Here is the outline of the book chapters:
+The book is split into two parts. The first part (chapters 2-7) teaches you how to find performance problems, and the second part (chapters 8-13) teaches you how to fix them. Here is the outline of the book chapters:
 
 * Chapter 1 is an introduction that you're reading right now.
 * Chapter 2 discusses how to conduct fair performance experiments and analyze their results. It introduces the best practices for performance testing and comparing results.
@@ -27,7 +27,7 @@
 * Chapter 12 contains optimization topics not specifically related to any of the categories covered in the previous four chapters, but are still important enough to find their place in this book. In this chapter, we will discuss CPU-specific optimizations, examine several microarchitecture-related performance problems, explore techniques used for optimizing low-latency applications, and give you advice on tuning your system for the best performance.
 * Chapter 13 discusses techniques for analyzing multithreaded applications. It digs into some of the most important challenges of optimizing the performance of multithreaded applications. We provide a case study of five real-world multithreaded applications, where we explain why their performance doesn't scale with the increasing number of CPU threads. We also discuss cache coherency issues, such as "False Sharing" and a few tools that are designed to analyze multithreaded applications.
 
-Examples provided in this book are primarily based on open-source software: Linux as the operating system, the LLVM-based Clang compiler for C and C++ languages, and various open-source applications and benchmarks[^1] that you can build and run. The reason is not only the popularity of the projects but also the fact that their source code is open, which enables us to better understand the underlying mechanism of how they work. This is especially useful for learning the concepts presented in this book. This doesn't mean that we will never showcase proprietary tools. For example, we extensively use Intel® VTune™ Profiler.
+Examples provided in this book are primarily based on open-source software: Linux as the operating system, the LLVM-based Clang compiler for C and C++ languages, and various open-source applications and benchmarks[^1] that you can build and run. The reason is not only the popularity of these projects but also the fact that their source code is open, which enables us to better understand the underlying mechanism of how they work. This is especially useful for learning the concepts presented in this book. This doesn't mean that we will never showcase proprietary tools. For example, we extensively use Intel® VTune™ Profiler.
 
 Prior compiler experience helps a lot in performance-related work. Sometimes it's possible to obtain attractive speedups by forcing the compiler to generate desired machine code through various hints. You will find many such examples throughout the book. Luckily, most of the time, you don't have to be a compiler expert to drive performance improvements in your application. The majority of optimizations can be done at a source code level without the need to dig down into compiler sources.
diff --git a/chapters/1-Introduction/1-7 What is not in this book.md b/chapters/1-Introduction/1-7 What is not in this book.md
index 22d9af5be9..10ed61857c 100644
--- a/chapters/1-Introduction/1-7 What is not in this book.md
+++ b/chapters/1-Introduction/1-7 What is not in this book.md
@@ -6,9 +6,9 @@ Likewise, the software stack includes many layers, e.g., firmware, BIOS, OS, lib
 The scope of the book does not go beyond a single CPU socket, so we will not discuss optimization techniques for distributed, NUMA, and heterogeneous systems. Offloading computations to accelerators (GPU, FPGA, etc.) using solutions like OpenCL and openMP is not discussed in this book.
 
-I tried to make this book to be applicable to most modern CPUs, including Intel, AMD, Apple, and other ARM-based processors. I'm very sorry if it doesn't cover your favorite architecture. Nevertheless, many of the principles discussed in this book apply well to other processors. Similarly, most examples in this book were run on Linux, but again, most of the time it doesn't matter since the same techniques benefit applications that run on Windows and macOS operating systems.
+I tried to make this book applicable to most modern CPUs, including Intel, AMD, Apple, and other ARM-based processors. I'm sorry if it doesn't cover your favorite architecture. Nevertheless, many of the principles discussed in this book apply well to other processors. Similarly, most examples in this book were run on Linux, but again, most of the time it doesn't matter since the same techniques benefit applications that run on Windows and macOS operating systems.
 
-Code snippets in this book are written in C, C++, or x86 assembly languages, but to a large degree, ideas from this book can be applied to other languages that are compiled to native code like Rust, Go, and even Fortran. Since this book targets user-mode applications that run close to the hardware, we will not discuss managed environments, e.g., Java.
+Code snippets in this book are written in C or C++, but to a large degree, ideas from this book can be applied to other languages that are compiled to native code, like Rust, Go, and even Fortran. Since this book targets user-mode applications that run close to the hardware, we will not discuss managed environments, e.g., Java.
 
-Finally, the author assumes that readers have full control over the software that they develop, including the choice of libraries and compilers they use. Hence, this book is not about tuning purchased commercial packages, e.g., tuning SQL database queries.
+Finally, I assume that readers have full control over the software that they develop, including the choice of libraries and compilers they use. Hence, this book is not about tuning purchased commercial packages, e.g., tuning SQL database queries.
diff --git a/chapters/1-Introduction/1-8 Exercises.md b/chapters/1-Introduction/1-8 Exercises.md
index 67be2eb4cd..1da96cd5c8 100644
--- a/chapters/1-Introduction/1-8 Exercises.md
+++ b/chapters/1-Introduction/1-8 Exercises.md
@@ -2,8 +2,8 @@
 \markright{Exercises}
 
-As supplemental material for this book, we developed "Performance Ninja", a free online course where you can practice low-level performance analysis and tuning. It is available at the following URL: [https://github.com/dendibakh/perf-ninja](https://github.com/dendibakh/perf-ninja). It has a collection of lab assignments that focus on a specific performance problem. Each lab assignment can take anywhere from 30 minutes up to 4 hours depending on your background and the complexity of the lab assignment itself.
+As supplemental material for this book, I developed "Performance Ninja", a free online course where you can practice low-level performance analysis and tuning. It is available at the following URL: [https://github.com/dendibakh/perf-ninja](https://github.com/dendibakh/perf-ninja). It has a collection of lab assignments, each of which focuses on a specific performance problem. Each lab assignment can take anywhere from 30 minutes up to 4 hours, depending on your background and the complexity of the lab assignment itself.
 
-Following the name of the GitHub repository, we will use `perf-ninja` to refer to the online course. In the "Questions and Exercises" section at the end of each chapter, you may find assignments from `perf-ninja`. For example, when you see `perf-ninja::warmup`, this corresponds to the lab assignment that is located in the `labs/misc/warmup` folder in the aforementioned repository. We encourage you to solve these puzzles to solidify your knowledge.
+Following the name of the GitHub repository, we will use `perf-ninja` to refer to the online course. In the "Questions and Exercises" section at the end of each chapter, you may find assignments from `perf-ninja`. For example, when you see `perf-ninja::warmup`, this corresponds to the lab assignment with the name "Warmup" in the GitHub repository. We encourage you to solve these puzzles to solidify your knowledge.
 
-You can solve assignments on your local machine or submit your code changes to GitHub for automated verification and benchmarking. If you choose the latter, follow the instructions on the "Get Started" page of the repo. We also use examples from `perf-ninja` throughout the book. This enables you to reproduce a specific performance problem on your own machine and experiment with it.
+You can solve assignments on your local machine or submit your code changes to GitHub for automated verification and benchmarking. If you choose the latter, follow the instructions on the "Get Started" page of the repository. We also use examples from `perf-ninja` throughout the book. This enables you to reproduce a specific performance problem on your own machine and experiment with it.