diff --git a/chapters/10-Optimizing-Branch-Prediction/10-2 Replace branches with predication.md b/chapters/10-Optimizing-Branch-Prediction/10-2 Replace branches with predication.md
index 59ef7ba365..75ed09e78b 100644
--- a/chapters/10-Optimizing-Branch-Prediction/10-2 Replace branches with predication.md
+++ b/chapters/10-Optimizing-Branch-Prediction/10-2 Replace branches with predication.md
@@ -2,7 +2,7 @@
 typora-root-url: ..\..\img
 ---
 
-## Replace Branches with Predication
+## Replace Branches with Predication {#sec:BranchlessPredication}
 
 Some branches could be effectively eliminated by executing both parts of the branch and then selecting the right result (predication). Example[^1] of code when such transformation might be profitable is shown on [@lst:PredicatingBranches1]. If TMA suggests that the `if (cond)` branch has a very high number of mispredictions, one can try to eliminate the branch by doing the transformation shown on [@lst:PredicatingBranches2].
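+The following minimal sketch illustrates the idea; it is not the book's listing, and `computeX`/`computeY` are hypothetical helpers standing in for real work:
+
+```cpp
+int computeX() { return 1; } // hypothetical helpers
+int computeY() { return 2; }
+
+int branchy(bool cond) {
+  int a;
+  if (cond)            // hard-to-predict branch: costly when mispredicted
+    a = computeX();
+  else
+    a = computeY();
+  return a;
+}
+
+int predicated(bool cond) {
+  int x = computeX();  // both parts are executed unconditionally...
+  int y = computeY();
+  return cond ? x : y; // ...and the selection typically lowers to a CMOV
+}
+```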
diff --git a/chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md b/chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md
index 1eda636faf..fd5ca45182 100644
--- a/chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md
+++ b/chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md
@@ -4,24 +4,30 @@
 typora-root-url: ..\..\img
 ---
 
 ## Profile Guided Optimizations {#sec:secPGO}
 
-[TODO] Elaborate more on post-linker optimizers
+Compiling a program and generating an optimal assembly listing is all about heuristics. Code transformation algorithms have many corner cases that aim for optimal performance in specific situations. For many of the decisions that a compiler makes, it tries to guess the best choice based on some typical cases. For example, when deciding whether a particular function should be inlined, the compiler could take into account the number of times this function will be called. The problem is that the compiler doesn't know that beforehand; it would first need to run the program to find out. Without any runtime information, the compiler has to make a guess.
 
-Compiling a program and generating optimal assembly listing is all about heuristics. Code transformation algorithms have many corner cases that aim for optimal performance in specific situations. For a lot of decisions that a compiler makes, it tries to guess the best choice based on some typical cases. For example, when deciding whether a particular function should be inlined, the compiler could take into account the number of times this function will be called. The problem is that compiler doesn't know that beforehand.
+This is where profiling information becomes handy. Given profiling information, the compiler can make better optimization decisions. Most compilers have a set of transformations that can adjust their algorithms based on profiling data fed back to them. This set of transformations is called Profile Guided Optimizations (PGO). When profiling data is available, a compiler will use it; otherwise, it will fall back to its standard algorithms and heuristics. Sometimes in the literature, you can find the term Feedback Directed Optimizations (FDO), which refers to essentially the same thing as PGO.
 
-Here is when profiling information becomes handy. Given profiling information compiler can make better optimization decisions. There is a set of transformations in most compilers that can adjust their algorithms based on profiling data fed back to them. This set of transformations is called Profile Guided Optimizations (PGO). Sometimes in literature, one can find the term Feedback Directed Optimizations (FDO), which essentially refers to the same thing as PGO. Often times a compiler will rely on profiling data when it is available. Otherwise, it will fall back to using its standard algorithms and heuristics.
+Figure @fig:PGO_flow shows the traditional workflow of using PGO, also called *instrumented PGO*. First, you compile your program and tell the compiler to automatically instrument the code. This inserts bookkeeping code into every function and every basic block to collect runtime statistics. The second step is to run the instrumented binary with input data that represents a typical workload for your application. This generates the profiling data: a raw dump file with runtime statistics such as function call counts, loop iteration counts, and other basic block hit counts. The final step in this workflow is to recompile the program with the profiling data to produce an optimized executable.
 
-It is not uncommon to see real workloads performance increase by up to 15% from using Profile Guided Optimizations. PGO does not only improve inlining and code placement but also improves register allocation[^6]and more.
+![Instrumented PGO workflow.](../../img/cpu_fe_opts/pgo_flow.png){#fig:PGO_flow width=90%}
 
-Profiling data can be generated based on two different ways: code instrumentation (see [@sec:secInstrumentation]) and sample-based profiling (see [@sec:profiling]). Both are relatively easy to use and have associated benefits and drawbacks discussed in [@sec:secApproachesSummary].
+Developers can enable PGO instrumentation (step 1) in the LLVM compiler by building the program with the `-fprofile-instr-generate` option. This instructs the compiler to instrument the code so that it collects profiling information at runtime. After that, the LLVM compiler can consume the profiling data with the `-fprofile-instr-use` option to recompile the program and output a PGO-tuned binary. The guide for using PGO in Clang is described in the [documentation](https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation)[^7]. The GCC compiler uses a different set of options, `-fprofile-generate` and `-fprofile-use`, as described in the GCC [documentation](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options). A minimal sketch of the full flow with Clang is shown below.
 
-The first method can be utilized in the LLVM compiler by building the program with the `-fprofile-instr-generate` option. This will instruct the compiler to instrument the code, which will collect profiling information at runtime. After that, the LLVM compiler can consume profiling data with the `-fprofile-instr-use` option to recompile the program and output a PGO-tuned binary. The guide for using PGO in clang is described in the [documentation](https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation)[^7]. GCC compiler uses different set of options `-fprofile-generate` and `-fprofile-use` as described in GCC [documentation](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options).
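+Here is that sketch; the program and file names (`main.cpp`, `app.profraw`, and so on) are placeholders. Note that Clang's raw profiles must be indexed with `llvm-profdata` before the compiler can consume them:
+
+```bash
+# Step 1: build an instrumented binary.
+$ clang++ -O2 -fprofile-instr-generate main.cpp -o app.inst
+
+# Step 2: run it on a representative workload; the raw profile is written
+# to the file named by the LLVM_PROFILE_FILE environment variable.
+$ LLVM_PROFILE_FILE=app.profraw ./app.inst
+
+# Index the raw profile; the same command can merge profiles collected
+# from several different workloads.
+$ llvm-profdata merge -output=app.profdata app.profraw
+
+# Step 3: recompile with the profile to produce a PGO-tuned binary.
+$ clang++ -O2 -fprofile-instr-use=app.profdata main.cpp -o app
+```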
+PGO helps the compiler improve function inlining, code placement, register allocation, and a few more code transformations. PGO is mainly used in projects with a large codebase, for example, the Linux kernel, compilers, databases, web browsers, video games, productivity tools, and others. For applications with millions of lines of code, it is the only practical way to improve the machine code layout. It is not uncommon to see the performance of production workloads increase by 10-25% from using Profile Guided Optimizations.
 
-The second method, which is generating profiling data input for the compiler based on sampling, can be utilized thanks to [AutoFDO](https://github.com/google/autofdo)[^8] tool, which converts sampling data generated by Linux `perf` into a format that compilers like GCC and LLVM can understand. [@AutoFDO]
+While many software projects have adopted instrumented PGO as a part of their build process, the overall rate of adoption remains low. There are a few reasons for that. The primary reason is the high runtime overhead of instrumented executables. Running an instrumented binary and collecting profiling data frequently incurs a 5-10x slowdown, which makes the build step longer and prevents organizations from running the collection on production systems, whether on client devices or in the cloud. Unfortunately, you cannot collect the profiling data once and use it for all future builds: as the source code of an application evolves, the profile data becomes stale and needs to be recollected.
 
-Keep in mind that the compiler "blindly" uses the profile data you provided. The compiler assumes that all the workloads will behave the same, so it optimizes your app just for that single workload. Users of PGO should be careful about choosing the workload to profile because while improving one use case of the application, other might be pessimized. Luckily, it doesn't have to be exactly a single workload since profile data from different workloads can be merged together to represent a set of use cases for the application.
+Another caveat of the PGO flow is that a compiler should only be trained on the most frequent scenarios of how your application will be used; otherwise, you may end up degrading the program's performance. The compiler "blindly" uses the profile data that you provided. It assumes that the program will always behave the same no matter what the input data is. Users of PGO should be careful about choosing the input data they use for collecting profiling data (step 2), because while improving one use case of the application, others may be pessimized. Luckily, it doesn't have to be exactly one workload, since profile data from different workloads can be merged together (e.g., with `llvm-profdata merge`, as shown earlier) to represent a set of use cases for the application.
 
-In the mid-2018, Facebook open-sourced its binary relinker tool called [BOLT](https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/). BOLT works on the already compiled binary. It first disassembles the code, then it uses profile information to do various layout transformations (including basic blocks reordering, function splitting, and grouping) and generates optimized binary [@BOLT]. A similar tool was developed at Google called [Propeller](https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf), which serves a similar purpose as BOLT but claim certain advantages over it. It is possible to integrate optimizing relinker into the build system and enjoy an extra 5-10% performance speedup from the optimized code layout. The only thing one needs to worry about is to have a representative and meaningful workload for collecting profiling information.
+An alternative solution was pioneered by Google in 2016 with sample-based PGO. [@AutoFDO] Instead of instrumenting the code, the profiling data can be obtained right from the output of a standard profiling tool such as Linux `perf`. Google developed an open-source tool called [AutoFDO](https://github.com/google/autofdo)[^8] that converts sampling data generated by Linux `perf` into a format that compilers like GCC and LLVM can understand.
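+In practice, a sample-based PGO collection with AutoFDO might look like the minimal sketch below; binary and file names are placeholders:
+
+```bash
+# Profile the regular optimized binary; -b captures the LBR (last
+# branch record) samples that AutoFDO relies on.
+$ perf record -b ./app
+
+# Convert the perf samples into a profile the compiler understands
+# (create_llvm_prof ships with AutoFDO; for GCC there is create_gcov).
+$ create_llvm_prof --binary=./app --out=app.prof
+
+# Recompile using the sampled profile.
+$ clang++ -O2 -fprofile-sample-use=app.prof main.cpp -o app
+```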
-[^6]: because with PGO compiler can put all the hot variables into registers, etc.
-[^7]: PGO in Clang - [https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation](https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation).
-[^8]: AutoFDO - [https://github.com/google/autofdo](https://github.com/google/autofdo).
 
+This approach has a few advantages over instrumented PGO. First of all, it eliminates one step from the PGO build procedure, namely step 1, since there is no need to build an instrumented binary. Secondly, the profile collection runs on an already optimized binary, so it has a much lower runtime overhead. This makes it possible to collect profiling data in a production environment for a longer period of time. Since this approach is based on HW collection, it also enables new kinds of optimizations that are not possible with instrumented PGO. One example is branch-to-cmov conversion, a transformation that replaces conditional jumps with conditional moves (see [@sec:BranchlessPredication]). To perform this transformation effectively, a compiler needs to know how frequently a branch was mispredicted, and this information becomes available with sample-based PGO.
+
+The next innovative idea came from Meta in mid-2018, when it open-sourced its binary relinker tool called [BOLT](https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/).[^9] BOLT works on an already compiled binary. It first disassembles the code, then uses the profile information collected by a sampling profiler, such as Linux `perf`, to perform various layout transformations, and then relinks the binary. [@BOLT] As of today, BOLT has more than 15 optimization passes, including basic block reordering, function grouping and splitting, branch-to-cmov conversion, and others. Similar to traditional PGO, the primary candidates for BOLT optimizations are programs that suffer from many instruction cache and iTLB misses. BOLT has since been merged into the LLVM project and is available as a standalone tool.
+
+A few years after BOLT was introduced, Google open-sourced its own binary relinker tool called [Propeller](https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf), which serves a similar purpose but uses much less memory and scales better. Post-link optimizers such as BOLT and Propeller can be used in combination with traditional PGO and often provide an additional 5-10% performance speedup. Such techniques open up new kinds of binary rewriting optimizations that are based on HW telemetry.
+
+[^7]: PGO in Clang - [https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation](https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation)
+[^8]: AutoFDO - [https://github.com/google/autofdo](https://github.com/google/autofdo)
+[^9]: BOLT - [https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/](https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/)
diff --git a/img/cpu_fe_opts/pgo_flow.png b/img/cpu_fe_opts/pgo_flow.png
new file mode 100644
index 0000000000..cec96810a5
Binary files /dev/null and b/img/cpu_fe_opts/pgo_flow.png differ