[Chapter10] Predication -> Selection.
dendibakh committed Sep 17, 2024
1 parent f733ccc commit fc3db0c
Showing 4 changed files with 12 additions and 14 deletions.
@@ -20,7 +20,7 @@ In the past, developers had an option of providing a prediction hint to an x86 p

One indirect way to reduce branch mispredictions is to straighten the code using source-based and compiler-based techniques. PGO and BOLT are effective at reducing branch mispredictions because they improve the fallthrough rate, which alleviates pressure on branch predictor structures. We will discuss these techniques in the next chapter.

- So perhaps the only direct way to get rid of branch mispredictions is to get rid of the branch itself. In the two subsequent sections, we will take a look at how branches can be replaced with lookup tables and predication.
+ So perhaps the only direct way to get rid of branch mispredictions is to get rid of the branch itself. In the two subsequent sections, we will take a look at how branches can be replaced with lookup tables and selection.

Conventional wisdom says that never-taken branches are transparent to the branch predictor and can't affect performance, and that it therefore doesn't make much sense to remove them, at least from a prediction perspective. However, an experiment conducted by the authors of the BOLT optimizer demonstrated that replacing never-taken branches with equal-sized no-ops in an application with a large code footprint, such as the Clang C++ compiler, leads to approximately 5\% speedup on modern Intel CPUs. So it still pays to try to eliminate all branches.

@@ -1,12 +1,10 @@
- ## Replace Branches with Predication {#sec:BranchlessPredication}
+ ## Replace Branches with Selection {#sec:BranchlessSelection}

- [TODO]: Replace the word "predication" with "selection"
+ Some branches could be effectively eliminated by executing both parts of the branch and then selecting the right result. An example of code for which such a transformation might be profitable is shown in [@lst:ReplaceBranchesWithSelection]. If TMA suggests that the `if (cond)` branch has a very high number of mispredictions, you can try to eliminate the branch by doing the transformation shown on the right.

- Some branches could be effectively eliminated by executing both parts of the branch and then selecting the right result (*predication*). An example of code when such transformation might be profitable is shown in [@lst:PredicatingBranchesCode]. If TMA suggests that the `if (cond)` branch has a very high number of mispredictions, you can try to eliminate the branch by doing the transformation shown on the right.
+ Listing: Replacing Branches with Selection.

- Listing: Predicating branches.

- ~~~~ {#lst:PredicatingBranchesCode .cpp}
+ ~~~~ {#lst:ReplaceBranchesWithSelection .cpp}
int a; int x = computeX();
if (cond) { /* frequently mispredicted */ => int y = computeY();
a = computeX(); int a = cond ? x : y;
@@ -17,13 +15,13 @@ if (cond) { /* frequently mispredicted */ => int y = computeY();
[TODO]: Add continuation `foo(a);` in the code example.
- For the code on the right, the compiler can replace the branch that comes from the ternary operator, and generate a `CMOV` x86 instruction instead. A `CMOVcc` instruction checks the state of one or more of the status flags in the `EFLAGS` register (`CF, OF, PF, SF` and `ZF`) and performs a move operation if the flags are in a specified state or condition. A similar transformation can be done for floating-point numbers with `FCMOVcc,VMAXSS/VMINSS` instructions. [@lst:PredicatingBranchesAsm] shows assembly listings for the original and the branchless version.
+ For the code on the right, the compiler can replace the branch that comes from the ternary operator and generate a `CMOV` x86 instruction instead. A `CMOVcc` instruction checks the state of one or more status flags in the `EFLAGS` register (`CF`, `OF`, `PF`, `SF`, and `ZF`) and performs a move operation if the flags are in a specified state. A similar transformation can be done for floating-point numbers with the `FCMOVcc` and `VMAXSS`/`VMINSS` instructions. [@lst:ReplaceBranchesWithSelectionAsm] shows assembly listings for the original and the branchless version.
[TODO]: Name similar instructions on ARM side, e.g., `csel`.
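As an illustrative sketch (not one of the book's listings), here is a floating-point comparison that compilers typically lower to a branchless min instruction such as `MINSS`/`VMINSS` on x86, or `FMINNM`/`FCSEL` on AArch64; the function names below are hypothetical:

```cpp
#include <algorithm>

// Branchy version: may compile to a compare-and-jump sequence.
float minBranchy(float a, float b) {
  if (a < b) return a;
  return b;
}

// Branchless version: std::min on floats is typically lowered to
// MINSS (x86) or FMINNM/FCSEL (AArch64), avoiding a conditional branch.
float minBranchless(float a, float b) {
  return std::min(a, b);
}
```

Whether the compiler actually emits the branchless form depends on the target and optimization level, so inspect the generated assembly to confirm.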
- Listing: Predicating branches - x86 assembly code.
+ Listing: Replacing Branches with Selection - x86 assembly code.
- ~~~~ {#lst:PredicatingBranchesAsm .bash}
+ ~~~~ {#lst:ReplaceBranchesWithSelectionAsm .bash}
# original version # branchless version
400504: test edi,edi 400537: mov eax,0x0
400506: je 400514 40053c: call <computeX> # compute x; a = x
@@ -37,7 +35,7 @@ Listing: Predicating branches - x86 assembly code.
In contrast with the original version, the branchless version doesn't have jump instructions. However, the branchless version calculates both `x` and `y` independently, then selects one of the values and discards the other. While this transformation eliminates the penalty of a branch misprediction, it is potentially doing more work than the original code. The performance improvement in this case very much depends on the characteristics of the `computeX` and `computeY` functions. If the functions are small and the compiler can inline them, the transformation might bring noticeable performance benefits. If the functions are big, it might be cheaper to take the cost of a branch mispredict than to execute both functions.
- It is important to note that predication does not always benefit the performance of the application. The issue with predication is that it limits the parallel execution capabilities of the CPU. For the original version of the code, the CPU can predict that the branch will be taken, speculatively call `computeX`, and continue executing the rest of the program. This type of speculation is not possible for the branchless version as the CPU has to wait for the result of the `CMOVNE` instruction to proceed.
+ It is important to note that this technique does not always benefit the performance of the application. The issue with selection is that it limits the parallel execution capabilities of the CPU. For the original version of the code, the CPU can predict that the branch will be taken, speculatively call `computeX`, and continue executing the rest of the program. This type of speculation is not possible for the branchless version as the CPU has to wait for the result of the `CMOVNE` instruction to proceed.
[TODO]: Provide better explanation of the tradeoffs. Maybe I don't need this binary search SO discussion?
Expand All @@ -48,7 +46,7 @@ The typical example of the tradeoffs involved when choosing between the regular
The binary search is a great example that shows tradeoffs between standard and branchless implementations. The real-world scenario can be more difficult to analyze, so again, measure to find out if it would be beneficial to replace branches in your case.
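To make the tradeoff concrete, here is a hedged sketch (not taken from the book) of a branchless binary search in the style of `std::lower_bound`: the comparison feeds a pointer select that compilers can turn into `CMOV`, trading a hard-to-predict branch for a data dependency.

```cpp
#include <cstddef>

// Branchless lower_bound over a sorted array. The ternary selects the
// next half without a control-flow branch; compilers can lower it to CMOV.
const int* lowerBoundBranchless(const int* base, size_t n, int key) {
  while (n > 1) {
    size_t half = n / 2;
    base = (base[half] < key) ? base + half : base; // candidate for CMOV
    n -= half;
  }
  // Step past the last element if it is still smaller than the key.
  return base + (n == 1 && *base < key);
}
```

With random keys the branchy version mispredicts roughly half the time, so the data-dependency chain above can win; with predictable access patterns the branchy version often wins instead.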
- Without profiling data, compilers don't have visibility into the misprediction rates. As a result, compilers usually prefer to generate branches, i.e., the original version, by default. They are conservative at using predication and may resist generating `CMOV` instructions even in simple cases. Again, the tradeoffs are complicated, and it is hard to make the right decision without the runtime data. Hardware-based PGO (see [@sec:secPGO]) will be a huge step forward here. Also, there is a way to indicate to the compiler that a branch condition is unpredictable by hardware mechanisms. Starting from Clang-17, the compiler now respects a `__builtin_unpredictable`, which can be very effective at replacing unpredictable branches with `CMOV` x86 instructions. For example:
+ Without profiling data, compilers don't have visibility into the misprediction rates. As a result, compilers usually prefer to generate branches, i.e., the original version, by default. They are conservative at using selection and may resist generating `CMOV` instructions even in simple cases. Again, the tradeoffs are complicated, and it is hard to make the right decision without the runtime data. Hardware-based PGO (see [@sec:secPGO]) will be a huge step forward here. Also, there is a way to indicate to the compiler that a branch condition is unpredictable by hardware mechanisms. Starting from Clang-17, the compiler respects the `__builtin_unpredictable` hint, which can be very effective at replacing unpredictable branches with `CMOV` x86 instructions. For example:
[TODO]: WTF? I need a better example.
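A hedged sketch of how the hint can be applied (the function name is hypothetical; the builtin is guarded so the code also compiles with compilers that lack it):

```cpp
// __builtin_unpredictable is available in Clang 17+; fall back to a
// no-op elsewhere (e.g., GCC, which does not provide this builtin).
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif
#if __has_builtin(__builtin_unpredictable)
#define UNPREDICTABLE(x) __builtin_unpredictable(x)
#else
#define UNPREDICTABLE(x) (x)
#endif

// Telling the compiler the condition has no predictable pattern
// encourages it to emit CMOV instead of a conditional branch.
int clampPositive(int v) {
  return UNPREDICTABLE(v < 0) ? 0 : v;
}
```

The hint only affects code generation; the program's semantics are unchanged, so it is safe to apply and then verify the emitted assembly.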
@@ -3,7 +3,7 @@
\markright{Summary}

* Modern processors are very good at predicting branch outcomes. So, we recommend starting the work on fixing branch mispredictions only when the TMA report points to a high `Bad Speculation` metric.
- * When branch outcome patterns become hard for the CPU branch predictor to follow, the performance of the application may suffer. In this case, the branchless version of an algorithm can be more performant. In this chapter, we showed how branches could be replaced with lookup tables, arithmetic, and predication. In some situations, it is also possible to use compiler intrinsics to eliminate branches, as shown in [@IntelAvoidingBrMisp].
+ * When branch outcome patterns become hard for the CPU branch predictor to follow, the performance of the application may suffer. In this case, the branchless version of an algorithm can be more performant. In this chapter, we showed how branches could be replaced with lookup tables, arithmetic, and selection. In some situations, it is also possible to use compiler intrinsics to eliminate branches, as shown in [@IntelAvoidingBrMisp].
* Branchless algorithms are not universally beneficial. Always measure to find out what works better in your specific case.
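As a brief illustration of the lookup-table technique mentioned above (a sketch, not a listing from the book; function names are hypothetical):

```cpp
#include <cstdint>

// Branchy version: the conditional may be mispredicted on mixed inputs.
char toHexBranchy(uint8_t nibble) {
  if (nibble < 10) return '0' + nibble;
  return 'a' + (nibble - 10);
}

// Branchless version: a 16-entry lookup table removes the branch entirely.
char toHexLUT(uint8_t nibble) {
  static const char lut[] = "0123456789abcdef";
  return lut[nibble & 0xF];
}
```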

\sectionbreak
2 changes: 1 addition & 1 deletion chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md
@@ -20,7 +20,7 @@ Another caveat in the PGO flow is that a compiler should only be trained using r

An alternative solution was pioneered by Google in 2016 with sample-based PGO [@AutoFDO]. Instead of instrumenting the code, the profiling data can be obtained from the output of a standard profiling tool such as Linux `perf`. Google developed an open-source tool called [AutoFDO](https://github.com/google/autofdo)[^8] that converts sampling data generated by Linux `perf` into a format that compilers like GCC and LLVM can understand.

- This approach has a few advantages over instrumented PGO. First of all, it eliminates one step from the PGO build workflow, namely step 1 since there is no need to build an instrumented binary. Secondly, profiling data collection runs on an already optimized binary, thus it has a much lower runtime overhead. This makes it possible to collect profiling data in a production environment for a longer time. Since this approach is based on hardware collection, it also enables new kinds of optimizations that are not possible with instrumented PGO. One example is branch-to-cmov conversion, which is a transformation that replaces conditional jumps with conditional moves to avoid the cost of a branch misprediction (see [@sec:BranchlessPredication]). To effectively perform this transformation, a compiler needs to know how frequently the original branch was mispredicted. This information is available with sample-based PGO on modern CPUs (Intel Skylake+).
+ This approach has a few advantages over instrumented PGO. First of all, it eliminates one step from the PGO build workflow, namely step 1 since there is no need to build an instrumented binary. Secondly, profiling data collection runs on an already optimized binary, thus it has a much lower runtime overhead. This makes it possible to collect profiling data in a production environment for a longer time. Since this approach is based on hardware collection, it also enables new kinds of optimizations that are not possible with instrumented PGO. One example is branch-to-cmov conversion, which is a transformation that replaces conditional jumps with conditional moves to avoid the cost of a branch misprediction (see [@sec:BranchlessSelection]). To effectively perform this transformation, a compiler needs to know how frequently the original branch was mispredicted. This information is available with sample-based PGO on modern CPUs (Intel Skylake+).

The next innovative idea came from Meta in mid-2018, when it open-sourced its binary optimization tool called [BOLT](https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/).[^9] BOLT works on an already compiled binary. It first disassembles the code, then uses the profile information collected by a sampling profiler, such as Linux `perf`, to do various layout transformations, and then relinks the binary [@BOLT]. As of today, BOLT has more than 15 optimization passes, including basic block reordering, function splitting and reordering, and others. Similar to traditional PGO, the primary candidates for BOLT optimizations are programs that suffer from instruction cache and iTLB misses. Since January 2022, BOLT has been a part of the LLVM project and is available as a standalone tool.
