From 1cb1a7763f6b922ffbda1bc2bbe138d0d3453a98 Mon Sep 17 00:00:00 2001
From: Denis Bakhvalov
Date: Sun, 22 Sep 2024 13:54:00 -0400
Subject: [PATCH] [Grammar] Update 10-3 Replace branches with predication.md

---
 .../10-3 Replace branches with predication.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/chapters/10-Optimizing-Branch-Prediction/10-3 Replace branches with predication.md b/chapters/10-Optimizing-Branch-Prediction/10-3 Replace branches with predication.md
index c565482d56..d50ec92c0d 100644
--- a/chapters/10-Optimizing-Branch-Prediction/10-3 Replace branches with predication.md
+++ b/chapters/10-Optimizing-Branch-Prediction/10-3 Replace branches with predication.md
@@ -35,13 +35,13 @@ Listing: Replacing Branches with Selection - x86 assembly code.
 We already know that the branch in the original version on the left is hard to predict. This is what motivates us to try a branchless version in the first place. In this example, the performance gain of this change depends on the characteristics of `computeX` and `computeY` functions. If the functions are small[^1] and the compiler can inline them, then selection might bring noticeable performance benefits. If the functions are big[^2], it might be cheaper to take the cost of a branch mispredict than to execute both `computeX` and `computeY` functions.
 
-Take a look at [@lst:ReplaceBranchesWithSelectionAsm] one more time. One the left, a processor can predict, for example, that the `je 400514` branch will be taken, speculatively call `computeY`, and start running code from the function `foo`. Remember, branch prediction usually happens many cycles before we know the actual result. By the time we start resolving the branch, we could be already halfway through `foo` function, despite it is still speculative. If we were correct, we've saved a lot of cycles. If we were wrong, we have to take the penalty and start over from the correct path. In the latter case, we don't gain anything from the fact that we have already completed a portion of `foo`, it all must be thrown away. If the mispredictions occur too often, the recovering penalty outweighs the gains from speculative execution.
+Take a look at [@lst:ReplaceBranchesWithSelectionAsm] one more time. On the left, a processor can predict, for example, that the `je 400514` branch will be taken, speculatively call `computeY`, and start running code from the function `foo`. Remember, branch prediction usually happens many cycles before we know the actual result. By the time we start resolving the branch, we could be already halfway through the `foo` function, despite it is still speculative. If we are correct, we've saved a lot of cycles. If we are wrong, we have to take the penalty and start over from the correct path. In the latter case, we don't gain anything from the fact that we have already completed a portion of `foo`, it all must be thrown away. If the mispredictions occur too often, the recovering penalty outweighs the gains from speculative execution.
 
 With conditional selection, it is different. There are no branches, so the processor doesn't have to speculate. It can execute `computeX` and `computeY` functions in parallel. However, it cannot start running the code from `foo` until it computes the result of the `CMOVNE` instruction since `foo` uses it as an argument (data dependency). When you use conditional select instructions, you convert a control flow dependency into a data flow dependency.
 
-To sum it up, for small `if ... else` statements that perform simple operations, conditional selects can be more efficient than branches, but only if the branch is hard to predict. So don't force compiler to generate conditional selects for every conditional statement. For conditional statements that are always correctly predicted, having a branch instruction is likely an optimal choice, because you allow the processor to speculate (correctly) and run ahead of the actual execution. And don't forget to measure the impact of your changes.
+To sum it up, for small `if ... else` statements that perform simple operations, conditional selects can be more efficient than branches, but only if the branch is hard to predict. So don't force the compiler to generate conditional selects for every conditional statement. For conditional statements that are always correctly predicted, having a branch instruction is likely an optimal choice, because you allow the processor to speculate (correctly) and run ahead of the actual execution. And don't forget to measure the impact of your changes.
 
-Without profiling data, compilers don't have visibility into the misprediction rates. As a result, they usually prefer to generate branch instructions by default. Compilers are conservative at using selection and may resist generating `CMOV` instructions even in simple cases. Again, the tradeoffs are complicated, and it is hard to make the right decision without the runtime data.[^4] Starting from Clang-17, the compiler now honors a `__builtin_unpredictable` hint for the x86 target, which indicates to the compiler that a branch condition is unpredictable. It can help influencing the compiler's decision but does not guarantee that the `CMOV` instruction will be generated. For example:
+Without profiling data, compilers don't have visibility into the misprediction rates. As a result, they usually prefer to generate branch instructions by default. Compilers are conservative at using selection and may resist generating `CMOV` instructions even in simple cases. Again, the tradeoffs are complicated, and it is hard to make the right decision without the runtime data.[^4] Starting from Clang-17, the compiler now honors a `__builtin_unpredictable` hint for the x86 target, which indicates to the compiler that a branch condition is unpredictable. It can help influence the compiler's decision but does not guarantee that the `CMOV` instruction will be generated. For example:
 
 ```cpp
 int a;
 ...
@@ -52,6 +52,6 @@ if (__builtin_unpredictable(cond)) {
 }
 ```
 
-[^1]: Just a handfull instructions that can be completed in a few cycles.
+[^1]: Just a handful of instructions that can be completed in a few cycles.
 [^2]: More than twenty instructions that take more than twenty cycles.
 [^4]: Hardware-based PGO (see [@sec:secPGO]) will be a huge step forward here.
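
Below is a minimal, self-contained sketch of the pattern described in the updated text. It is illustrative only: `computeX`, `computeY`, and `cond` mirror the names used in the book's listing, while the `select` wrapper and the Clang version check are assumptions made for this example; whether a `CMOV` is actually emitted remains the compiler's decision.

```cpp
// Illustrative sketch: computeX/computeY stand in for the small, inlinable
// functions from the book's listing; select() is a hypothetical wrapper.
inline int computeX() { return 1; }
inline int computeY() { return 2; }

int select(bool cond) {
#if defined(__clang__) && (__clang_major__ >= 17)
  // Hint that `cond` is hard to predict; Clang 17+ may then prefer a
  // conditional select (CMOV on x86) over a branch. Not guaranteed.
  if (__builtin_unpredictable(cond))
    return computeX();
  else
    return computeY();
#else
  // Other compilers: a plain conditional; codegen is the compiler's choice.
  return cond ? computeX() : computeY();
#endif
}
```

Inspecting the generated assembly with and without the hint is the only way to confirm which form the compiler actually chose.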