Commit

[Grammar] 11-3 Basic Block Placement.md
dendibakh authored Aug 10, 2024
1 parent d8c4948 commit 7637691
Showing 1 changed file with 7 additions and 9 deletions.
@@ -1,5 +1,3 @@
-
-
## Basic Block Placement {#sec:secLIKELY}

Suppose we have a hot path in the program that has some error handling code (`coldFunc`) in between:
@@ -10,7 +8,7 @@ if (cond)
coldFunc();
// hot path again
```
-Figure @fig:BBLayout shows two possible physical layouts for this snippet of code. Figure @fig:BB_default is the layout most compiler will emit by default, given no hints provided. The layout that is shown in Figure @fig:BB_better can be achieved if we invert the condition `cond` and place hot code as fall through.
+Figure @fig:BBLayout shows two possible physical layouts for this snippet of code. Figure @fig:BB_default is the layout most compilers will emit by default, given no hints are provided. The layout that is shown in Figure @fig:BB_better can be achieved if we invert the condition `cond` and place hot code as fall through.

<div id="fig:BBLayout">
![default layout](../../img/cpu_fe_opts/BBLayout_Default.png){#fig:BB_default width=50%}
@@ -19,11 +17,11 @@ Figure @fig:BBLayout shows two possible physical layouts for this snippet of cod
Two versions of machine code layout for the snippet of code above.
</div>

-Which layout is better? Well, it depends on whether `cond` is usually true or false. If `cond` is usually true, then we would better choose the default layout because otherwise, we would be doing two jumps instead of one. Also, in the general case, if `coldFunc` is a relatively small function, we would want to have it inlined. However, in this particular example, we know that `coldFunc` is an error handling function and is likely not executed very often. By choosing layout @fig:BB_better, we maintain fall through between hot pieces of the code and convert taken branch into not taken one.
+Which layout is better? Well, it depends on whether `cond` is usually true or false. If `cond` is usually true, then we would better choose the default layout because otherwise, we would be doing two jumps instead of one. Also, in the general case, if `coldFunc` is a relatively small function, we would want to have it inlined. However, in this particular example, we know that `coldFunc` is an error-handling function and is likely not executed very often. By choosing layout @fig:BB_better, we maintain fall through between hot pieces of the code and convert the taken branch into not taken one.

-There are a few reasons why the layout presented in Figure @fig:BB_better performs better. First of all, layout in Figure @fig:BB_better makes better use of the instruction and $\mu$op-cache (DSB, see [@sec:uarchFE]). With all hot code contiguous, there is no cache line fragmentation: all the cache lines in the L1I-cache are used by hot code. The same is true for the $\mu$op-cache since it caches based on the underlying code layout as well. Secondly, taken branches are also more expensive for the fetch unit. The Front-End of a CPU fetches contiguous chunks of bytes, so every taken jump means the bytes after the jump are useless. This reduces the maximum effective fetch throughput. Finally, on some architectures, not taken branches are fundamentally cheaper than taken. For instance, Intel Skylake CPUs can execute two untaken branches per cycle but only one taken branch every two cycles.[^2]
+There are a few reasons why the layout presented in Figure @fig:BB_better performs better. First of all, the layout in Figure @fig:BB_better makes better use of the instruction and $\mu$op-cache (DSB, see [@sec:uarchFE]). With all hot code contiguous, there is no cache line fragmentation: all the cache lines in the L1I-cache are used by hot code. The same is true for the $\mu$op-cache since it caches based on the underlying code layout as well. Secondly, taken branches are also more expensive for the fetch unit. The Front-End of a CPU fetches contiguous chunks of bytes, so every taken jump means the bytes after the jump are useless. This reduces the maximum effective fetch throughput. Finally, on some architectures, not-taken branches are fundamentally cheaper than taken. For instance, Intel Skylake CPUs can execute two untaken branches per cycle but only one taken branch every two cycles.[^2]

-To suggest a compiler to generate an improved version of the machine code layout, one can provide a hint using `[[likely]]` and `[[unlikely]]` attributes, which is available since C++20. The code that uses this hint will look like this:
+To suggest a compiler to generate an improved version of the machine code layout, one can provide a hint using `[[likely]]` and `[[unlikely]]` attributes, which have been available since C++20. The code that uses this hint will look like this:

```cpp
// hot path
@@ -32,7 +30,7 @@ if (cond) [[unlikely]]
// hot path again
```

-In the code above, `[[unlikely]]` hint will instruct the compiler that `cond` is unlikely to be true, so compiler should adjust the code layout accordingly. Prior to C++20, developers could have used [`__builtin_expect`](https://llvm.org/docs/BranchWeightMetadata.html#builtin-expect)[^3] construct and they usually created `LIKELY` wrapper hints themselves to make the code more readable. For example:
+In the code above, the `[[unlikely]]` hint will instruct the compiler that `cond` is unlikely to be true, so the compiler should adjust the code layout accordingly. Prior to C++20, developers could have used [`__builtin_expect`](https://llvm.org/docs/BranchWeightMetadata.html#builtin-expect)[^3] construct and they usually created `LIKELY` wrapper hints themselves to make the code more readable. For example:

```cpp
#define LIKELY(EXPR) __builtin_expect((bool)(EXPR), true)
@@ -43,7 +41,7 @@ if (UNLIKELY(cond)) // NOT
// hot path again
```
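
As an illustration only, the two hinting styles discussed above can be put side by side in one small translation unit. This sketch assumes GCC or Clang (for `__builtin_expect`) and a C++20 compiler (for `[[unlikely]]`). The names `reportNegative`, `sumNonNegative`, and `sumNonNegativeCpp20` are hypothetical and do not come from the text, and the `UNLIKELY` definition is the assumed counterpart of the `LIKELY` macro.

```cpp
#include <cstddef>

// Assumption: GCC/Clang builtin; other compilers may not provide it.
#define LIKELY(EXPR)   __builtin_expect((bool)(EXPR), true)
#define UNLIKELY(EXPR) __builtin_expect((bool)(EXPR), false)

// Hypothetical cold error handler: rarely called, returns an error code.
int reportNegative(int /*value*/) { return -1; }

// Pre-C++20 style: sums the array, bailing out on the first negative value.
int sumNonNegative(const int* data, std::size_t n) {
  int sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (UNLIKELY(data[i] < 0))        // hint: the error path is cold
      return reportNegative(data[i]);
    sum += data[i];                   // hot path stays on the fall through
  }
  return sum;
}

// C++20 style: same logic, expressed with the standard attribute.
int sumNonNegativeCpp20(const int* data, std::size_t n) {
  int sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (data[i] < 0) [[unlikely]]
      return reportNegative(data[i]);
    sum += data[i];
  }
  return sum;
}
```

Both functions compute the same result; the hints only influence how the compiler may arrange and optimize the generated code. Whether the layout actually changes depends on the compiler and optimization level, so it is worth inspecting the disassembly rather than assuming an effect.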

-Optimizing compilers will not only improve code layout when they encounter "likely/unlikely" hints. They will also leverage this information in other places. For example, when `[[unlikely]]` attribute is applied, the compiler will prevent inlining `coldFunc` since it now knows that it is unlikely to be executed often and it's more beneficial to optimize it for size, i.e., just leave a `CALL` to this function. Inserting `[[likely]]` attribute is also possible for a switch statement as presented in [@lst:BuiltinSwitch].
+Optimizing compilers will not only improve code layout when they encounter "likely/unlikely" hints. They will also leverage this information in other places. For example, when the `[[unlikely]]` attribute is applied, the compiler will prevent inlining `coldFunc` since it now knows that it is unlikely to be executed often and it's more beneficial to optimize it for size, i.e., just leave a `CALL` to this function. Inserting the `[[likely]]` attribute is also possible for a switch statement as presented in [@lst:BuiltinSwitch].

Listing: Likely attribute used in a switch statement

@@ -60,6 +58,6 @@ for (;;) {
Using this hint, a compiler will be able to reorder code a little bit differently and optimize the hot switch for faster processing of `ADD` instructions.
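
The listing itself is collapsed in this diff, so here is a minimal sketch of what a `[[likely]]` hint on a switch case can look like. It is not the author's listing: the `Op` enum, the `run` function, and the toy accumulator semantics are all hypothetical, and a C++20 compiler is assumed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical instruction set for a toy interpreter.
enum class Op { ADD, SUB, HALT };

// Executes the program on a single accumulator and returns its final value.
int run(const std::vector<Op>& program) {
  int acc = 0;
  for (std::size_t pc = 0; pc < program.size(); ++pc) {
    switch (program[pc]) {
      [[likely]] case Op::ADD:  // hint: ADD is assumed to be the common case
        acc += 1;
        break;
      case Op::SUB:
        acc -= 1;
        break;
      case Op::HALT:
        return acc;             // stop early, ignoring remaining instructions
    }
  }
  return acc;
}
```

For example, `run({Op::ADD, Op::ADD, Op::SUB, Op::HALT, Op::ADD})` stops at `HALT` with an accumulator of 1. The attribute does not change this result; it only tells the compiler which case it may favor when ordering the dispatch code.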
-[^2]: Though, there is a special small loop optimization that allows very small loops to have one taken branch per cycle.
+[^2]: However, there is a special small loop optimization that allows very small loops to have one taken branch per cycle.
[^3]: More about builtin-expect here: [https://llvm.org/docs/BranchWeightMetadata.html#builtin-expect](https://llvm.org/docs/BranchWeightMetadata.html#builtin-expect).
[^10]: C++ standard `[[likely]]` attribute: [https://en.cppreference.com/w/cpp/language/attributes/likely](https://en.cppreference.com/w/cpp/language/attributes/likely).
