
Chapter 11 suggestions (#35)
* 11-1: suggestions

* 11-6: hfsort suggestions

* 11-8: BOLT hugify

* 11-10: Suggestions

* Minor changes + some additions

* Update chapters/11-Machine-Code-Layout-Optimizations/11-8 Reducing ITLB misses.md

Co-authored-by: Amir Ayupov <[email protected]>

---------

Co-authored-by: dbakhval <dbakhval@DBAKHVAL-MOBL>
Co-authored-by: Denis Bakhvalov <[email protected]>
3 people authored Dec 11, 2023
1 parent 4e7cf68 commit 7aafd4c
Showing 4 changed files with 18 additions and 14 deletions.
@@ -12,11 +12,11 @@ Most of the time, inefficiencies in the CPU FE can be described as a situation w

The TMA methodology captures FE performance issues in the `Front-End Bound` metric. It represents the percentage of cycles when the CPU FE was not able to deliver instructions to the BE, even though the BE could have accepted them. Most real-world applications experience a non-zero `Front-End Bound` metric, meaning that some percentage of running time will be lost on suboptimal instruction fetching and decoding. Below 10\% is the norm. If you see the `Front-End Bound` metric exceeding 20\%, it's definitely worth spending time on it.

- There could be many reasons why FE cannot deliver instructions to the execution units. Most of the time, it is due to suboptimal code layout, whcih leads to the poor I-cache and ITLB utilization. Applications with a large codebase, e.g. millions lines of code, are especially vulnerable to FE performance issues. In this chapter, we will take a look at some typical optimizations to improve machine code layout and increase the overall performance of the program.
+ There could be many reasons why the FE cannot deliver instructions to the execution units. Most of the time, it is due to suboptimal code layout, which leads to poor I-cache and ITLB utilization. Applications with a large codebase, e.g., millions of lines of code, are especially vulnerable to FE performance issues. In this chapter, we will take a look at some typical optimizations that improve the machine code layout and increase the overall performance of the program.

## Machine Code Layout

- When a compiler translates a source code into machine code, it generates a serial byte sequence. [@lst:MachineCodeLayout] shows an example of a physical layout for a small snippet of C++ code. Once compiler finished generating assembly instructions, it needs to encode them and lay out in memory sequentially.
+ When a compiler translates source code into machine code, it generates a linear byte sequence. [@lst:MachineCodeLayout] shows an example of a binary layout for a small snippet of C++ code. Once the compiler has finished generating assembly instructions, it needs to encode them and lay them out in memory sequentially.

Listing: Example of machine code layout

@@ -32,6 +32,8 @@ reorder functions utilization hot functions

Table: Summary of CPU Front-End optimizations. {#tbl:CPU_FE_OPT}

- \personal{I think code layout improvements are often underestimated and end up being omitted and forgotten. I agree that you might want to start with low hanging fruits like loop unrolling and vectorization opportunities. But knowing that you might get an extra 5-10\% just from better laying out the machine code is still useful. It is usually the best option to use PGO, Bolt, and other tools if you can come up with a set of typical use cases for your application.}
+ * Code layout improvements are often underestimated and end up being omitted and forgotten. CPU Front-End performance issues like I-cache and ITLB misses represent a large portion of wasted cycles, especially for applications with large codebases. But even small- and medium-sized applications can benefit from optimizing the machine code layout.
+ * It is usually not the first thing developers turn their attention to when trying to improve the performance of their applications. They prefer to start with low-hanging fruit like loop unrolling and vectorization. However, knowing that you might get an extra 5-10\% just from a better machine code layout is still useful.
+ * It is usually best to use LTO, PGO, BOLT, and other tools if you can come up with a set of typical use cases for your application. For large applications, this is the only practical way to improve the machine code layout.

\sectionbreak
@@ -21,8 +21,11 @@ Similar to previous optimizations, function reordering improves the utilization

The linker is responsible for laying out all the functions of the program in the resulting binary output. While developers can try to reorder functions in a program themselves, there is no guarantee of getting the desired physical layout. For decades, people have been using linker scripts to achieve this goal, and this is still the way to go if you are using the GNU linker. The Gold linker (`ld.gold`) offers an easier approach to this problem. To get the desired ordering of functions in the binary with the Gold linker, first compile the code with the `-ffunction-sections` flag, which puts each function into a separate section, then use the [`--section-ordering-file=order.txt`](https://manpages.debian.org/unstable/binutils/x86_64-linux-gnu-ld.gold.1.en.html) option to provide a file with a sorted list of function names that reflects the desired final layout. The same feature exists in the LLD linker, which is part of the LLVM compiler infrastructure and is accessible via the `--symbol-ordering-file` option.
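
To make this concrete, here is a minimal sketch of function reordering with LLD; the function names and the contents of `order.txt` are hypothetical (for C++, the file must contain mangled symbol names):

```bash
# Hypothetical layout: list the hottest functions first, one per line.
$ cat order.txt
hot_update
helper
main

# Give each function its own section, then let LLD place them in the order above.
$ clang++ -O2 -ffunction-sections -fuse-ld=lld \
    -Wl,--symbol-ordering-file=order.txt main.cpp -o a.out
```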

- An interesting approach to solving the problem of grouping hot functions together is implemented in the tool called [HFSort](https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort).[^1] It is a tool that generates the section ordering file automatically based on profiling data [@HfSort]. Using this tool, engineers from Meta got a 2% performance speedup of large distributed cloud applications like Facebook, Baidu, and Wikipedia. Right now, HFSort is integrated into Meta's HHVM project and is not available as a standalone tool. However, the LLD linker employs an implementation[^2] of the HFSort algorithm, which sorts sections based on the profiling data.
+ An interesting approach to solving the problem of grouping hot functions together was introduced in 2017 by engineers from Meta. They implemented a tool called [HFSort](https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort)[^1] that generates the section ordering file automatically based on profiling data [@HfSort]. Using this tool, they observed a 2\% performance speedup of large distributed cloud applications like Facebook, Baidu, and Wikipedia. HFSort has been integrated into Meta's HHVM, LLVM BOLT, and the LLD linker[^2]. Since then, the algorithm has been superseded first by HFSort+, and most recently by Cache-Directed Sort (CDSort[^3]), with further improvements for workloads with a large code footprint.

- [^1]: HFSort - [https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort](https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort).
- [^2]: HFSort in LLD - [https://github.com/llvm-project/lld/blob/master/ELF/CallGraphSort.cpp](https://github.com/llvm-project/lld/blob/master/ELF/CallGraphSort.cpp).
+ [^1]: HFSort - [https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort](https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort)
+ [^2]: HFSort in LLD - [https://github.com/llvm-project/lld/blob/master/ELF/CallGraphSort.cpp](https://github.com/llvm-project/lld/blob/master/ELF/CallGraphSort.cpp)
+ [^3]: Cache-Directed Sort in LLVM - [https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/CodeLayout.cpp](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/CodeLayout.cpp)
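
As an illustration of the profile-driven ordering described above, LLD can apply its HFSort-based section sorting from a call-graph profile supplied via `--call-graph-ordering-file`. A minimal sketch, with hypothetical symbol names and call counts:

```bash
# Hypothetical call-graph profile: "<caller> <callee> <call count>" per line.
$ cat callgraph.txt
main foo 100
foo bar 90
main init 1

# LLD groups the hottest caller-callee chains together based on the profile.
$ clang++ -O2 -ffunction-sections -fuse-ld=lld \
    -Wl,--call-graph-ordering-file=callgraph.txt main.o -o a.out
```
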
@@ -4,17 +4,15 @@ typora-root-url: ..\..\img

## Reducing ITLB Misses {#sec:FeTLB}

- Another important area of tuning FE efficiency is virtual-to-physical address translation of memory addresses. Primarily those translations are served by TLB (see [@sec:TLBs]), which caches most recently used memory page translations in dedicated entries. When TLB cannot serve the translation request, a time-consuming page walk of the kernel page table takes place to calculate the correct physical address for each referenced virtual address. Whenever you see a high percentage of [ITLB overhead](https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/front-end-bound/itlb-overhead.html) [^11] in the TMA summary, the advice in this section may become handy.
+ Another important area of tuning FE efficiency is the virtual-to-physical translation of memory addresses. Those translations are primarily served by the TLB (see [@sec:TLBs]), which caches the most recently used memory page translations in dedicated entries. When the TLB cannot serve a translation request, a time-consuming page walk of the kernel page table takes place to calculate the correct physical address for the referenced virtual address. Whenever you see a high percentage of ITLB overhead in the TMA summary, the advice in this section may come in handy.

In general, relatively small applications are not susceptible to ITLB misses. For example, the Golden Cove microarchitecture can cover a memory space of up to 1MB in its ITLB. If the machine code of your application fits in 1MB, you should not be affected by ITLB misses. Problems start to appear when frequently executed parts of an application are scattered around memory. When many functions begin to frequently call each other, they start competing for entries in the ITLB. One example is the Clang compiler, which, at the time of writing, has a code section of ~60MB. Its ITLB overhead when running on a laptop with a mainstream Intel CoffeeLake processor is ~7%, which means that 7% of cycles are wasted handling ITLB misses: performing page walks and populating TLB entries.
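
A quick way to gauge whether your application suffers from this is to count ITLB events directly. A sketch using Linux `perf`; the event names are generic aliases whose availability varies across CPUs and kernel versions:

```bash
# A high miss-to-access ratio hints at poor code layout or a large code footprint.
$ perf stat -e iTLB-loads,iTLB-load-misses -- ./a.out
```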

- Another set of large memory applications that frequently benefit from using huge pages include relational databases (e.g., MySQL, PostgreSQL, Oracle), managed runtimes (e.g. Javascript V8, Java JVM), cloud services (e.g. web search), web tooling (e.g. node.js). Mapping code sections onto the huge pages can reduce the number of ITLB misses by up to 50% [@IntelBlueprint], which yields speedups of up to 10% for some applications. However, as it is with many other features, huge pages are not for every application. Small programms with an executable file of only a few KB in size would be better off using regular 4KB pages rather than 2MB huge pages; that way, memory is used more efficiently.
+ Other large-memory applications that frequently benefit from using huge pages include relational databases (e.g., MySQL, PostgreSQL, Oracle), managed runtimes (e.g., Javascript V8, Java JVM), cloud services (e.g., web search), and web tooling (e.g., node.js). Mapping code sections onto huge pages can reduce the number of ITLB misses by up to 50% [@IntelBlueprint], which yields speedups of up to 10% for some applications. However, as with many other features, huge pages are not for every application. Small programs with an executable file of only a few KB in size would be better off using regular 4KB pages rather than 2MB huge pages; that way, memory is used more efficiently.

- The general idea of reducing ITLB pressure is by mapping the portions of the performance-critical code of an application onto 2MB (huge) pages. But usually, the entire code section of an application gets remapped for simplicity or if you don't know which functions are hot. The key requirement for that transformation to happen is to have code section aligned on 2MB boundary. When on Linux, this can be achieved in two different ways: relinking the binary with additional linker option or remapping the code sections at runtime. Both options are showcased on easyperf.net[^1] blog.
+ The general idea of reducing ITLB pressure is to map the performance-critical code of an application onto 2MB (huge) pages. Usually, though, the entire code section of an application gets remapped for simplicity, or if you don't know which functions are hot. The key requirement for that transformation is to have the code section aligned on a 2MB boundary. On Linux, this can be achieved in two different ways: relinking the binary with additional linker options or remapping the code sections at runtime. Both options are showcased on the easyperf.net[^1] blog. To the best of our knowledge, this is not possible on Windows, so we will only show how to do it on Linux.

- **TODO: how to do it on Windows?**

- The first option can be achieved by linking the binary with `-Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152` options. These options instruct the linker to place the code section at the 2MB boundary in preparation for it to be placed on 2MB pages by the loader at startup. The downside of such placement is that linker will be forced to insert up to 2MB of padded (wasted) bytes, bloating the binary even more. In the example with Clang compiler, it increased the size of the binary from 111 MB to 114 MB. After relinking the binary, we that determines if the text segment should be backed by default with huge pages. The simplest ways to do it is using the `hugeedit` or `hugectl` utilities from [libhugetlbfs](https://github.com/libhugetlbfs/libhugetlbfs/blob/master/HOWTO)[^12] package. For example:
+ The first option can be achieved by linking the binary with the `-Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152` options. These options instruct the linker to place the code section at a 2MB boundary in preparation for it to be placed on 2MB pages by the loader at startup. The downside of such placement is that the linker will be forced to insert up to 2MB of padding (wasted bytes), bloating the binary even more. In the example with the Clang compiler, it increased the size of the binary from 111 MB to 114 MB. After relinking the binary, we set a special bit in the ELF binary header that determines whether the text segment should be backed with huge pages by default. The simplest way to do this is by using the `hugeedit` or `hugectl` utilities from the [libhugetlbfs](https://github.com/libhugetlbfs/libhugetlbfs/blob/master/HOWTO)[^12] package. For example:

```bash
# Permanently set a special bit in the ELF binary header.
```

@@ -34,9 +32,10 @@

```bash
$ LD_PRELOAD=/usr/lib64/liblppreload.so clang++ a.cpp
```
While the first method only works with explicit huge pages, the second approach, which uses `iodlr`, works with both explicit and transparent huge pages. Instructions on how to enable huge pages on Windows and Linux can be found in Appendix C.

- Besides from employing huge pages, standard techniques for optimizing I-cache performance can be used for improving ITLB performance. Namely, reordering functions so that hot functions are collocated better, reducing the size of hot regions via Link-Time Optimizations (LTO/IPO), using Profile-Guided Optimizations (PGO), and less aggressive inlining.
+ Besides employing huge pages, standard techniques for optimizing I-cache performance can be used to improve ITLB performance: reordering functions so that hot functions are collocated, reducing the size of hot regions via Link-Time Optimization (LTO/IPO), using Profile-Guided Optimization (PGO) and BOLT, and less aggressive inlining.
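
For reference, a minimal sketch of an instrumentation-based PGO workflow with Clang; the file names are hypothetical:

```bash
# Step 1: build with instrumentation and run a representative workload.
$ clang++ -O2 -fprofile-instr-generate main.cpp -o a.out
$ ./a.out
# Step 2: merge the raw profile and rebuild using it.
$ llvm-profdata merge default.profraw -o prog.profdata
$ clang++ -O2 -fprofile-instr-use=prog.profdata main.cpp -o a.out
```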

+ BOLT provides the `-hugify` option to automatically use huge pages for hot code based on profile data. With this option, `llvm-bolt` injects code that puts hot code on 2MB pages at runtime. The implementation leverages Linux Transparent Huge Pages (THP). The benefit of this approach is that only a small portion of the code is mapped to huge pages, the number of required huge pages is minimized, and, as a consequence, page fragmentation is reduced.
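
A hedged sketch of how such a workflow might look; it assumes the binary was linked with `-Wl,--emit-relocs` (which BOLT requires) and that the paths and names are hypothetical:

```bash
# Collect an LBR-based profile (Intel CPUs) and convert it to BOLT's format.
$ perf record -e cycles:u -j any,u -- ./a.out
$ perf2bolt -p perf.data -o perf.fdata ./a.out
# Rewrite the binary; -hugify injects runtime code that remaps hot code onto 2MB pages via THP.
$ llvm-bolt ./a.out -o a.out.bolt -data perf.fdata -hugify
```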

[^1]: "Performance Benefits of Using Huge Pages for Code" - [https://easyperf.net/blog/2022/09/01/Utilizing-Huge-Pages-For-Code](https://easyperf.net/blog/2022/09/01/Utilizing-Huge-Pages-For-Code).
[^2]: iodlr library - [https://github.com/intel/iodlr](https://github.com/intel/iodlr).
- [^11]: ITLB Overhead - [https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/front-end-bound/itlb-overhead.html](https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/front-end-bound/itlb-overhead.html)
[^12]: libhugetlbfs - [https://github.com/libhugetlbfs/libhugetlbfs/blob/master/HOWTO](https://github.com/libhugetlbfs/libhugetlbfs/blob/master/HOWTO).
