Skip to content

Commit

Permalink
[Grammar] 11-7 PGO.md
Browse files Browse the repository at this point in the history
  • Loading branch information
dendibakh authored Aug 10, 2024
1 parent 9e515ea commit 74a7493
Showing 1 changed file with 10 additions and 10 deletions.
20 changes: 10 additions & 10 deletions chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,29 +2,29 @@

## Profile Guided Optimizations {#sec:secPGO}

Compiling a program and generating optimal assembly is all about heuristics. Code transformation algorithms have many corner cases that aim for optimal performance in specific situations. For a lot of decisions that a compiler makes, it tries to guess the best choice based on some typical cases. For example, when deciding whether a particular function should be inlined, the compiler could take into account the number of times this function will be called. The problem is that compiler doesn't know that beforehand. It first needs to run the program to find out. Without any runtime information, the compiler will have to make a guess.
Compiling a program and generating optimal assembly is all about heuristics. Code transformation algorithms have many corner cases that aim for optimal performance in specific situations. For a lot of decisions that a compiler makes, it tries to guess the best choice based on some typical cases. For example, when deciding whether a particular function should be inlined, the compiler could take into account the number of times this function will be called. The problem is that the compiler doesn't know that beforehand. It first needs to run the program to find out. Without any runtime information, the compiler will have to make a guess.

Here is when profiling information becomes handy. Given profiling information, compiler can make better optimization decisions. There is a set of transformations in most compilers that can adjust their algorithms based on profiling data fed back to them. This set of transformations is called Profile Guided Optimizations (PGO). When profiling data is available, a compiler can use it to direct optimizations. Otherwise, it will fall back to using its standard algorithms and heuristics. Sometimes in literature, you can find the term Feedback Directed Optimizations (FDO), which refers to the same thing as PGO.
Here is when profiling information becomes handy. Given profiling information, the compiler can make better optimization decisions. There is a set of transformations in most compilers that can adjust their algorithms based on profiling data fed back to them. This set of transformations is called Profile Guided Optimizations (PGO). When profiling data is available, a compiler can use it to direct optimizations. Otherwise, it will fall back to using its standard algorithms and heuristics. Sometimes in literature, you can find the term Feedback Directed Optimizations (FDO), which refers to the same thing as PGO.

Figure @fig:PGO_flow shows a traditional workflow of using PGO, also called *instrumented PGO*. First, you compile your program and tell the compiler to automatically instrument the code. This will insert some bookkeeping code into functions to collect runtime statistics. Second step is to run the instrumented binary with an input data that represents a typical workload for your application. This will generate the profiling data, a new file with runtime statistics. It is a raw dump file with information about function call counts, loop iteration counts, and other basic block hit counts. The final step in this workflow is to recompile the program with the profiling data to produce optimized executable.
Figure @fig:PGO_flow shows a traditional workflow of using PGO, also called *instrumented PGO*. First, you compile your program and tell the compiler to automatically instrument the code. This will insert some bookkeeping code into functions to collect runtime statistics. The second step is to run the instrumented binary with input data that represents a typical workload for your application. This will generate the profiling data, a new file with runtime statistics. It is a raw dump file with information about function call counts, loop iteration counts, and other basic block hit counts. The final step in this workflow is to recompile the program with the profiling data to produce the optimized executable.

![Instrumented PGO workflow.](../../img/cpu_fe_opts/pgo_flow.png){#fig:PGO_flow width=100% }

Developers can enable PGO instrumentation (step 1) in the LLVM compiler by building the program with the `-fprofile-instr-generate` option. This will instruct the compiler to instrument the code, which will collect profiling information at runtime. After that, the LLVM compiler can consume profiling data with the `-fprofile-instr-use` option to recompile the program and output a PGO-tuned binary. The guide for using PGO in clang is described in the [documentation](https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation).[^7] GCC compiler uses different set of options: `-fprofile-generate` and `-fprofile-use` as described in the [documentation](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options).[^10]
Developers can enable PGO instrumentation (step 1) in the LLVM compiler by building the program with the `-fprofile-instr-generate` option. This will instruct the compiler to instrument the code, which will collect profiling information at runtime. After that, the LLVM compiler can consume profiling data with the `-fprofile-instr-use` option to recompile the program and output a PGO-tuned binary. The guide for using PGO in Clang is described in the [documentation](https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation).[^7] GCC compiler uses a different set of options: `-fprofile-generate` and `-fprofile-use` as described in the [documentation](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options).[^10]

PGO helps the compiler to improve function inlining, code placement, register allocation, and other code transformations. PGO is primarily used in projects with a large codebase, for example, Linux kernel, compilers, databases, web browsers, video games, productivity tools and others. For applications with millions lines of code, it is the only practical way to improve machine code layout. It is not uncommon to see performance of production workloads increase by 10-25% from using Profile Guided Optimizations.
PGO helps the compiler to improve function inlining, code placement, register allocation, and other code transformations. PGO is primarily used in projects with a large codebase, for example, Linux kernel, compilers, databases, web browsers, video games, productivity tools, and others. For applications with millions of lines of code, it is the only practical way to improve machine code layout. It is not uncommon to see the performance of production workloads increase by 10-25% from using Profile Guided Optimizations.

While many software projects adopted instrumented PGO as a part of their build process, the rate of adoption is still very low. There are a few reasons for that. The primary reason is a huge runtime overhead of instrumented executables. Running an instrumented binary and collecting profiling data frequently incurs 5-10x slowdown, which makes the build step longer and prevents profile collection directly from production systems, whether on client devices or in the cloud. Unfortunately, you cannot collect the profiling data once and use it for all the future builds. As the source code of an application evolves, the profile data becomes stale (out of sync) and needs to be recollected.
While many software projects adopted instrumented PGO as a part of their build process, the rate of adoption is still very low. There are a few reasons for that. The primary reason is the huge runtime overhead of instrumented executables. Running an instrumented binary and collecting profiling data frequently incurs a 5-10x slowdown, which makes the build step longer and prevents profile collection directly from production systems, whether on client devices or in the cloud. Unfortunately, you cannot collect the profiling data once and use it for all future builds. As the source code of an application evolves, the profile data becomes stale (out of sync) and needs to be recollected.

Another caveat in the PGO flow is that a compiler should only be trained using representative scenarios of how your application will be used. Otherwise, you may end up degrading program's performance. The compiler "blindly" uses the profile data that you provided. It assumes that the program will always behave the same no matter what the input data is. Users of PGO should be careful about choosing the input data they will use for collecting profiling data (step 2) because while improving one use case of the application, others may be pessimized. Luckily, it doesn't have to be exactly a single workload since profile data from different workloads can be merged together to represent a set of use cases for the application.
Another caveat in the PGO flow is that a compiler should only be trained using representative scenarios of how your application will be used. Otherwise, you may end up degrading the program's performance. The compiler "blindly" uses the profile data that you provided. It assumes that the program will always behave the same no matter what the input data is. Users of PGO should be careful about choosing the input data they will use for collecting profiling data (step 2) because while improving one use case of the application, others may be pessimized. Luckily, it doesn't have to be exactly a single workload since profile data from different workloads can be merged to represent a set of use cases for the application.

An alternative solution was pioneered by Google in 2016 with sample-based PGO. [@AutoFDO] Instead of instrumenting the code, the profiling data can be obtained from the output of a standard profiling tool such as Linux `perf`. Google developed an open-source tool called [AutoFDO](https://github.com/google/autofdo)[^8] that converts sampling data generated by Linux `perf` into a format that compilers like GCC and LLVM can understand.

This approach has a few advantages over instrumented PGO. First of all, it eliminates one step from the PGO build workflow, namely step 1 since there is no need to build an instrumented binary. Secondly, profiling data collection runs on an already optimized binary, thus it has a much lower runtime overhead. This makes it possible to collect profiling data in production environment for a longer period of time. Since this approach is based on hardware collection, it also enables new kinds of optimizations that are not possible with instrumented PGO. One example is branch-to-cmov conversion, which is a transformation that replaces conditional jumps with conditional moves to avoid the cost of a branch misprediction (see [@sec:BranchlessPredication]). To effectively perform this transformation, a compiler needs to know how frequently the original branch was mispredicted. This information is available with sample-based PGO on modern CPUs (Intel Skylake+).
This approach has a few advantages over instrumented PGO. First of all, it eliminates one step from the PGO build workflow, namely step 1 since there is no need to build an instrumented binary. Secondly, profiling data collection runs on an already optimized binary, thus it has a much lower runtime overhead. This makes it possible to collect profiling data in a production environment for a longer time. Since this approach is based on hardware collection, it also enables new kinds of optimizations that are not possible with instrumented PGO. One example is branch-to-cmov conversion, which is a transformation that replaces conditional jumps with conditional moves to avoid the cost of a branch misprediction (see [@sec:BranchlessPredication]). To effectively perform this transformation, a compiler needs to know how frequently the original branch was mispredicted. This information is available with sample-based PGO on modern CPUs (Intel Skylake+).

The next innovative idea came from Meta in the mid-2018, when it open-sourced its binary optimization tool called [BOLT](https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/).[^9] BOLT works on the already compiled binary. It first disassembles the code, then it uses the profile information collected by sampling profiler, such as Linux perf, to do various layout transformations and then relinks the binary again. [@BOLT] As of today, BOLT has more than 15 optimization passes, including basic blocks reordering, function splitting and reordering, and others. Similar to traditional PGO, primary candidates for BOLT optimizations are programs that suffer from many instruction cache and iTLB misses. Since January 2022, BOLT is a part of the LLVM project and is available as a standalone tool.
The next innovative idea came from Meta in mid-2018, when it open-sourced its binary optimization tool called [BOLT](https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/).[^9] BOLT works on the already compiled binary. It first disassembles the code, then it uses the profile information collected by a sampling profiler, such as Linux perf, to do various layout transformations and then relinks the binary again. [@BOLT] As of today, BOLT has more than 15 optimization passes, including basic block reordering, function splitting and reordering, and others. Similar to traditional PGO, primary candidates for BOLT optimizations are programs that suffer from instruction cache and iTLB misses. Since January 2022, BOLT has been a part of the LLVM project and is available as a standalone tool.

A few years after BOLT was introduced, Google open-sourced its binary relinking tool called [Propeller](https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf). It serves a similar purpose but instead of disassembling the original binary, it relies on linker input, and thus can be distributed across several machines for better scaling and less memory consumption. Post-link optimizers such as BOLT and Propeller can be used in combination with traditional PGO (and LTO) and often provide additional 5-10% performance speedup. Such techniques open up new kinds of binary rewriting optimizations that are based on hardware telemetry.
A few years after BOLT was introduced, Google open-sourced its binary relinking tool called [Propeller](https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf). It serves a similar purpose but instead of disassembling the original binary, it relies on linker input and thus can be distributed across several machines for better scaling and less memory consumption. Post-link optimizers such as BOLT and Propeller can be used in combination with traditional PGO (and LTO) and often provide an additional 5-10% performance speedup. Such techniques open up new kinds of binary rewriting optimizations that are based on hardware telemetry.

[^7]: PGO in Clang - [https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation](https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation)
[^8]: AutoFDO - [https://github.com/google/autofdo](https://github.com/google/autofdo)
Expand Down

0 comments on commit 74a7493

Please sign in to comment.