[Grammar] 12-1 CPU-Specific Optimizations.md
dendibakh authored Aug 10, 2024
1 parent f9cc9ae commit 58a0895
Showing 1 changed file with 9 additions and 9 deletions.
@@ -7,7 +7,7 @@ The major differences between x86 (considered as CISC) and RISC ISAs, such as AR
* x86 instructions are variable-length, while ARM and RISC-V instructions are fixed-length. This makes decoding x86 instructions more complex.
* x86 ISA has many addressing modes, while ARM and RISC-V have few addressing modes. Operands in ARM and RISC-V instructions are either registers or immediate values, while x86 instruction inputs can also come from memory. This bloats the number of x86 instructions but also allows for more powerful single instructions. For instance, ARM requires loading a memory location first, then performing the operation; x86 can do both in one instruction.

In addition to this, there are a few other differences that you should consider when optimizing for a specific microarchitecture. As of 2024, the most recent x86-64 ISA has 16 architectural general-purpose registers, while the latest ARMv8 and RV64 require a CPU to provide 32 general-purpose registers. Extra architectural registers reduce register spilling and hence reduce the number of loads/stores. Intel has announced a new extension called APX[^1] that will increase the number of registers to 32. There is also a difference in the memory page size between x86 and ARM. The default page size for x86 platforms is 4 KB, while most ARM systems (for example, Apple MacBooks) use a 16 KB page size, although both platforms support larger page sizes (see [@sec:ArchHugePages], and [@sec:secDTLB]). All these differences can affect performance of your application when it becomes a bottleneck.
In addition to this, there are a few other differences that you should consider when optimizing for a specific microarchitecture. As of 2024, the most recent x86-64 ISA has 16 architectural general-purpose registers, while the latest ARMv8 and RV64 require a CPU to provide 32 general-purpose registers. Extra architectural registers reduce register spilling and hence reduce the number of loads/stores. Intel has announced a new extension called APX[^1] that will increase the number of registers to 32. There is also a difference in the memory page size between x86 and ARM. The default page size for x86 platforms is 4 KB, while most ARM systems (for example, Apple MacBooks) use a 16 KB page size, although both platforms support larger page sizes (see [@sec:ArchHugePages], and [@sec:secDTLB]). All these differences can affect the performance of your application when it becomes a bottleneck.

Although ISA differences *may* have a tangible impact on the performance of a specific application, numerous studies show that on average, differences between the two most popular ISAs, namely x86 and ARM, don't have a measurable performance impact. Throughout this book, we carefully avoided advertisements of any products (e.g., Intel vs. AMD vs. Apple) and any religious ISA debates (x86 vs. ARM vs. RISC-V).[^5] Below are some references that we hope will close the debate:

@@ -16,7 +16,7 @@ Although ISA differences *may* have a tangible impact on the performance of a sp
* CISC code is not denser than RISC code. [@CodeDensityCISCvsRISC]
* ISA overheads can be effectively mitigated by microarchitecture implementation. For example, $\mu$op cache minimizes decoding overheads; instruction cache minimizes code density impact. [@RISCvsCISC2013] [@ChipsAndCheesex86]

Nevertheless, this doesn't remove the value of architecture-specific optimizations. In this section, we will discuss how to optimize for a particular platform. We will cover ISA extensions, CPU dispatch and discuss how to reason about instruction latencies and throughput.
Nevertheless, this doesn't remove the value of architecture-specific optimizations. In this section, we will discuss how to optimize for a particular platform. We will cover ISA extensions, CPU dispatch, and discuss how to reason about instruction latencies and throughput.

### ISA Extensions

@@ -43,7 +43,7 @@ Here is a list of some notable ARM ISA extensions:
* sve: enables scalable vector length instructions.
* sme: Scalable Matrix Extension for accelerating matrix multiplication.

When compiling your applications, make sure to enable the necessary compiler flags to activate required ISA extensions. On GCC and Clang compilers use `-march` option. For example, `-march=native` will activate ISA features of your host system, i.e., on which you run the compilation. Or you can include a specific version of ISA, e.g., `-march=armv8.6-a`. On the MSVC compiler, use `/arch` option, e.g., `/arch:AVX2`.
When compiling your applications, make sure to enable the necessary compiler flags to activate required ISA extensions. On GCC and Clang compilers use the `-march` option. For example, `-march=native` will activate ISA features of your host system, i.e., on which you run the compilation. Or you can include a specific version of ISA, e.g., `-march=armv8.6-a`. On the MSVC compiler, use the `/arch` option, e.g., `/arch:AVX2`.
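
One way to confirm which extensions a given set of flags actually enabled is to inspect the compiler's predefined feature macros. The sketch below assumes GCC or Clang macro names (`__SSE4_2__`, `__AVX2__`, `__FMA__`, `__AVX512F__`, `__ARM_NEON`, `__ARM_FEATURE_SVE`); the function name `enabledSimdExtensions` is hypothetical and chosen only for illustration.

```cpp
#include <string>

// Reports which SIMD-related ISA extensions the compiler was allowed to use
// for this translation unit, based on predefined feature macros.
// This reflects compile-time flags (e.g., -march), not the CPU we run on.
std::string enabledSimdExtensions() {
  std::string result;
#if defined(__SSE4_2__)
  result += "sse4.2 ";
#endif
#if defined(__AVX2__)
  result += "avx2 ";
#endif
#if defined(__FMA__)
  result += "fma ";
#endif
#if defined(__AVX512F__)
  result += "avx512f ";
#endif
#if defined(__ARM_NEON)
  result += "neon ";
#endif
#if defined(__ARM_FEATURE_SVE)
  result += "sve ";
#endif
  if (result.empty())
    result = "baseline";  // none of the checked macros were defined
  return result;
}
```

Printing the returned string from a build made with `-march=native` and from one made without it is a quick way to see the difference between the two configurations.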

### CPU Dispatch

@@ -57,23 +57,23 @@ if (__builtin_cpu_supports ("avx512f")) {
}
```

This example demonstrates the use of built-in functions that are available in GCC and Clang compilers. Besides detecting supported ISA extensions, there is `__builtin_cpu_is` function to detect an exact processor model. A compiler-agnostic way of writing CPU dispatch is to use `CPUID` instruction (x86-only), `getauxval(AT_HWCAP)` Linux system call, or `sysctlbyname` on macOS.
This example demonstrates the use of built-in functions that are available in GCC and Clang compilers. Besides detecting supported ISA extensions, there is a `__builtin_cpu_is` function to detect an exact processor model. A compiler-agnostic way of writing CPU dispatch is to use the `CPUID` instruction (x86-only), the `getauxval(AT_HWCAP)` library call on Linux, or `sysctlbyname` on macOS.

You would typically see CPU dispatching constructs used to optimize only specific parts of the code, e.g., hot function or loop. Very often, these platform-specific implementations are written with compiler intrinsics [@sec:secIntrinsics] to generate desired instructions.

Even though CPU dispatching is a runtime check, its overhead is not high. Developers usually identify hardware capabilities at startup once and save it in some variable, so at runtime it becomes just a single branch, which is well-predicted. Perhaps a bigger concern about CPU dispatching is the maintenance cost. Every new specialized branch requires fine-tuning and validation.
Even though CPU dispatching is a runtime check, its overhead is not high. Developers usually identify hardware capabilities once at startup and save them in a variable, so at runtime, the check becomes just a single well-predicted branch. Perhaps a bigger concern about CPU dispatching is the maintenance cost. Every new specialized branch requires fine-tuning and validation.
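
The "detect once, then branch cheaply" pattern described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `sumScalar`, `sumAvx2`, `selectSum`, and `sum` are hypothetical, and `sumAvx2` is only a placeholder that reuses the scalar loop where a real project would use intrinsics. The feature check assumes GCC or Clang on x86; on other targets the code falls back to the scalar version.

```cpp
#include <cstddef>

// Portable baseline kernel.
static float sumScalar(const float* a, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    s += a[i];
  return s;
}

// Placeholder for an AVX2-specialized kernel. A real implementation would
// use intrinsics; here it simply forwards to the scalar loop.
static float sumAvx2(const float* a, std::size_t n) {
  return sumScalar(a, n);
}

using SumFn = float (*)(const float*, std::size_t);

// Queries the hardware and picks an implementation.
static SumFn selectSum() {
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
  __builtin_cpu_init();  // harmless if the runtime has already initialized it
  if (__builtin_cpu_supports("avx2"))
    return sumAvx2;
#endif
  return sumScalar;
}

// Public entry point: the selection runs once, on the first call.
// Later calls only pay for an indirect call through the cached pointer.
float sum(const float* a, std::size_t n) {
  static const SumFn impl = selectSum();
  return impl(a, n);
}
```

Callers use `sum(data, n)` and never see the dispatch. Each additional specialized kernel adds one more path that has to be tested on matching hardware, which is the maintenance cost mentioned above.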

### Instruction Latencies and Throughput

Besides ISA extensions, it's worth learning about the number and type of execution units in your processor. For instance, the number of loads, stores, divisions and multiplications a processor can issue every cycle. For most processors, this information is published by CPU vendors in corresponding technical manuals. However, information about latencies and throughput of specific instructions is not usually disclosed. Nevertheless, people have benchmarked individual instructions, which can be accessed online. For the latest Intel and AMD CPUs, latency, throughput, port usage, and the number of $\mu$ops for an instruction can be found at the [uops.info](https://uops.info/table.html)[^2] website. For the Apple M1 processor, similar data is accessible in [@AppleOptimizationGuide, Appendix A].[^6] Along with instructions latencies and throughput, developers have reverse-engineered other aspects of a microarchitecture such as the size of branch prediction history buffers, reorder buffer capacity, size of load/store buffers, and others. Since this is an unofficial source of data, you should take it with a grain of salt.
Besides ISA extensions, it's worth learning about the number and type of execution units in your processor, for instance, the number of loads, stores, divisions, and multiplications a processor can issue every cycle. For most processors, this information is published by CPU vendors in corresponding technical manuals. However, information about latencies and throughput of specific instructions is not usually disclosed. Nevertheless, people have benchmarked individual instructions, and the results can be accessed online. For the latest Intel and AMD CPUs, latency, throughput, port usage, and the number of $\mu$ops for an instruction can be found at the [uops.info](https://uops.info/table.html)[^2] website. For the Apple M1 processor, similar data is accessible in [@AppleOptimizationGuide, Appendix A].[^6] Along with instruction latencies and throughput, developers have reverse-engineered other aspects of a microarchitecture such as the size of branch prediction history buffers, reorder buffer capacity, size of load/store buffers, and others. Since this is an unofficial source of data, you should take it with a grain of salt.

Be very careful about making conclusions just on the instruction latency and throughput numbers. In many cases, instruction latencies are hidden by the out-of-order execution engine, and it may not matter if an instruction has a latency of 4 or 8 cycles. If it doesn't block forward progress, such instruction will be handled "in the background" without harming performance. However, the latency of an instruction becomes important when it stands on a critical dependency chain because it delays the execution of dependent operations.

In contrast, if you have a loop that performs a lot of _independent_ operations, you should focus on instruction throughput rather than latency. When operations are independent, they can be processed in parallel. In such a scenario, the critical factor is how many operations of a certain type can be executed per cycle, or *execution throughput*. There are also "in-between" scenarios, where both instruction latency and throughput may affect performance.
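
To make the distinction concrete, the sketch below contrasts a latency-bound reduction with a variant that exposes independent work. Both function names are hypothetical. The second version splits the sum across four accumulators, so the four dependency chains can overlap and the loop moves closer to being throughput-bound. Note that this reassociates floating-point additions, so results may differ in the last bits from the single-accumulator version; it is one possible transformation, not necessarily the one a compiler or a given codebase would choose.

```cpp
// Latency-bound: every addition depends on the previous value of `sum`,
// forming a single dependency chain through the whole loop.
float sqSumChain(const float* a, int n) {
  float sum = 0.0f;
  for (int i = 0; i < n; ++i)
    sum += a[i] * a[i];
  return sum;
}

// Closer to throughput-bound: four independent accumulators form four
// separate dependency chains that the out-of-order engine can overlap.
float sqSumFourAcc(const float* a, int n) {
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  int i = 0;
  for (; i + 3 < n; i += 4) {
    s0 += a[i] * a[i];
    s1 += a[i + 1] * a[i + 1];
    s2 += a[i + 2] * a[i + 2];
    s3 += a[i + 3] * a[i + 3];
  }
  for (; i < n; ++i)  // handle the remaining 0-3 elements
    s0 += a[i] * a[i];
  return (s0 + s1) + (s2 + s3);
}
```

Whether the second version is actually faster depends on the target CPU and on what the compiler does with each loop, so it is worth measuring both.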

When you analyze machine code for one of your hot loops, you may find that multiple instructions are assigned to the same execution port. This situation is known as _execution port contention_. So the challenge is to find ways of substituting some of these instructions with ones that are not assigned to the same port. For example, on Intel processors, if you're heavily bottlenecked on `port5`, then you may find that two instructions on `port0` are better than one instruction on `port5`. Often it is not an easy task, and it requires deep ISA and microarchitecture knowledge. When you get stuck, seek help on specialized forums. Also, keep in mind that some of these things may change in future CPU generations, so consider using CPU dispatch to isolate the effect of your code changes.

In [@sec:FMAThroughput], we looked at one example, when the throughput of FMA instructions becomes critical. Now let's take a look at another example, involving FMA latency. In [@lst:FMAlatency] on the left, we have the `sqSum` function which computes a sum of every element squared. On the right, we present the corresponding machine code generated by Clang-18 when compiled with `-O3 -march=core-avx2`. Notice, we didn't use `-ffast-math`, perhaps because we want to maintain bit-exact results over multiple platforms. That's why the code was not autovectorized by the compiler.
In [@sec:FMAThroughput], we looked at one example where the throughput of FMA instructions becomes critical. Now let's take a look at another example, involving FMA latency. In [@lst:FMAlatency] on the left, we have the `sqSum` function, which computes a sum of every element squared. On the right, we present the corresponding machine code generated by Clang-18 when compiled with `-O3 -march=core-avx2`. Notice that we didn't use `-ffast-math`, perhaps because we want to maintain bit-exact results over multiple platforms. That's why the code was not autovectorized by the compiler.

Listing: FMA latency

@@ -88,7 +88,7 @@ float sqSum(float *a, int N) { │ .loop:
The code looks very logical; it uses a fused multiply-add instruction to compute the square of one element of `a` (in `xmm1`) and accumulate the result in `xmm0`. The problem is that there is a data dependency over `xmm0`. A processor cannot issue a new `vfmadd231ss` instruction until the previous one has finished since `xmm0` is both an input and an output of `vfmadd231ss`. The performance of this loop is bound by FMA latency, which on Intel's Alder Lake equals 4 cycles.
You may think: "But wait, multiplications do not depend on each other." Yes, you're right, yet the whole FMA instruction needs to wait until all its inputs become available. So, in this case, fusing multiplication and addition actually hurts performance. We would be better off with two separate instructions. The `nanobench` experiment below proves that:
You may think: "But wait, multiplications do not depend on each other." Yes, you're right, yet the whole FMA instruction needs to wait until all its inputs become available. So, in this case, fusing multiplication and addition hurts performance. We would be better off with two separate instructions. The `nanobench` experiment below proves that:
```
# ran on Intel Core i7-1260P (Alderlake)
@@ -112,4 +112,4 @@ From this experiment, we know that if the compiler would not have decided to fus
[^2]: x86 instruction latency and throughput - [https://uops.info/table.html](https://uops.info/table.html)
[^4]: LLVM extensions to specify floating-point flags - [https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags](https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags)
[^5]: The debate also isn't interesting because after $\mu$ops conversion, x86 becomes a RISC-style micro-architecture. Complex instructions get broken down into simpler instructions.
[^6]: Also, there are instruction throughput and latency data collected via reverse-engineering experiments, such as in [https://dougallj.github.io/applecpu/firestorm-simd.html](https://dougallj.github.io/applecpu/firestorm-simd.html). Since this is an unofficial source of data, you should take it with a grain of salt.
[^6]: Also, there are instruction throughput and latency data collected via reverse-engineering experiments, such as in [https://dougallj.github.io/applecpu/firestorm-simd.html](https://dougallj.github.io/applecpu/firestorm-simd.html). Since this is an unofficial source of data, you should take it with a grain of salt.
