Chapter 12 edits (#78)
* 12-1: capitalize ARM extensions, number agreement, Alder Lake

* 11-2: small changes

* 12-7: Mib -> MiB
dankamongmen authored Sep 26, 2024
1 parent 722f4ed commit bbc4c1c
Showing 3 changed files with 16 additions and 16 deletions.
chapters/12-Other-Tuning-Areas/12-1 CPU-Specific Optimizations.md (22 changes: 11 additions & 11 deletions)
@@ -7,7 +7,7 @@ The major differences between x86 (considered as CISC) and RISC ISAs, such as AR
* x86 instructions are variable-length, while ARM and RISC-V instructions are fixed-length. This makes decoding x86 instructions more complex.
* x86 ISA has many addressing modes, while ARM and RISC-V have few addressing modes. Operands in ARM and RISC-V instructions are either registers or immediate values, while x86 instruction inputs can also come from memory. This bloats the number of x86 instructions but also allows for more powerful single instructions. For instance, ARM requires loading a memory location first, then performing the operation; x86 can do both in one instruction.

-In addition to this, there are a few other differences that you should consider when optimizing for a specific microarchitecture. As of 2024, the most recent x86-64 ISA has 16 architectural general-purpose registers, while the latest ARMv8 and RV64 require a CPU to provide 32 general-purpose registers. Extra architectural registers reduce register spilling and hence reduce the number of loads/stores. Intel has announced a new extension called APX[^1] that will increase the number of registers to 32. There is also a difference in the memory page size between x86 and ARM. The default page size for x86 platforms is 4 KB, while most ARM systems (for example, Apple MacBooks) use a 16 KB page size, although both platforms support larger page sizes (see [@sec:ArchHugePages], and [@sec:secDTLB]). All these differences can affect the performance of your application when it becomes a bottleneck.
+In addition to this, there are a few other differences that you should consider when optimizing for a specific microarchitecture. As of 2024, the most recent x86-64 ISA has 16 architectural general-purpose registers, while the latest ARMv8 and RV64 require a CPU to provide 32 general-purpose registers. Extra architectural registers reduce register spilling and hence reduce the number of loads/stores. Intel has announced a new extension called APX[^1] that will increase the number of registers to 32. There is also a difference in the memory page size between x86 and ARM. The default page size for x86 platforms is 4 KB, while most ARM systems (for example, Apple MacBooks) use a 16 KB page size, although both platforms support larger page sizes (see [@sec:ArchHugePages], and [@sec:secDTLB]). All these differences can affect the performance of your application when they become a bottleneck.

Although ISA differences *may* have a tangible impact on the performance of a specific application, numerous studies show that on average, differences between the two most popular ISAs, namely x86 and ARM, don't have a measurable performance impact. Throughout this book, we carefully avoided advertisements of any products (e.g., Intel vs. AMD vs. Apple) and any religious ISA debates (x86 vs. ARM vs. RISC-V).[^5] Below are some references that we hope will close the debate:

@@ -20,7 +20,7 @@ Nevertheless, this doesn't remove the value of architecture-specific optimizatio

### ISA Extensions

-ISA evolution has been continuous, it has focused on accelerating specialized workloads, such as cryptography, AI, multimedia, and others. Utilizing ISA extensions often results in lucrative performance improvements. Developers keep finding smart ways to leverage these extensions in general-purpose applications. So, even if you're outside of one of these highly specialized domains, you might still benefit from using ISA extensions.
+ISA evolution has been continuous. It has focused on accelerating specialized workloads, such as cryptography, AI, multimedia, and others. Utilizing ISA extensions often results in lucrative performance improvements. Developers keep finding smart ways to leverage these extensions in general-purpose applications. So, even if you're outside of one of these highly specialized domains, you might still benefit from using ISA extensions.

It's not possible to learn about all specific instructions. But we suggest you familiarize yourself with major ISA extensions available on your target platform. For example, if you are developing an AI application that uses `fp16` data types, and you target one of the modern ARM processors, make sure that your program's machine code contains corresponding `fp16` ISA extensions. If you're developing encryption/decryption software, check if it utilizes crypto extensions of your target ISA. And so on.

@@ -36,12 +36,12 @@ Here is a list of some notable x86 ISA extensions:

Here is a list of some notable ARM ISA extensions:

-* asimd: also known as neon, provides SIMD instructions for floating-point and integer operations.
-* aes/sha1/sha2/sha3/sha512/crc32: provide instructions for encryption, hashing, and checksumming.
-* fp16/bf16: provide 16-bit half-precision and `Bfloat16` floating-point instructions.
-* dotprod: support for dot product instructions for accelerating machine learning workloads.
-* sve: enables scalable vector length instructions.
-* sme: Scalable Matrix Extension for accelerating matrix multiplication.
+* Advanced SIMD: also known as NEON, provides arithmetic SIMD instructions.
+* Cryptographic Instructions: provide instructions for encryption, hashing, and checksumming.
+* FP16/BF16: provide 16-bit half-precision and `Bfloat16` floating-point instructions.
+* UDOT/SDOT: support for dot product instructions for accelerating machine learning workloads.
+* SVE: enables scalable vector length instructions.
+* SME: Scalable Matrix Extension for accelerating matrix multiplication.

When compiling your applications, make sure to enable the necessary compiler flags to activate required ISA extensions. On GCC and Clang compilers, use the `-march` option. For example, `-march=native` will activate the ISA features of your host system, i.e., the machine on which you run the compilation. Alternatively, you can target a specific ISA version, e.g., `-march=armv8.6-a`. On the MSVC compiler, use the `/arch` option, e.g., `/arch:AVX2`.
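
One lightweight way to catch a mismatch between the code's assumptions and the enabled flags is a compile-time check. The sketch below is our illustration (not from the book) and relies on feature-test macros that GCC and Clang predefine when the corresponding extension is active:

```cpp
// Minimal sketch: stop the build if the ISA extensions that the hot code
// assumes were not enabled via -march/-mavx2 and friends.
// __AVX2__ and __ARM_FEATURE_FP16_SCALAR_ARITHMETIC are predefined by
// GCC/Clang when the corresponding extension is enabled for the target.
#if defined(__x86_64__) && !defined(__AVX2__)
#  error "Rebuild with -mavx2 (or -march=x86-64-v3/native); this code assumes AVX2."
#endif
#if defined(__aarch64__) && !defined(__ARM_FEATURE_FP16_SCALAR_ARITHMETIC)
#  error "Rebuild with -march=armv8.2-a+fp16 or newer; this code assumes fp16 arithmetic."
#endif
```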

@@ -65,7 +65,7 @@ Even though CPU dispatching is a runtime check, its overhead is not high. Develo
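
On GCC and Clang for x86, a simple dispatch can be expressed with the `__builtin_cpu_supports` builtin. The sketch below is an illustration with hypothetical kernel names, not the book's example:

```cpp
#include <cstddef>

// Hypothetical kernels: the AVX2 variant would live in a translation unit
// compiled with -mavx2 (or be marked __attribute__((target("avx2")))).
float computeAVX2(const float *a, std::size_t n);
float computeScalar(const float *a, std::size_t n);

float compute(const float *a, std::size_t n) {
  // Runtime check of the host CPU (x86-specific GCC/Clang builtin). The
  // branch is highly predictable, so the dispatch overhead is negligible.
  if (__builtin_cpu_supports("avx2"))
    return computeAVX2(a, n);
  return computeScalar(a, n);
}
```

GCC (and recent Clang) can also generate such a dispatch automatically via the `target_clones` function attribute.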

### Instruction Latencies and Throughput

-Besides ISA extensions, it's worth learning about the number and type of execution units in your processor. For instance, the number of loads, stores, divisions, and multiplications a processor can issue every cycle. For most processors, this information is published by CPU vendors in corresponding technical manuals. However, information about latencies and throughput of specific instructions is not usually disclosed. Nevertheless, people have benchmarked individual instructions, which can be accessed online. For the latest Intel and AMD CPUs, latency, throughput, port usage, and the number of $\mu$ops for an instruction can be found at the [uops.info](https://uops.info/table.html)[^2] website. For the Apple M1 processor, similar data is accessible in [@AppleOptimizationGuide, Appendix A].[^6] Along with instructions latencies and throughput, developers have reverse-engineered other aspects of a microarchitecture such as the size of branch prediction history buffers, reorder buffer capacity, size of load/store buffers, and others. Since this is an unofficial source of data, you should take it with a grain of salt.
+Besides ISA extensions, it's worth learning about the number and type of execution units in your processor (e.g., the number of loads, stores, divisions, and multiplications a processor can issue every cycle). For most processors, this information is published by CPU vendors in corresponding technical manuals. However, information about latencies and throughput of specific instructions is not usually disclosed. Nevertheless, people have benchmarked individual instructions, which can be accessed online. For the latest Intel and AMD CPUs, latency, throughput, port usage, and the number of $\mu$ops for an instruction can be found at the [uops.info](https://uops.info/table.html)[^2] website. For the Apple M1 processor, similar data is accessible in [@AppleOptimizationGuide, Appendix A].[^6] Along with instruction latencies and throughput, developers have reverse-engineered other aspects of a microarchitecture such as the size of branch prediction history buffers, reorder buffer capacity, size of load/store buffers, and others. Since this is an unofficial source of data, you should take it with a grain of salt.

Be very careful about drawing conclusions just from instruction latency and throughput numbers. In many cases, instruction latencies are hidden by the out-of-order execution engine, and it may not matter if an instruction has a latency of 4 or 8 cycles. If it doesn't block forward progress, such an instruction will be handled "in the background" without harming performance. However, the latency of an instruction becomes important when it stands on a critical dependency chain because it delays the execution of dependent operations.
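
To build intuition before the FMA example below, here is a minimal sketch (ours, not from the book) contrasting a single dependency chain with two independent ones:

```cpp
// One accumulator: every addition waits for the previous one, so the loop
// runs at roughly one element per FP-add latency (latency-bound).
float reduceOneChain(const float *a, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s += a[i];                 // next iteration depends on this result
  return s;
}

// Two accumulators: two independent chains that the out-of-order engine can
// overlap, roughly halving the latency bound. Note that this changes the
// order of floating-point additions, so results may differ in the last bits.
float reduceTwoChains(const float *a, int n) {
  float s0 = 0.0f, s1 = 0.0f;
  for (int i = 0; i + 1 < n; i += 2) {
    s0 += a[i];                // chain 0
    s1 += a[i + 1];            // chain 1, independent of chain 0
  }
  if (n % 2) s0 += a[n - 1];   // leftover element when n is odd
  return s0 + s1;
}
```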

@@ -88,12 +88,12 @@ float sqSum(float *a, int N) { │ .loop:
} │ jne .loop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The code looks very logical; it uses fused multiply-add instruction to compute the product of one element of `a` (in `xmm1`) and then accumulates the result in `xmm0`. The problem is that there is a data dependency over `xmm0`. A processor cannot issue a new `vfmadd231ss` instruction until the previous one has finished since `xmm0` is both an input and an output of `vfmadd231ss`. The performance of this loop is bound by FMA latency, which in Intel's Alderlake equals 4 cycles.
+The code looks very logical; it uses fused multiply-add instruction to compute the product of one element of `a` (in `xmm1`) and then accumulates the result in `xmm0`. The problem is that there is a data dependency over `xmm0`. A processor cannot issue a new `vfmadd231ss` instruction until the previous one has finished since `xmm0` is both an input and an output of `vfmadd231ss`. The performance of this loop is bound by FMA latency, which in Intel's Alder Lake equals 4 cycles.
You may think: "But wait, multiplications do not depend on each other." Yes, you're right, yet the whole FMA instruction needs to wait until all its inputs become available. So, in this case, fusing multiplication and addition hurts performance. We would be better off with two separate instructions. The `nanobench` experiment below proves that:
```
-# ran on Intel Core i7-1260P (Alderlake)
+# ran on Intel Core i7-1260P (Alder Lake)
$ sudo ./kernel-nanoBench.sh -f -basic │ $ sudo ./kernel-nanoBench.sh -f -basic
-loop 100 -unroll 1000 │ -loop 100 -unroll 1000
-warm_up_count 10 -asm " │ -warm_up_count 10 -asm "
@@ -1,12 +1,12 @@
## Microarchitecture-Specific Performance Issues {#sec:UarchSpecificIssues}

-In this section, we will discuss some common microarchitecture-specific issues that can be attributed to the majority of modern processors. We call them microarchitecture-specific because they are caused by the way a particular microarchitecture feature is implemented. These issues are very specific and do not frequently appear as a major performance bottleneck. Typically, they are diluted among other more significant performance problems. Thus, these microarchitecture-specific performance issues are considered corner cases and are less known than the other issues that we discussed in the book. Nevertheless, they can cause very undesirable performance penalties. Note that the impact of a particular problem can be more/less pronounced on one platform than another. Also, keep in mind, that there are other microarchitecture-specific issues that we don't cover in this book.
+In this section, we will discuss some common microarchitecture-specific issues that can be attributed to the majority of modern processors. We call them microarchitecture-specific because they are caused by the way a particular microarchitecture feature is implemented. These issues are very specific and do not frequently appear as a major performance bottleneck. Typically, they are diluted among other more significant performance problems. Thus, these microarchitecture-specific performance issues are considered corner cases and are less known than the other issues that we discussed in the book. Nevertheless, they can cause very undesirable performance penalties. Note that the impact of a particular problem can be more/less pronounced on one platform than another. Also, keep in mind that there are other microarchitecture-specific issues that we don't cover in this book.

### Memory Order Violations {#sec:MemoryOrderViolations}

We introduced the concept of memory ordering in [@sec:uarchLSU]. Memory reordering is a crucial aspect of modern CPUs, as it enables them to execute memory requests in parallel and out-of-order. The key element in load/store reordering is memory disambiguation, which predicts if it is safe to let loads go ahead of earlier stores. Since memory disambiguation is speculative, it can lead to performance issues if not handled properly.

-Consider an example in [@lst:MemOrderViolation], on the left. This code snippet calculates a histogram of an 8-bit grayscale image, i.e., calculate how many times a certain color appears in the image. Besides countless other places, this code can be found in Otsu's thresholding algorithm[^1] which is used to convert a grayscale image to a binary image. Since the input image is 8-bit grayscale, there are only 256 different colors.
+Consider an example in [@lst:MemOrderViolation], on the left. This code snippet calculates a histogram of an 8-bit grayscale image, i.e., how many times a certain color appears in the image. Besides countless other places, this code can be found in Otsu's thresholding algorithm[^1] which is used to convert a grayscale image to a binary image. Since the input image is 8-bit grayscale, there are only 256 different colors.

For each pixel in the image, you need to read the current histogram count for the pixel's color, increment it, and store it back. This is a classic read-modify-write dependency through memory. Imagine we have the following consecutive pixels in the image: `0xFF 0xFF 0x00 0xFF 0xFF ...` and so on. The loaded value of the histogram count for pixel 1 comes from the result of the previous iteration. But the histogram count for pixel 2 comes from memory; it is independent and can be reordered. But then again, the histogram count for pixel 3 is dependent on the result of processing pixel 1, and so on.
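
For reference, here is a minimal sketch of such a histogram loop (our names, not necessarily those used in [@lst:MemOrderViolation]):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch: each iteration loads hist[color], increments it, and stores it
// back. When two nearby pixels share a color, the later load depends on the
// earlier store, which is exactly where memory disambiguation can mispredict.
void computeHistogram(const uint8_t *image, std::size_t n,
                      std::array<uint32_t, 256> &hist) {
  hist.fill(0);
  for (std::size_t i = 0; i < n; ++i)
    hist[image[i]]++;   // read-modify-write through memory
}
```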

@@ -31,7 +31,7 @@ for (int i = 0; i < N; ++i) => hist2.fill(0);
hist1[i] += hist2[i];
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Recall from [@sec:uarchLSU], that the processor doesn't necessarily know about a potential store-to-load forwarding, so it has to make a prediction. If it correctly predicts a memory order violation between two updates of color `0xFF`, then these accesses will be serialized. The performance will not be great, but it is the best we could hope for with the initial code. On the contrary, if the processor predicts that there is no memory order violation, it will speculatively let the two updates run in parallel. Later it will recognize the mistake, flush the pipeline, and re-execute the youngest of the two updates. This is very hurtful for performance.
+Recall from [@sec:uarchLSU] that the processor doesn't necessarily know about a potential store-to-load forwarding, so it has to make a prediction. If it correctly predicts a memory order violation between two updates of color `0xFF`, then these accesses will be serialized. The performance will not be great, but it is the best we could hope for with the initial code. On the contrary, if the processor predicts that there is no memory order violation, it will speculatively let the two updates run in parallel. Later it will recognize the mistake, flush the pipeline, and re-execute the youngest of the two updates. This is very hurtful for performance.
Performance will greatly depend on the color patterns of the input image. Images with long sequences of pixels with the same color will have worse performance than images where colors don't repeat often. The performance of the initial version will be good as long as the distance between two pixels of the same color is long enough. The phrase "long enough" in this context is determined by the size of the out-of-order instruction window. Repeating read-modify-writes of the same color may trigger ordering violations if they occur within a few loop iterations of each other, but not if they occur more than a hundred loop iterations apart.
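
Based on the fragment of the optimized version visible above (a second histogram that is zeroed and then merged into the first), the mitigation could look roughly like this; the details of the book's actual listing may differ:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch: even and odd pixels update different partial histograms, so two
// consecutive pixels of the same color no longer form a store-to-load
// dependency on the same address; the partial counts are merged at the end.
void computeHistogramSplit(const uint8_t *image, std::size_t n,
                           std::array<uint32_t, 256> &hist1) {
  std::array<uint32_t, 256> hist2;
  hist1.fill(0);
  hist2.fill(0);
  std::size_t i = 0;
  for (; i + 1 < n; i += 2) {
    hist1[image[i]]++;      // even pixels
    hist2[image[i + 1]]++;  // odd pixels
  }
  if (i < n)
    hist1[image[i]]++;      // leftover pixel when n is odd
  for (std::size_t c = 0; c < 256; ++c)
    hist1[c] += hist2[c];   // merge partial histograms
}
```
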
@@ -45,7 +45,7 @@ Figure @fig:milan7313P shows the clustered memory hierarchy of the AMD Milan 731

![The clustered memory hierarchy of the AMD Milan 7313P processor.](../../img/other-tuning/Milan7313P.png){#fig:milan7313P width=80%}

-Although there is a total of 128 MiB of LLC, the four cores of a CCX cannot store cache lines in an LLC other than their own 32 MiB LLC (32 MiB/CCX x 4 CCX). Since we will be running single-threaded benchmarks, we can focus on a single CCX. The size of LLC in our experiments will vary from 0 to 32 Mib with steps of 2 Mib.
+Although there is a total of 128 MiB of LLC, the four cores of a CCX cannot store cache lines in an LLC other than their own 32 MiB LLC (32 MiB/CCX x 4 CCX). Since we will be running single-threaded benchmarks, we can focus on a single CCX. The size of LLC in our experiments will vary from 0 to 32 MiB with steps of 2 MiB.

### Workload: SPEC CPU2017 {.unlisted .unnumbered}

@@ -122,7 +122,7 @@ The methodology used in this case study is described in more detail in [@Balance

### Results {.unlisted .unnumbered}

-We run a set of SPEC CPU2017 benchmarks *alone* in the system using only one instance and a single hardware thread. We repeat those runs while changing the available LLC size from 0 to 32 MiB with 2 MiB steps. Figure @fig:characterization_llc shows in graphs, from left to right, CPI, DMPKI, and MPKI for each assigned LLC size. For the CPI chart, a lower value on the Y-axis means better performance. Also, since the frequency on the system is fixed, the CPI chart is reflective of absolute scores. For example, `520.omnetpp` (dotted line) with 32 MiB LLC is 2.5 times faster than with 0 Mib LLC.
+We run a set of SPEC CPU2017 benchmarks *alone* in the system using only one instance and a single hardware thread. We repeat those runs while changing the available LLC size from 0 to 32 MiB with 2 MiB steps. Figure @fig:characterization_llc shows in graphs, from left to right, CPI, DMPKI, and MPKI for each assigned LLC size. For the CPI chart, a lower value on the Y-axis means better performance. Also, since the frequency on the system is fixed, the CPI chart is reflective of absolute scores. For example, `520.omnetpp` (dotted line) with 32 MiB LLC is 2.5 times faster than with 0 MiB LLC.

For the DMPKI and MPKI charts, the lower the value on the Y-axis, the better. Three lines, corresponding to `503.bwaves` (solid), `520.omnetpp` (dotted), and `554.roms` (dashed), represent the three main trends observed in all applications. We do not show the rest of the benchmarks.
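
For reference, these metrics are normalized per thousand retired instructions; assuming the usual definitions (with DMPKI counting only demand misses, i.e., excluding misses generated by hardware prefetchers):

$$ \textrm{MPKI} = \frac{\textrm{LLC misses}}{\textrm{retired instructions}} \times 1000, \qquad \textrm{DMPKI} = \frac{\textrm{demand LLC misses}}{\textrm{retired instructions}} \times 1000 $$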

