[Proofreading] fixed all μ symbols
dendibakh committed Jan 6, 2024
1 parent a72b9f5 commit da56df8
Showing 12 changed files with 33 additions and 33 deletions.
@@ -23,7 +23,7 @@ Two versions of machine code layout for the snippet of code above.

Which layout is better? Well, it depends on whether `cond` is usually true or false. If `cond` is usually true, then we are better off with the default layout because otherwise, we would be doing two jumps instead of one. Also, in the general case, if `coldFunc` is a relatively small function, we would want to have it inlined. However, in this particular example, we know that `coldFunc` is an error-handling function and is likely not executed very often. By choosing layout @fig:BB_better, we maintain fall-through between the hot pieces of the code and convert a taken branch into a not-taken one.

There are a few reasons why the layout presented in Figure @fig:BB_better performs better. First of all, the layout in Figure @fig:BB_better makes better use of the instruction cache and the $\mu$op-cache (DSB, see [@sec:uarchFE]). With all hot code contiguous, there is no cache line fragmentation: all the cache lines in the L1I-cache are used by hot code. The same is true for the $\mu$op-cache since it caches based on the underlying code layout as well. Secondly, taken branches are also more expensive for the fetch unit. The Front-End of a CPU fetches contiguous chunks of bytes, so every taken jump means the bytes after the jump are useless. This reduces the maximum effective fetch throughput. Finally, on some architectures, not-taken branches are fundamentally cheaper than taken ones. For instance, Intel Skylake CPUs can execute two not-taken branches per cycle but only one taken branch every two cycles.[^2]

To suggest that the compiler generate an improved version of the machine code layout, one can provide a hint using the `[[likely]]` and `[[unlikely]]` attributes, which are available since C++20. The code that uses this hint will look like this:
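The listing below is a minimal self-contained sketch of such a hint; the `coldFunc` body, the `process` wrapper, and the `main` driver are illustrative placeholders rather than the book's original example:

```cpp
#include <cstdio>

void coldFunc() { std::puts("error"); }  // illustrative error handler

int process(bool cond) {
  if (cond) [[unlikely]] {   // C++20: tell the compiler this branch is rarely taken
    coldFunc();              // cold path; the compiler may move it out of the hot region
    return -1;
  }
  return 0;                  // hot path stays on the fall-through, not-taken side
}

int main() { return process(false); }
```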

@@ -21,7 +21,7 @@ assert(std::fpclassify(norm) != FP_SUBNORMAL);
Without subnormal values, the subtraction of two FP values `a - b` can underflow and produce zero even though the values are not equal. Subnormal values allow calculations to gradually lose precision without rounding the result to zero. However, this comes with a cost, as we shall see later. Subnormal values may also occur in production software when a value keeps decreasing in a loop with subtraction or division.
From the hardware perspective, handling subnormals is more difficult than handling normal FP values as it requires special treatment and is generally considered an exceptional situation. The application will not crash, but it will pay a performance penalty. Calculations that produce or consume subnormal numbers are much slower than similar calculations on normal numbers and can run 10 times slower or more. For instance, Intel processors currently handle operations on subnormals with a microcode *assist*. When a processor recognizes a subnormal FP value, the Microcode Sequencer (MSROM) provides the necessary microoperations ($\mu$ops) to compute the result.
In many cases, subnormal values are generated naturally by the algorithm and thus are unavoidable. Luckily, most processors give an option to flush subnormal values to zero and not generate subnormals in the first place. Indeed, many users would rather accept slightly less accurate results than slow down the code. However, the opposite argument could be made for financial software: if you flush a subnormal value to zero, you lose precision and cannot scale it up, as it will remain zero. This could make some customers angry.
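On x86, the flush-to-zero (FTZ) and denormals-are-zero (DAZ) modes can be enabled with the SSE control-register intrinsics. Below is a minimal sketch; the halving loop is only an illustration of where subnormals would otherwise appear:

```cpp
#include <cstdio>
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (SSE)

int main() {
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // FTZ: flush subnormal results to zero
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // DAZ: treat subnormal inputs as zero

  volatile float x = 1.0f;
  int steps = 0;
  while (x > 0.0f) { x = x / 2.0f; ++steps; }
  // With FTZ the loop stops as soon as the result would become subnormal
  // (~127 steps in single precision); without it, x crawls through ~23 extra
  // slow subnormal values before reaching zero.
  std::printf("steps: %d\n", steps);
  return 0;
}
```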
2 changes: 1 addition & 1 deletion chapters/16-Glossary/16-0 Glossary.md
@@ -92,7 +92,7 @@

\textbf{TSC} Time Stamp Counter

\textbf{$\mu$op} MicroOperation

\end{multicols}

14 changes: 7 additions & 7 deletions chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md
@@ -18,21 +18,21 @@ Technically, instruction fetch is the first stage to execute an instruction. But

The heart of the BPU is a branch target buffer (BTB) with 12K entries, which keeps information about branches and their targets that is used by the prediction algorithms. Every cycle, the BPU generates the next address to fetch and passes it to the CPU Front-End.

The CPU Front-End fetches 32 bytes of x86 instructions per cycle from the L1 I-cache. This is shared between the two threads, so each thread gets 32 bytes every other cycle. These are complex, variable-length x86 instructions. First, the pre-decode stage determines and marks the boundaries of the variable-length instructions by inspecting them. In x86, the instruction length can range from 1 to 15 bytes. This stage also identifies branch instructions. The pre-decode stage moves up to 6 instructions (also referred to as macroinstructions) to the instruction queue (not shown on the block diagram), which is split between the two threads. The instruction queue also supports a macro-op fusion unit that detects when two macroinstructions can be fused into a single microoperation ($\mu$op). This optimization saves bandwidth in the rest of the pipeline.

Later, up to six pre-decoded instructions are sent from the instruction queue to the decoder unit every cycle. The two SMT threads alternate every cycle to access this interface. The 6-way decoder converts the complex macro-ops into fixed-length $\mu$ops. Decoded $\mu$ops are queued into the Instruction Decode Queue (IDQ), labeled as "$\mu$op Queue" on the diagram.

A major performance-boosting feature of the Front-End is the Decoded Stream Buffer (DSB), or the $\mu$op Cache. The motivation is to cache the macro-op to $\mu$op conversion in a separate structure that works in parallel with the L1 I-cache. When the BPU generates a new address to fetch, the DSB is also checked to see if the $\mu$op translations are already available there. Frequently occurring macro-ops will hit in the DSB, and the pipeline will avoid repeating the expensive pre-decode and decode operations for the 32-byte bundle. The DSB can provide eight $\mu$ops per cycle and can hold up to 4K entries.
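As a rough way to gauge how much of the $\mu$op stream is delivered by the DSB versus the legacy decode path (MITE) and the MSROM, one can count the corresponding delivery events with Linux `perf`. The event names below are Intel-specific and may differ between microarchitectures, so check `perf list` on your machine:

```bash
$ perf stat -e idq.dsb_uops,idq.mite_uops,idq.ms_uops -- ./a.exe
```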

Some very complicated instructions may require more $\mu$ops than the decoders can handle. $\mu$ops for such instructions are served from the Microcode Sequencer (MSROM). Examples of such instructions include hardware support for string manipulation, encryption, synchronization, and others. Also, the MSROM keeps the microcode operations to handle exceptional situations like branch misprediction (which requires a pipeline flush), floating-point assists (e.g., when an instruction operates on a denormalized floating-point value), and others. The MSROM can push up to 4 $\mu$ops per cycle into the IDQ.

The Instruction Decode Queue (IDQ) provides the interface between the in-order Front-End and the out-of-order Back-End. The IDQ queues up the $\mu$ops in order. It can hold 144 $\mu$ops per logical processor in single-thread mode, or 72 $\mu$ops per thread when SMT is active. This is where the in-order CPU Front-End finishes and the out-of-order CPU Back-End starts.

### CPU Back-End {#sec:uarchBE}

The CPU Back-End employs an OOO engine that executes instructions and stores results. The heart of the CPU Back-End is the 512-entry ReOrder Buffer (ROB). This unit is referred to as "Allocate / Rename" on the diagram. It serves a few purposes. First, it provides register renaming. There are only 16 general-purpose integer and 32 vector/SIMD architectural registers; however, the number of physical registers is much higher.[^1] Physical registers are located in a structure called the physical register file (PRF). The mappings from architecture-visible registers to the physical registers are kept in the register alias table (RAT).

Second, the ROB allocates execution resources. When an instruction enters the ROB, a new entry gets allocated and resources are assigned to it, mainly an execution port and an output physical register. The ROB can allocate up to 6 $\mu$ops per cycle.

Third, the ROB tracks speculative execution. When an instruction finishes its execution, its status gets updated, and it stays in the ROB until the previous instructions also finish. It is done that way because instructions are always retired in program order. Once an instruction retires, its ROB entry gets deallocated and the results of the instruction become visible. The retiring stage is wider than the allocation stage: the ROB can retire 8 instructions per cycle.

@@ -43,7 +43,7 @@ There are certain operations which processors handle in a specific manner, often
* **NOP instruction**: `NOP` is often used for padding or alignment purposes. It simply gets marked as completed without being allocated into the reservation station.
* **Other bypasses**: CPU architects also optimized certain arithmetic operations. For example, multiplying any number by one always gives the same number. The same goes for dividing any number by one. Multiplying any number by zero always gives zero, and so on. Some CPUs can recognize such cases at runtime and execute them with shorter latency than a regular multiplication or division.

The "Scheduler / Reservation Station" (RS) is the structure that tracks the availability of all resources for a given μop and dispatches the μop to the assigned port once it is ready. When an instruction enters the RS, scheduler starts tracking its data dependencies. Once all the source operands become available, the RS tries to dispatch it to a free execution port. The RS has fewer entries than the ROB. It can dispatch up to 6 μops per cycle.
The "Scheduler / Reservation Station" (RS) is the structure that tracks the availability of all resources for a given $\mu$op and dispatches the $\mu$op to the assigned port once it is ready. When an instruction enters the RS, scheduler starts tracking its data dependencies. Once all the source operands become available, the RS tries to dispatch it to a free execution port. The RS has fewer entries than the ROB. It can dispatch up to 6 $\mu$ops per cycle.

As shown in Figure @fig:Goldencove_diag, there are 12 execution ports:

@@ -8,4 +8,4 @@ Like many engineering disciplines, Performance Analysis is quite heavy on using

Since we mentioned Linux `perf`, let us briefly introduce the tool, as we have many examples of using it in this and later chapters. Linux `perf` is a performance profiler that you can use to find hotspots in a program, collect various low-level CPU performance events, analyze call stacks, and many other things. We will use Linux `perf` extensively throughout the book as it is one of the most popular performance analysis tools. Another reason why we prefer showcasing Linux `perf` is that it is open source, which allows enthusiastic readers to explore the mechanics of what's going on inside a modern profiling tool. This is especially useful for learning concepts presented in this book because GUI-based tools, like Intel® VTune™ Profiler, tend to hide all the complexity. We will have a more detailed overview of Linux `perf` in Chapter 7.

This chapter is a gentle introduction to the basic terminology and metrics used in performance analysis. We will first define basic terms such as retired/executed instructions, IPC/CPI, $\mu$ops, core/reference clocks, cache misses, and branch mispredictions. Then we will see how to measure the memory latency and bandwidth of a system and introduce some more advanced metrics. In the end, we will benchmark four industry workloads and look at the collected metrics.
@@ -8,7 +8,7 @@ Modern processors typically execute more instructions than the program flow requ

There is an exception. Certain instructions are recognized as idioms and are resolved without actual execution. Examples are NOP, move elimination, and zeroing idioms (see [@sec:uarchBE]). Such instructions do not require an execution unit but are still retired. So, theoretically, there could be a case when the number of retired instructions is higher than the number of executed instructions.

There is a fixed performance counter (PMC) in most modern processors that collects the number of retired instructions. There is no performance event to collect executed instructions, though there is a way to collect executed and retired *$\mu$ops* as we shall see soon. The number of retired instructions can be easily obtained with Linux `perf` by running:

```bash
$ perf stat -e instructions ./a.exe
```
2 changes: 1 addition & 1 deletion chapters/4-Terminology-And-Metrics/4-16 Chapter summary.md
@@ -4,7 +4,7 @@ typora-root-url: ..\..\img

## Chapter Summary {.unlisted .unnumbered}

* In this chapter, we introduced the basic metrics in performance analysis such as retired/executed instructions, CPU utilization, IPC/CPI, $\mu$ops, pipeline slots, core/reference clocks, cache misses and branch mispredictions. We showed how each of these metrics can be collected with Linux `perf`.
* For more advanced performance analysis, there are many derivative metrics that one can collect. For instance, MPKI (misses per kilo instructions), Ip* (instructions per function call, branch, load, etc.), ILP, MLP, and others. The case study in this chapter shows how we can get actionable insights from analyzing these metrics. However, be careful about drawing conclusions just by looking at the aggregate numbers. Don't fall into the trap of "Excel performance engineering", i.e., collecting performance metrics without ever looking at the code. Always seek a second source of data (e.g., performance profiles, discussed later) that will confirm your hypothesis.
* Memory bandwidth and latency are crucial factors in the performance of many production SW packages nowadays, including AI, HPC, databases, and many general-purpose applications. Memory bandwidth depends on the DRAM speed (in MT/s) and the number of memory channels, as the quick arithmetic sketch below illustrates. Modern high-end server platforms have 8-12 memory channels and can reach up to 500 GB/s for the whole system and up to 50 GB/s in single-threaded mode. Memory latency, on the other hand, is not improving much; in fact, it is getting slightly worse with the new DDR4 and DDR5 generations. The majority of systems fall in the range of 70-110 ns per memory access.
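A back-of-the-envelope sketch of the peak-bandwidth arithmetic, assuming a 64-bit (8-byte) bus per channel and DDR5-4800 as an example speed:

$$ \text{Peak BW} = 4800~\text{MT/s} \times 8~\text{bytes} \times 8~\text{channels} \approx 307~\text{GB/s} $$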

26 changes: 13 additions & 13 deletions chapters/4-Terminology-And-Metrics/4-4 UOP.md
@@ -2,47 +2,47 @@
typora-root-url: ..\..\img
---

## Micro-ops {#sec:sec_UOP}

Microprocessors with the x86 architecture translate complex CISC-like instructions into simple RISC-like microoperations, abbreviated as $\mu$ops (micro-ops). A simple addition instruction such as `ADD rax, rbx` generates only one $\mu$op, while a more complex instruction like `ADD rax, [mem]` may generate two: one for reading from the `mem` memory location into a temporary (un-named) register, and one for adding it to the `rax` register. The instruction `ADD [mem], rax` generates three $\mu$ops: one for reading from memory, one for adding, and one for writing the result back to memory.

The main advantage of splitting instructions into microoperations is that $\mu$ops can be executed:

* **Out of order**: consider the `PUSH rbx` instruction, which decrements the stack pointer by 8 bytes and then stores the source operand on the top of the stack. Suppose that `PUSH rbx` is "cracked" into two dependent microoperations after decode:
```
SUB rsp, 8
STORE [rsp], rbx
```
Often, a function prologue saves multiple registers using `PUSH` instructions. In our case, the next `PUSH` instruction can start executing after the `SUB` $\mu$op of the previous `PUSH` instruction finishes, and doesn't have to wait for the `STORE` $\mu$op, which can now proceed asynchronously.

* **In parallel**: consider the `HADDPD xmm1, xmm2` instruction, which sums up (reduces) the two double-precision floating-point values in `xmm1` and in `xmm2` and stores the two results in `xmm1` as follows:
```
xmm1[63:0]   = xmm1[127:64] + xmm1[63:0]
xmm1[127:64] = xmm2[127:64] + xmm2[63:0]
```
One way to microcode this instruction would be to do the following: 1) reduce `xmm2` and store the result in `xmm_tmp1[63:0]`, 2) reduce `xmm1` and store the result in `xmm_tmp2[63:0]`, 3) merge `xmm_tmp1` and `xmm_tmp2` into `xmm1`. Three $\mu$ops in total. Notice that steps 1) and 2) are independent and thus can be done in parallel.

Even though we were just talking about how instructions are split into smaller pieces, $\mu$ops can sometimes also be fused together. There are two types of fusion in modern CPUs:

* **Microfusion**: fuse $\mu$ops from the same machine instruction. Microfusion can only be applied to two types of combinations: memory write operations and read-modify operations. For example:

```bash
add eax, [mem]
```
There are two $\mu$ops in this instruction: 1) read the memory location `mem`, and 2) add it to `eax`. With microfusion, the two $\mu$ops are fused into one at the decoding step.

* **Macrofusion**: fuse $\mu$ops from different machine instructions. The decoders can fuse an arithmetic or logic instruction with a subsequent conditional jump instruction into a single compute-and-branch $\mu$op in certain cases. For example:

```bash
.loop:
dec rdi
jnz .loop
```
With macrofusion, the two $\mu$ops from the `DEC` and `JNZ` instructions are fused into one.

Both Micro- and Macrofusion save bandwidth in all stages of the pipeline from decoding to retirement. The fused operations share a single entry in the reorder buffer (ROB). The capacity of the ROB is utilized better when a fused $\mu$op uses only one entry. Such a fused ROB entry is later dispatched to two different execution ports but is retired again as a single unit. Readers can learn more about $\mu$op fusion in [@fogMicroarchitecture].

To collect the number of issued, executed, and retired $\mu$ops for an application, you can use Linux `perf` as follows:

```bash
$ perf stat -e uops_issued.any,uops_executed.thread,uops_retired.slots -- ./a.exe
2557884 uops_retired.slots
```

The way instructions are split into microoperations may vary across CPU generations. Usually, a lower number of $\mu$ops for an instruction means that the hardware has better support for it, and the instruction is likely to have lower latency and higher throughput. For the latest Intel and AMD CPUs, the vast majority of instructions generate exactly one $\mu$op. Latency, throughput, port usage, and the number of $\mu$ops for x86 instructions on recent microarchitectures can be found at the [uops.info](https://uops.info/table.html)[^1] website.

[^1]: Instruction latency and Throughput - [https://uops.info/table.html](https://uops.info/table.html)
2 changes: 1 addition & 1 deletion chapters/4-Terminology-And-Metrics/4-5 Pipeline Slot.md
@@ -4,7 +4,7 @@ typora-root-url: ..\..\img

## Pipeline Slot {#sec:PipelineSlot}

Another important metric that some performance tools use is the concept of a *pipeline slot*. A pipeline slot represents the hardware resources needed to process one $\mu$op. Figure @fig:PipelineSlot demonstrates the execution pipeline of a CPU that has 4 allocation slots every cycle. That means that the core can assign execution resources (renamed source and destination registers, execution port, ROB entries, etc.) to 4 new $\mu$ops every cycle. Such a processor is usually called a *4-wide machine*. During the six consecutive cycles on the diagram, only half of the available slots were utilized. From a microarchitecture perspective, the efficiency of executing such code is only 50%.

![Pipeline diagram of a 4-wide CPU.](../../img/terms-and-metrics/PipelineSlot.jpg){#fig:PipelineSlot width=40% }
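For completeness, here is the arithmetic behind that 50% figure, assuming a 4-wide machine observed over six cycles with half of the slots filled:

$$ \text{Utilization} = \frac{\text{slots used}}{\text{machine width} \times \text{cycles}} = \frac{12}{4 \times 6} = 50\% $$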

@@ -43,7 +43,7 @@ Latency (in core cycles)

ILP    Instr.-Level-Parallelism per-core     UOPS_EXECUTED.THREAD /
       (average number of $\mu$ops executed  UOPS_EXECUTED.CORE_CYCLES_GE_1,
       when there is execution)              divide by 2 if SMT is enabled

MLP    Memory-Level-Parallelism              L1D_PEND_MISS.PENDING /
@@ -47,11 +47,11 @@ The code looks good, but is it really optimal? Let's find out. We took the assem
UICA is a very simplified model of the actual CPU pipeline. For example, you may notice that the instruction fetch and decode stages are missing. Also, UICA doesn't account for cache misses and branch mispredictions, so it assumes that all memory accesses always hit in the L1 cache and branches are always predicted correctly, which we know is not the case in modern processors. Again, this is irrelevant for our experiment as we can still use the simulation results to find a way to improve the code. Can you see the issue?
Let's examine the diagram. First of all, every `FMA` instruction is broken into two $\mu$ops (see \circled{1}): a load $\mu$op that goes to ports `{2,3}` and an FMA $\mu$op that can go to ports `{0,1}`. The load $\mu$op has a latency of 5 cycles: it starts at cycle 11 and finishes at cycle 15. The FMA $\mu$op has a latency of 4 cycles: it starts at cycle 19 and finishes at cycle 22. All FMA $\mu$ops depend on load $\mu$ops, and we can clearly see this on the diagram: FMA $\mu$ops always start after the corresponding load $\mu$op finishes. Now find the two `r` cells at cycle 10: they are ready to be dispatched, but RocketLake has only two load ports, and both are already occupied in that cycle. So, these two loads are issued in the next cycle.
The loop has four cross-iteration dependencies over `ymm2-ymm5`. The FMA $\mu$op from instruction \circled{2} that writes into `ymm2` cannot start execution before instruction \circled{1} from the previous iteration finishes. Notice that the FMA $\mu$op from instruction \circled{2} was dispatched in the same cycle 22 in which instruction \circled{1} finished its execution. You can observe this pattern for the other FMA instructions as well.
So, "what is the problem?", you ask. Look at the top right corner of the image. For each cycle, we added the number of executed FMA μops, this is not printed by UICA. It goes like `1,2,1,0,1,2,1,...`, or an average of one FMA μop per cycle. Most of the recent Intel processors have two FMA execution units, thus can issue two FMA μops per cycle. The diagram clearly shows the gap as every forth cycle there are no FMAs executed. As we figured out before, no FMA μops can be dispatched because their inputs (`ymm2-ymm5`) are not ready.
So, "what is the problem?", you ask. Look at the top right corner of the image. For each cycle, we added the number of executed FMA $\mu$ops, this is not printed by UICA. It goes like `1,2,1,0,1,2,1,...`, or an average of one FMA $\mu$op per cycle. Most of the recent Intel processors have two FMA execution units, thus can issue two FMA $\mu$ops per cycle. The diagram clearly shows the gap as every forth cycle there are no FMAs executed. As we figured out before, no FMA $\mu$ops can be dispatched because their inputs (`ymm2-ymm5`) are not ready.
To increase the utilization of the FMA execution units from 50% to 100%, we need to unroll the loop by a factor of two. This will double the number of accumulators from 4 to 8. Also, instead of 4 independent data flow chains, we would have 8. We will not show the simulations of the unrolled version here; you can experiment on your own. Instead, let us confirm the hypothesis by running the two versions on real hardware. By the way, it is always a good idea to verify, since static performance analyzers like UICA are not accurate models. Below, we show the output of two [nanobench](https://github.com/andreas-abel/nanoBench) tests that we ran on a recent Alderlake processor. The tool takes the provided assembly instructions (`-asm` option) and creates a benchmark kernel. Readers can look up the meaning of other parameters in the nanobench documentation. The original code on the left executes 4 instructions in 4 cycles, while the improved version can execute 8 instructions in 4 cycles. Now we can be sure we maximized the FMA execution throughput: the code on the right keeps the FMA units busy all the time.
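The unrolled kernel itself would look roughly like the sketch below; the register numbers, the stride, and the loop counter in `rdi` are illustrative rather than the book's exact listing:

```bash
.loop:
  vfmadd231pd ymm2, ymm0, [rax]        # 8 independent accumulators (ymm2-ymm9)
  vfmadd231pd ymm3, ymm0, [rax+32]     # instead of 4, so both FMA units can be
  vfmadd231pd ymm4, ymm0, [rax+64]     # kept busy every cycle
  vfmadd231pd ymm5, ymm0, [rax+96]
  vfmadd231pd ymm6, ymm0, [rax+128]
  vfmadd231pd ymm7, ymm0, [rax+160]
  vfmadd231pd ymm8, ymm0, [rax+192]
  vfmadd231pd ymm9, ymm0, [rax+224]
  add rax, 256
  dec rdi
  jnz .loop
```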
@@ -6,11 +6,11 @@ typora-root-url: ..\..\img

Top-down Microarchitecture Analysis (TMA) methodology is a very powerful technique for identifying CPU bottlenecks in a program. It is a robust and formal methodology that is easy to use even for inexperienced developers. The best part of this methodology is that it does not require a developer to have a deep understanding of the microarchitecture and PMCs in the system to efficiently find CPU bottlenecks.

At a conceptual level, TMA identifies what was stalling the execution of a program. Figure @fig:TMA_concept illustrates the core idea of TMA. This is not how the analysis works in practice, because analyzing every single microoperation ($\mu$op) would be terribly slow. Nevertheless, the diagram is helpful for understanding the methodology.

![The concept behind TMA's top-level breakdown. *© Image from [@TMA_ISPASS]*](../../img/pmu-features/TMAM_diag.png){#fig:TMA_concept width=80%}

Here is a short guide on how to read this diagram. As we know from [@sec:uarch], there are internal buffers in the CPU that keep track of information about $\mu$ops that are being executed. Whenever a new instruction is fetched and decoded, new entries in those buffers are allocated. If a $\mu$op for the instruction was not allocated during a particular cycle of execution, it could be for one of two reasons: either we were not able to fetch and decode it (`Front End Bound`); or the Back End was overloaded with work, and resources for the new $\mu$op could not be allocated (`Back End Bound`). If a $\mu$op was allocated and scheduled for execution but never retired, this means it came from a mispredicted path (`Bad Speculation`). Finally, `Retiring` represents a normal execution. It is the bucket where we want all our $\mu$ops to be, although there are exceptions which we will talk about later.

To accomplish its goal, TMA observes the execution of the program by monitoring a specific set of performance events and then calculating metrics based on predefined formulas. Based on those metrics, TMA characterizes the program by assigning it to one of the four high-level buckets. Each of the four high-level categories has several nested levels, which CPU vendors may choose to implement differently. Each generation of processors may have different formulas for calculating those metrics, so it's better to rely on tools to do the analysis rather than trying to calculate them yourself.
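On recent Intel CPUs with a reasonably new Linux kernel, a quick way to get the four top-level buckets is the `--topdown` option of `perf stat`; availability and exact output depend on the kernel and CPU, and older microarchitectures may require system-wide mode (`-a`):

```bash
$ perf stat --topdown -- ./a.exe
```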
