diff --git a/chapters/3-CPU-Microarchitecture/3-1 ISA.md b/chapters/3-CPU-Microarchitecture/3-1 ISA.md index 152d242e3f..f22cce277d 100644 --- a/chapters/3-CPU-Microarchitecture/3-1 ISA.md +++ b/chapters/3-CPU-Microarchitecture/3-1 ISA.md @@ -1,9 +1,9 @@ ## Instruction Set Architecture -The instruction set architecture (ISA) is the contract between the software and the hardware, which defines the rules of communication. Intel x86-64,[^1] ARM v8 and RISC-V are examples of current-day ISAs that are widely deployed. All of these are 64-bit architectures, i.e., all address computations use 64 bits. ISA developers and CPU architects typically ensure that software or firmware conforming to the specification will execute on any processor built using the specification. Widely deployed ISAs also typically ensure backward compatibility such that code written for the GenX version of a processor will continue to execute on GenX+i. +The instruction set architecture (ISA) is the contract between the software and the hardware, which defines the rules of communication. Intel x86-64,[^1] Armv8-A and RISC-V are examples of current-day ISAs that are widely deployed. All of these are 64-bit architectures, i.e., all address computations use 64 bits. ISA developers and CPU architects typically ensure that software or firmware conforming to the specification will execute on any processor built using the specification. Widely deployed ISAs also typically ensure backward compatibility such that code written for the GenX version of a processor will continue to execute on GenX+i. -Most modern architectures can be classified as general-purpose register-based, load-store architectures, such as RISC-V and ARM where the operands are explicitly specified, and memory is accessed only using load and store instructions. The X86 ISA is a register-memory architecture, where operations can be performed on registers, as well as memory operands. In addition to providing the basic functions in an ISA such as load, store, control and scalar arithmetic operations using integers and floating-point, the widely deployed architectures continue to enhance their ISA to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE, RISC-V "V" vector extension) and matrix/tensor instructions (Intel AMX, ARM SME). Software mapped to use these advanced instructions typically provides orders of magnitude improvement in performance. +Most modern architectures can be classified as general-purpose register-based, load-store architectures, such as RISC-V and ARM where the operands are explicitly specified, and memory is accessed only using load and store instructions. The X86 ISA is a register-memory architecture, where operations can be performed on registers, as well as memory operands. In addition to providing the basic functions in an ISA such as load, store, control and scalar arithmetic operations using integers and floating-point, the widely deployed architectures continue to augment their ISAs to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE, RISC-V "V" vector extension) and matrix/tensor instructions (Intel AMX, ARM SME). Software mapped to use these advanced instructions typically provides orders of magnitude improvement in performance. -Modern CPUs support 32-bit and 64-bit precision for floating-point and integer arithmetic operations. 
With the fast-evolving field of machine learning and AI, the industry has a renewed interest in alternative numeric formats for variables to drive significant performance improvements. Research has shown that machine learning models perform just as well, using fewer bits to represent variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added support for lower precision data types such as 8-bit integers (int8), 16-bit floating-point (fp16, bf16) in the ISA, in addition to the traditional 32-bit and 64-bit formats for arithmetic operations. +Modern CPUs support 32-bit and 64-bit precision for floating-point and integer arithmetic operations. With the fast-evolving fields of machine learning and AI, the industry has a renewed interest in alternative numeric formats to drive significant performance improvements. Research has shown that machine learning models perform just as well using fewer bits to represent variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added support for lower precision data types such as 8-bit integers (int8) and 16-bit floating-point (fp16, bf16) to the ISA, in addition to the traditional 32-bit and 64-bit formats for arithmetic operations. -[^1]: In the book we sometimes write x86 for shortness, but we assume x86-64, which is a 64-bit version of the x86 instruction set, first announced in 1999. \ No newline at end of file +[^1]: In the book we sometimes write x86 for brevity, but we assume x86-64, which is a 64-bit version of the x86 instruction set, first announced in 1999. diff --git a/chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md b/chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md index dc0b8b38df..4b8db6141a 100644 --- a/chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md +++ b/chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md @@ -6,8 +6,8 @@ * The details of the implementation are encapsulated in the term CPU "microarchitecture". This topic has been researched by thousands of computer scientists for a long time. Through the years, many smart ideas were invented and implemented in mass-market CPUs. The most notable are pipelining, out-of-order execution, superscalar engines, speculative execution and SIMD processors. All these techniques help exploit Instruction-Level Parallelism (ILP) and improve single-threaded performance. * In parallel with single-threaded performance, hardware designers began pushing multi-threaded performance. The vast majority of modern client-facing devices have a processor containing multiple cores. Some processors double the number of observable CPU cores with the help of Simultaneous Multithreading (SMT). SMT enables multiple software threads to run simultaneously on the same physical core using shared resources. A more recent technique in this direction is called "hybrid" processors which combine different types of cores in a single package to better support a diversity of workloads. * The memory hierarchy in modern computers includes several levels of cache that reflect different tradeoffs in speed of access vs. size. L1 cache tends to be closest to a core, fast but small. The L3/LLC cache is slower but also bigger. DDR is the predominant DRAM technology used in most platforms. DRAM modules vary in the number of ranks and memory width which may have a slight impact on system performance. Processors may have multiple memory channels to access more than one DRAM module simultaneously. 
-* Virtual memory is the mechanism for sharing physical memory with all the processes running on the CPU. Programs use virtual addresses in their accesses, which get translated into physical addresses. The memory space is split into pages. The default page size on x86 is 4KB, and on ARM is 16KB. Only the page address gets translated, the offset within the page is used as is. The OS keeps the translation in the page table, which is implemented as a radix tree. There are hardware features that improve the performance of address translation: mainly the Translation Lookaside Buffer (TLB) and hardware page walkers. Also, developers can utilize Huge Pages to mitigate the cost of address translation in some cases (see [@sec:secDTLB]). -* We looked at the design of Intel's recent GoldenCove microarchitecture. Logically, the core is split into a Front End and a Back End. The Front-End consists of a Branch Predictor Unit (BPU), L1-I cache, instruction fetch and decode logic, and the IDQ, which feeds instructions to the CPU Back End. The Back-End consists of the OOO engine, execution units, the load-store unit, the L1-D cache, and the TLB hierarchy. +* Virtual memory is the mechanism for sharing physical memory with all the processes running on the CPU. Programs use virtual addresses in their accesses, which get translated into physical addresses. The memory space is split into pages. The default page size on x86 is 4KB, and on ARM is 16KB. Only the page address gets translated; the offset within the page is used as-is. The OS keeps the translation in the page table, which is implemented as a radix tree. There are hardware features that improve the performance of address translation: mainly the Translation Lookaside Buffer (TLB) and hardware page walkers. Also, developers can utilize Huge Pages to mitigate the cost of address translation in some cases (see [@sec:secDTLB]). +* We looked at the design of Intel's recent GoldenCove microarchitecture. Logically, the core is split into a frontend and a backend. The frontend consists of a Branch Predictor Unit (BPU), L1-I cache, instruction fetch and decode logic, and the IDQ, which feeds instructions to the backend. The backend consists of the OOO engine, execution units, the load-store unit, the L1-D cache, and the TLB hierarchy. * Modern processors have performance monitoring features that are encapsulated into a Performance Monitoring Unit (PMU). This unit is built around a concept of Performance Monitoring Counters (PMC) that enables observation of specific events that happen while a program is running, for example, cache misses and branch mispredictions. \sectionbreak diff --git a/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md b/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md index a68533322f..9becb0d674 100644 --- a/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md +++ b/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md @@ -1,6 +1,6 @@ ## Pipelining -Pipelining is the foundational technique used to make CPUs fast wherein multiple instructions are overlapped during their execution. Pipelining in CPUs drew inspiration from the automotive assembly lines. The processing of instructions is divided into stages. The stages operate in parallel, working on different parts of different instructions. DLX is a relatively simple architecture designed by John L. Hennessy and David A. Patterson in 1994. 
As defined in [@Hennessy], it has a 5-stage pipeline which consists of: +Pipelining is a foundational technique used to make CPUs fast wherein multiple instructions are overlapped during their execution. Pipelining in CPUs drew inspiration from automotive assembly lines. The processing of instructions is divided into stages. The stages operate in parallel, working on different parts of different instructions. DLX is a relatively simple architecture designed by John L. Hennessy and David A. Patterson in 1994. As defined in [@Hennessy], it has a 5-stage pipeline which consists of: 1. Instruction fetch (IF) 2. Instruction decode (ID) @@ -35,9 +35,9 @@ In real implementations, pipelining introduces several constraints that limit th R2 = R1 ADD 2 ``` - There is a RAW dependency for register R1. If we take the value directly after addition `R0 ADD 1` is done (from the `EXE` pipeline stage), we don't need to wait until the `WB` stage finishes, and the value will be written to the register file. Bypassing helps to save a few cycles. The longer the pipeline, the more effective bypassing becomes. + There is a RAW dependency for register R1. If we take the value directly after addition `R0 ADD 1` is done (from the `EXE` pipeline stage), we don't need to wait until the `WB` stage finishes (when the value will be written to the register file). Bypassing helps to save a few cycles. The longer the pipeline, the more effective bypassing becomes. - A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read. A WAR hazard is not a true dependency and is eliminated by a technique called *register renaming*. It is a technique that abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (*architectural*) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of the *architectural state, solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example: + A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read. A WAR hazard is not a true dependency and is eliminated by a technique called *register renaming*. It is a technique that abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (*architectural*) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of the *architectural* state, solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example: ``` ; machine code, WAR hazard ; after register renaming @@ -61,4 +61,4 @@ In real implementations, pipelining introduces several constraints that limit th * **Control hazards**: are caused due to changes in the program flow. They arise from pipelining branches and other instructions that change the program flow. The branch condition that determines the direction of the branch (taken vs. not taken) is resolved in the execute pipeline stage. As a result, the fetch of the next instruction cannot be pipelined unless the control hazard is eliminated. 
Techniques such as dynamic branch prediction and speculative execution described in the next section are used to overcome control hazards. -\lstset{linewidth=\textwidth} \ No newline at end of file +\lstset{linewidth=\textwidth} diff --git a/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md b/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md index 348b4ac284..cba64356fc 100644 --- a/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md +++ b/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md @@ -1,30 +1,30 @@ ## Exploiting Instruction Level Parallelism (ILP) -Most instructions in a program lend themselves to be pipelined and executed in parallel, as they are independent. Modern CPUs implement a large menu of additional hardware features to exploit such instruction-level parallelism (ILP), i.e., parallelism within a single stream of instructions. Working in concert with advanced compiler techniques, these hardware features provide significant performance improvements. +Most instructions in a program lend themselves to be pipelined and executed in parallel, as they are independent. Modern CPUs implement numerous additional hardware features to exploit such instruction-level parallelism (ILP), i.e., parallelism within a single stream of instructions. Working in concert with advanced compiler techniques, these hardware features provide significant performance improvements. ### Out-Of-Order (OOO) Execution The pipeline example in Figure @fig:Pipelining shows all instructions moving through the different stages of the pipeline in order, i.e., in the same order as they appear in the program, also known as *program order*. Most modern CPUs support *out-of-order* (OOO) execution, where sequential instructions can enter the execution stage in any arbitrary order only limited by their dependencies and resource availability. CPUs with OOO execution must still give the same result as if all instructions were executed in the program order. -An instruction is called *retired* after it is finally executed, and its results are correct and visible in the architectural state. To ensure correctness, CPUs must retire all instructions in the program order. OOO execution is primarily used to avoid underutilization of CPU resources due to stalls caused by dependencies, especially in superscalar engines, which we will discuss shortly. +An instruction is called *retired* after it is finally executed, and its results are correct and visible in the architectural state. To ensure correctness, CPUs must retire all instructions in the program order. OOO execution is primarily used to avoid underutilization of CPU resources due to stalls caused by dependencies, especially in superscalar engines, which we will discuss shortly. ![The concept of out-of-order execution.](../../img/uarch/OOO.png){#fig:OOO width=90%} -Figure @fig:OOO illustrates the concept of out-of-order execution. Let's assume that instruction `x+1` cannot be executed during cycles 4 and 5 due to some conflict. An in-order CPU would stall all subsequent instructions from entering the EXE pipeline stage, so instruction `x+2` would begin executing only at cycle 7. In a CPU with OOO execution, an instruction can begin executing as long as it does not have any conflicts (e.g., its inputs are available, execution unit is not occupied, etc.). As you can see on the diagram, instruction `x+2` started executing before instruction `x+1`. All instructions still retire in order, i.e., the instructions complete the WB stage in the program order. 
+Figure @fig:OOO illustrates the concept of out-of-order execution. Let's assume that instruction `x+1` cannot be executed during cycles 4 and 5 due to some conflict. An in-order CPU would stall all subsequent instructions from entering the EXE pipeline stage, so instruction `x+2` would begin executing only at cycle 7. In a CPU with OOO execution, an instruction can begin executing as long as it does not have any conflicts (e.g., its inputs are available, execution unit is not occupied, etc.). As you can see on the diagram, instruction `x+2` started executing before instruction `x+1`. All instructions still retire in order, i.e., the instructions complete the WB stage in the program order. The process of reordering instructions is often called instruction *scheduling*. The goal of scheduling is to issue instructions in such a way as to minimize pipeline hazards and maximize the utilization of CPU resources. Instruction scheduling can be done at compile time (static scheduling), or at runtime (dynamic scheduling). Let's unpack both options. #### Static scheduling -The Intel Itanium which was meant as a replacement for the x86, is an example of static scheduling. With static scheduling of a superscalar, multi-execution unit machine, the scheduling is moved from the hardware to the compiler using a technique known as VLIW - Very Long Instruction Word. The rationale is to simplify the hardware by requiring the compiler to choose the right mix of instructions to keep the machine fully utilized. Compilers can use techniques such as software pipelining and loop unrolling to look farther ahead than can be reasonably supported by hardware structures to find the right ILP. +The Intel Itanium was an example of static scheduling. With static scheduling of a superscalar, multi-execution unit machine, the scheduling is moved from the hardware to the compiler using a technique known as VLIW (Very Long Instruction Word). The rationale is to simplify the hardware by requiring the compiler to choose the right mix of instructions to keep the machine fully utilized. Compilers can use techniques such as software pipelining and loop unrolling to look farther ahead than can be reasonably supported by hardware structures to find the right ILP. The Intel Itanium never managed to become a success for a few reasons. One of them was a lack of compatibility with x86 code. Another was the fact that it was very difficult for compilers to schedule instructions in such a way as to keep the CPU busy due to variable load latencies. The 64-bit extension of the x86 ISA (x86-64) was introduced in the same time window by AMD and was compatible with x86 and eventually became the real successor of the x86. The last Intel Itanium processors were shipped in 2021. #### Dynamic scheduling -To overcome the problem with static scheduling, modern processors use dynamic scheduling. The two most important algorithms for dynamic scheduling are [Scoreboarding](https://en.wikipedia.org/wiki/Scoreboarding),[^4] and the [Tomasulo algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm).[^5] +To overcome the problem with static scheduling, modern processors use dynamic scheduling. The two most important algorithms for dynamic scheduling are [Scoreboarding](https://en.wikipedia.org/wiki/Scoreboarding)[^4] and the [Tomasulo algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm).[^5] -Scoreboarding was first implemented in the CDC6600 processor in the 1960s. 
Its main drawback is that it not only preserves true dependencies (RAW), but also false dependencies (WAW and WAR), and therefore it provides suboptimal ILP. False dependencies are caused by the small number of architectural registers, which is typically between 16 and 32 in modern ISAs. That is why all modern processors have adopted the Tomasulo algorithm for dynamic scheduling. The Tomasulo algorithm was invented in the 1960s by Robert Tomasulo and first implemented in the IBM360 model 91. +Scoreboarding was first implemented in the CDC6600 processor in the 1960s. Its main drawback is that it not only preserves true dependencies (RAW), but also false dependencies (WAW and WAR), and therefore it provides suboptimal ILP. False dependencies are caused by the small number of architectural registers, which is typically between 16 and 32 in modern ISAs. That is why all modern processors have adopted the Tomasulo algorithm for dynamic scheduling. The Tomasulo algorithm was invented in the 1960s by Robert Tomasulo and first implemented in the IBM360 Model 91. To eliminate false dependencies, the Tomasulo algorithm makes use of register renaming, which we discussed in the previous section. Because of that, performance is greatly improved compared to scoreboarding. However, a sequence of instructions that carry a RAW dependency, also known as a *dependency chain*, is still problematic for OOO execution, because there is no increase in the ILP after register renaming since all the RAW dependencies are preserved. Dependency chains are often found in loops (loop carried dependency) where the current loop iteration depends on the results produced on the previous iteration. @@ -34,7 +34,7 @@ From the ROB, instructions are inserted in the RS, which has much fewer entries. ### Superscalar Engines -Most modern CPUs are superscalar i.e., they can issue more than one instruction in a given cycle. Issue width is the maximum number of instructions that can be issued during the same cycle. The typical issue width of a mainstream CPU in 2024 ranges from 6 to 9. To ensure the right balance, such superscalar engines also have more than one execution unit and/or pipelined execution units. CPUs also combine superscalar capability with deep pipelines and out-of-order execution to extract the maximum ILP for a given piece of software. +Most modern CPUs are superscalar, i.e. they can issue more than one instruction in a given cycle. Issue width is the maximum number of instructions that can be issued during the same cycle. The typical issue width of a mainstream CPU in 2024 ranges from 6 to 9. To ensure the right balance, such superscalar engines also have more than one execution unit and/or pipelined execution units. CPUs also combine superscalar capability with deep pipelines and out-of-order execution to extract the maximum ILP for a given piece of software. ![A pipeline diagram of a code executing in a 2-way superscalar CPU.](../../img/uarch/SuperScalar.png){#fig:SuperScalar width=65%} @@ -42,9 +42,9 @@ Figure @fig:SuperScalar shows a pipeline diagram of a CPU that supports 2-wide i ### Speculative Execution {#sec:SpeculativeExec} -As noted in the previous section, control hazards can cause significant performance loss in a pipeline if instructions are stalled until the branch condition is resolved. One technique to avoid this performance loss is hardware *branch prediction*. 
Using this technique, a CPU predicts the likely direction of branches and allow executing instructions from the predicted path (known as *speculative execution*). +As noted in the previous section, control hazards can cause significant performance loss in a pipeline if instructions are stalled until the branch condition is resolved. One technique to avoid this performance loss is hardware *branch prediction*. Using this technique, a CPU predicts the likely target of branches and executes instructions from the predicted path (known as *speculative execution*). -Let's consider an example in @lst:Speculative. For a processor to understand which function it should execute next, it should know whether the condition `a < b` is false or true. Without knowing that, the CPU waits until the result of the branch instruction is determined, as shown in Figure @fig:NoSpeculation. +Let's consider an example in @lst:Speculative. For a processor to understand which function it should execute next, it should know whether the condition `a < b` is false or true. Without knowing that, the CPU waits until the result of the branch instruction is determined, as shown in Figure @fig:NoSpeculation. Listing: Speculative execution @@ -78,18 +78,18 @@ As we just have seen, correct predictions greatly improve execution as they allo * **Conditional branches**: they have two potential outcomes: taken or not taken. Taken branches can go forward or backward. Forward conditional branches are usually generated for `if-else` statements, which have a high chance of not being taken, as they frequently represent error-checking code. Backward conditional jumps are frequently seen in loops and are used to go to the next iteration of a loop; such branches are usually taken. * **Indirect calls and jumps**: they have many targets. An indirect jump or indirect call can be generated for a `switch` statement, a function pointer, or a `virtual` function call. A return from a function deserves attention because it has many potential targets as well. -Most prediction algorithms are based on previous outcomes of the branch. The core of the branch prediction unit (BPU) is a branch target buffer (BTB), which caches the target addresses for every branch. Prediction algorithms consult the BTB every cycle to generate the next address from which to fetch instructions. The CPU uses that new address to fetch the next block of instructions. If no branches are identified in the current fetch block, the next address to fetch will be the next sequential aligned fetch block (fall through). +Most prediction algorithms are based on previous outcomes of the branch. The core of the branch prediction unit (BPU) is a branch target buffer (BTB), which caches the target addresses for every branch. Prediction algorithms consult the BTB every cycle to generate the next address from which to fetch instructions. The CPU uses that new address to fetch the next block of instructions. If no branches are identified in the current fetch block, the next address to fetch will be the next sequential aligned fetch block (fall through). -Unconditional branches do not require prediction; we just need to look up the target address in the BTB. Every cycle the BPU needs to generate the next address from which to fetch instructions to avoid pipeline stalls. We could have extracted the address just from the instruction encoding itself, but then we have to wait until the decode stage is over, which will introduce a bubble in the pipeline and make things slower. 
So, the next fetch address has to be determined at the time when the branch is fetched. +Unconditional branches do not require prediction; we just need to look up the target address in the BTB. Every cycle the BPU needs to generate the next address from which to fetch instructions to avoid pipeline stalls. We could extract the address from the instruction encoding itself, but then we would have to wait until the decode stage is over, which would introduce a bubble in the pipeline and make things slower. So, the next fetch address has to be determined at the time when the branch is fetched. -For conditional branches, we first need to predict whether the branch will be taken or not. If it is not taken, then we fall through and there is no need to look up the target. Otherwise, we look up the target address in the BTB. Conditional branches usually account for the biggest portion of total branches and are the main source of misprediction penalties in production software. For indirect branches, we need to select one of the possible targets, but the prediction algorithm can be very similar to conditional branches. +For conditional branches, we first need to predict whether the branch will be taken or not. If it is not taken, then we fall through and there is no need to look up the target. Otherwise, we look up the target address in the BTB. Conditional branches usually account for the biggest portion of total branches and are the main source of misprediction penalties in production software. For indirect branches, we need to select one of the possible targets, but the prediction algorithm can be very similar to conditional branches. All prediction mechanisms try to exploit two important principles, which are similar to what we will discuss with caches later: * **Temporal correlation**: the way a branch resolves may be a good predictor of the way it will resolve at the next execution. This is also known as local correlation. * **Spatial correlation**: several adjacent branches may resolve in a highly correlated manner (a preferred path of execution). This is also known as global correlation. -The best accuracy is often achieved by leveraging local and global correlations together. So, not only do we look at the outcome history of the current branch, but also we correlate it with the outcomes of other branches. +The best accuracy is often achieved by leveraging local and global correlations together. So, not only do we look at the outcome history of the current branch, but we also correlate it with the outcomes of other branches. Another common technique used is called hybrid prediction. The idea is that some branches have biased behavior. For example, if a conditional branch goes in one direction 99.9% of the time, there is no need to use a complex predictor and pollute its data structures. A much simpler mechanism can be used instead. Another example is a loop branch. If a branch has loop behavior, then it can be predicted using a dedicated loop predictor, which will remember the number of iterations the loop typically executes. @@ -97,4 +97,4 @@ Today, state-of-the-art prediction is dominated by TAGE-like [@Seznec2006] or pe [^4]: Scoreboarding - [https://en.wikipedia.org/wiki/Scoreboarding](https://en.wikipedia.org/wiki/Scoreboarding). [^5]: Tomasulo algorithm - [https://en.wikipedia.org/wiki/Tomasulo_algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm). -[^6]: 5th Championship Branch Prediction competition - [https://jilp.org/cbp2016](https://jilp.org/cbp2016). 
\ No newline at end of file +[^6]: 5th Championship Branch Prediction competition - [https://jilp.org/cbp2016](https://jilp.org/cbp2016). diff --git a/chapters/3-CPU-Microarchitecture/3-4 SIMD.md b/chapters/3-CPU-Microarchitecture/3-4 SIMD.md index e8407a6edf..cb93f5252c 100644 --- a/chapters/3-CPU-Microarchitecture/3-4 SIMD.md +++ b/chapters/3-CPU-Microarchitecture/3-4 SIMD.md @@ -19,15 +19,15 @@ For regular SISD instructions, processors utilize general-purpose registers. Sim A vector execution unit is logically divided into *lanes*. In the context of SIMD, a lane refers to a distinct data pathway within the SIMD execution unit and processes one element of the vector. In our example, each lane processes 64-bit elements (double-precision), so there will be 4 lanes in a 256-bit register. -Most of the popular CPU architectures feature vector instructions, including x86, PowerPC, ARM, and RISC-V. In 1996 Intel released MMX, a SIMD instruction set, that was designed for multimedia applications. Following MMX, Intel introduced new instruction sets with added capabilities and increased vector size: SSE, AVX, AVX2, AVX-512. Arm has optionally supported the 128-bit NEON instruction set in various versions of its architecture. In version 8 (aarch64), this support was made mandatory, and new instructions were added. +Most of the popular CPU architectures feature vector instructions, including x86, PowerPC, ARM, and RISC-V. In 1996 Intel released MMX, a SIMD instruction set, that was designed for multimedia applications. Following MMX, Intel introduced new instruction sets with added capabilities and increased vector size: SSE, AVX, AVX2, and AVX-512. Arm has optionally supported the 128-bit NEON instruction set in various versions of its architecture. In version 8 (aarch64), this support was made mandatory, and new instructions were added. As the new instruction sets became available, work began to make them usable to software engineers. The software changes required to exploit SIMD instructions are known as *code vectorization*. Initially, SIMD instructions were programmed in assembly. Later, special compiler intrinsics, which are small functions providing a one-to-one mapping to SIMD instructions, were introduced. Today all the major compilers support autovectorization for the popular processors, i.e., they can generate SIMD instructions straight from high-level code written in C/C++, Java, Rust and other languages. To enable code to run on systems that support different vector lengths, Arm introduced the SVE instruction set. Its defining characteristic is the concept of *scalable vectors*: their length is unknown at compile time. With SVE, there is no need to port software to every possible vector length. Users don't have to recompile the source code of their applications to leverage wider vectors when they become available in newer CPU generations. Another example of scalable vectors is the RISC-V V extension (RVV), which was ratified in late 2021. Some implementations support quite wide (2048-bit) vectors, and up to eight can be grouped together to yield 16384-bit vectors, which greatly reduces the number of instructions executed. At each loop iteration, user code typically does `ptr += number_of_lanes`, where `number_of_lanes` is not known at compile time. ARM SVE provides special instructions for such length-dependent operations, while RVV enables a programmer to query/set the `number_of_lanes`. 
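
To make the scalable-vector idea above more concrete, here is a hedged sketch in C using the Arm SVE intrinsics from `<arm_sve.h>` (the function name `add_arrays` and the build setup are illustrative; it assumes a compiler and target with SVE enabled, e.g., `-march=armv8-a+sve`). The hardware vector length never appears as a constant: `svcntd()` reports how many 64-bit lanes the running CPU provides, and the predicate produced by `svwhilelt_b64_s64` keeps the final partial iteration correct.

```c
#include <stdint.h>
#include <arm_sve.h>   // Arm C Language Extensions for SVE

// Length-agnostic elementwise add: c[i] = a[i] + b[i] for i in [0, n).
void add_arrays(double *c, const double *a, const double *b, int64_t n) {
  for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {  // svcntd() = number of 64-bit lanes
    svbool_t pg = svwhilelt_b64_s64(i, n);              // predicate: lanes where i + lane < n
    svfloat64_t va = svld1_f64(pg, &a[i]);              // predicated loads
    svfloat64_t vb = svld1_f64(pg, &b[i]);
    svst1_f64(pg, &c[i], svadd_f64_x(pg, va, vb));      // predicated add and store
  }
}
```

The same binary then uses 128-bit, 256-bit, or 512-bit vectors depending on what the CPU implements, and the predicate also covers the tail iteration, a point the next paragraph returns to.
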
-Going back to the example in @lst:SIMD, if `N` equals 5, and we have a 256-bit vector, we cannot process all the elements in a single iteration. We can process the first four elements using a single SIMD instruction, but the 5th element needs to be processed individually. This is known as the *loop remainder*. Loop remainder is a portion of a loop that must process fewer elements than the vector width, requiring additional scalar code to handle the leftover elements. Scalable vector ISA extensions do not have this problem, as they can process any number of elements in a single instruction. Another solution to the loop remainder problem is to use *masking*, which allows selectively enable or disable SIMD lanes based on a condition. +Going back to the example in @lst:SIMD, if `N` equals 5, and we have a 256-bit vector, we cannot process all the elements in a single iteration. We can process the first four elements using a single SIMD instruction, but the 5th element needs to be processed individually. This is known as the *loop remainder*. Loop remainder is a portion of a loop that must process fewer elements than the vector width, requiring additional scalar code to handle the leftover elements. Scalable vector ISA extensions do not have this problem, as they can process any number of elements in a single instruction. Another solution to the loop remainder problem is to use *masking*, which allows selectively enabling or disabling SIMD lanes based on a condition. -Also, CPUs increasingly accelerate the matrix multiplications often used in machine learning. Intel's AMX extension, supported in server processors since 2023, multiplies 8-bit matrices of shape 16x64 and 64x16, accumulating into a 32-bit 16x16 matrix. By contrast, the unrelated but identically named AMX extension in Apple CPUs, as well as ARM's SME extension, computes outer products of a row and column, respectively stored in special 512-bit registers or scalable vectors. +Also, CPUs increasingly accelerate the matrix multiplications often used in machine learning. Intel's AMX extension, supported in server processors since 2023, multiplies 8-bit matrices of shape 16x64 and 64x16, accumulating into a 32-bit 16x16 matrix. By contrast, the unrelated but identically named AMX extension in Apple CPUs, as well as ARM's SME extension, compute outer products of a row and column, respectively stored in special 512-bit registers or scalable vectors. Initially, SIMD was driven by multimedia applications and scientific computations, but later found uses in many other domains. Over time, the set of operations supported in SIMD instruction sets has steadily increased. In addition to straightforward arithmetic as shown in Figure @fig:SIMD, newer use cases of SIMD include: diff --git a/chapters/3-CPU-Microarchitecture/3-5 Exploiting TLP.md b/chapters/3-CPU-Microarchitecture/3-5 Exploiting TLP.md index abddc98b6d..600c803b05 100644 --- a/chapters/3-CPU-Microarchitecture/3-5 Exploiting TLP.md +++ b/chapters/3-CPU-Microarchitecture/3-5 Exploiting TLP.md @@ -24,7 +24,7 @@ With an SMT2 implementation, each physical core is represented by two logical co Although two programs run on the same processor core, they are completely separated from each other. In an SMT-enabled processor, even though instructions are mixed, they have different contexts which helps preserve the correctness of execution. To support SMT, a CPU must replicate the architectural state (program counter, registers) to maintain thread context. Other CPU resources can be shared. 
In a typical implementation, cache resources are dynamically shared amongst the hardware threads. Resources to track OOO and speculative execution can either be replicated or partitioned. -In an SMT2 core, both logical cores are truly running at the same time. In the CPU front-end, they fetch instructions in an alternating order (every cycle or a few cycles). In the back-end, each cycle a processor selects instructions for execution from all threads. Instruction execution is mixed as the processor dynamically schedules execution units among both threads. +In an SMT2 core, both logical cores are truly running at the same time. In the CPU frontend, they fetch instructions in an alternating order (every cycle or a few cycles). In the backend, each cycle a processor selects instructions for execution from all threads. Instruction execution is mixed as the processor dynamically schedules execution units among both threads. So, SMT is a very flexible setup that makes it possible to recover unused CPU issue slots. SMT provides equal single-thread performance, in addition to its benefits for multiple threads. Modern multi-threaded CPUs support either two-way (SMT2) or four-way (SMT4). diff --git a/chapters/3-CPU-Microarchitecture/3-6 Memory Hierarchy.md b/chapters/3-CPU-Microarchitecture/3-6 Memory Hierarchy.md index 11d2fdeed1..0433bc7ec9 100644 --- a/chapters/3-CPU-Microarchitecture/3-6 Memory Hierarchy.md +++ b/chapters/3-CPU-Microarchitecture/3-6 Memory Hierarchy.md @@ -13,9 +13,9 @@ A cache is the first level of the memory hierarchy for any request (for code or Caches are organized as blocks with a defined size, also known as *cache lines*. The typical cache line size in modern CPUs is 64 bytes. However, the notable exception here is the L2 cache in Apple processors (such as M1, M2 and later), which operates on 128B cache lines. Caches closest to the execution pipeline typically range in size from 32 KB to 128 KB. Mid-level caches tend to have 1MB and above. Last-level caches in modern CPUs can be tens or even hundreds of megabytes. -#### Placement of Data within the Cache. +#### Placement of Data within the Cache. -The address for a request is used to access the cache. In *direct-mapped* caches, a given block address can appear only in one location in the cache and is defined by a mapping function shown below. +The address for a request is used to access the cache. In *direct-mapped* caches, a given block address can appear only in one location in the cache and is defined by a mapping function shown below. $$ \textrm{Number of Blocks in the Cache} = \frac{\textrm{Cache Size}}{\textrm{Cache Block Size}} $$ @@ -23,7 +23,7 @@ $$ \textrm{Direct mapped location} = \textrm{(block address) mod (Number of Blocks in the Cache )} $$ -In a *fully associative* cache, a given block can be placed in any location in the cache. +In a *fully associative* cache, a given block can be placed in any location in the cache. An intermediate option between direct mapping and fully associative mapping is a *set-associative* mapping. In such a cache, the blocks are organized as sets, typically each set containing 2, 4, 8 or 16 blocks. A given address is first mapped to a set. Within a set, the address can be placed anywhere, among the blocks in that set. A cache with m blocks per set is described as an m-way set-associative cache. The formulas for a set-associative cache are: $$ @@ -39,35 +39,35 @@ Here is another example of the cache organization of the Apple M1 processor. The #### Finding Data in the Cache. 
-Every block in the m-way set-associative cache has an address tag associated with it. In addition, the tag also contains state bits such as valid bits to indicate whether the data is valid. Tags can also contain additional bits to indicate access information, sharing information, etc., that will be described in later sections of this chapter. +Every block in the m-way set-associative cache has an address tag associated with it. In addition, the tag also contains state bits such as valid bits to indicate whether the data is valid. Tags can also contain additional bits to indicate access information, sharing information, etc., that will be described in later sections of this chapter. ![Address organization for cache lookup.](../../img/uarch/CacheLookup.png){#fig:CacheLookup width=90%} Figure @fig:CacheLookup shows how the address generated from the pipeline is used to check the caches. The lowest order address bits define the offset within a given block; the block offset bits (5 bits for 32-byte cache lines, 6 bits for 64-byte cache lines). The set is selected using the index bits based on the formulas described above. Once the set is selected, the tag bits are used to compare against all the tags in that set. If one of the tags matches the tag of the incoming request and the valid bit is set, a cache hit results. The data associated with that block entry (read out of the data array of the cache in parallel to the tag lookup) is provided to the execution pipeline. A cache miss occurs in cases where the tag is not a match. -#### Managing Misses. +#### Managing Misses. When a cache miss occurs, the controller must select a block in the cache to be replaced to allocate the address that incurred the miss. For a direct-mapped cache, since the new address can be allocated only in a single location, the previous entry mapping to that location is deallocated, and the new entry is installed in its place. In a set-associative cache, since the new cache block can be placed in any of the blocks of the set, a replacement algorithm is required. The typical replacement algorithm used is the LRU (least recently used) policy, where the block that was least recently accessed is evicted to make room for the new data. Another alternative is to randomly select one of the blocks as the victim block. -#### Managing Writes. +#### Managing Writes. Write accesses to caches are less frequent than data reads. Handling writes in caches is harder, and CPU implementations use various techniques to handle this complexity. Software developers should pay special attention to the various write caching flows supported by the hardware to ensure the best performance of their code. CPU designs use two basic mechanisms to handle writes that hit in the cache: * In a write-through cache, hit data is written to both the block in the cache and to the next lower level of the hierarchy. -* In a write-back cache, hit data is only written to the cache. Subsequently, lower levels of the hierarchy contain stale data. The state of the modified line is tracked through a dirty bit in the tag. When a modified cache line is eventually evicted from the cache, a write-back operation forces the data to be written back to the next lower level. +* In a write-back cache, hit data is only written to the cache. Subsequently, lower levels of the hierarchy contain stale data. The state of the modified line is tracked through a dirty bit in the tag. 
When a modified cache line is eventually evicted from the cache, a write-back operation forces the data to be written back to the next lower level. Cache misses on write operations can be handled in two ways: * In a *write-allocate or fetch on write miss* cache, the data for the missed location is loaded into the cache from the lower level of the hierarchy, and the write operation is subsequently handled like a write hit. -* If the cache uses a *no-write-allocate policy*, the cache miss transaction is sent directly to the lower levels of the hierarchy, and the block is not loaded into the cache. +* If the cache uses a *no-write-allocate policy*, the cache miss transaction is sent directly to the lower levels of the hierarchy, and the block is not loaded into the cache. Out of these options, most designs typically choose to implement a write-back cache with a write-allocate policy as both of these techniques try to convert subsequent write transactions into cache hits, without additional traffic to the lower levels of the hierarchy. Write-through caches typically use the no-write-allocate policy. -#### Other Cache Optimization Techniques. +#### Other Cache Optimization Techniques. -For a programmer, understanding the behavior of the cache hierarchy is critical to extracting performance from any application. From the perspective of the CPU pipeline, the latency to access any request is given by the following formula that can be applied recursively to all the levels of the cache hierarchy up to the main memory: +For a programmer, understanding the behavior of the cache hierarchy is critical to extracting performance from any application. From the perspective of the CPU pipeline, the latency to access any request is given by the following formula that can be applied recursively to all the levels of the cache hierarchy up to the main memory: $$ \textrm{Average Access Latency} = \textrm{Hit Time } + \textrm{ Miss Rate } \times \textrm{ Miss Penalty} $$ @@ -75,15 +75,15 @@ Hardware designers take on the challenge of reducing the hit time and miss penal #### Hardware and Software Prefetching. {#sec:HwPrefetch} -One method to avoid cache misses and subsequent stalls is to prefetch data into caches prior to when the pipeline demands it. The assumption is the time to handle the miss penalty can be mostly hidden if the prefetch request is issued sufficiently ahead in the pipeline. Most CPUs provide implicit hardware-based prefetching that is complemented by explicit software prefetching that programmers can control. +One method to avoid cache misses and subsequent stalls is to prefetch data into caches prior to when the pipeline demands it. The assumption is the time to handle the miss penalty can be mostly hidden if the prefetch request is issued sufficiently ahead in the pipeline. Most CPUs provide implicit hardware-based prefetching that is complemented by explicit software prefetching that programmers can control. Hardware prefetchers observe the behavior of a running application and initiate prefetching on repetitive patterns of cache misses. Hardware prefetching can automatically adapt to the dynamic behavior of an application, such as varying data sets, and does not require support from an optimizing compiler. Also, the hardware prefetching works without the overhead of additional address generation and prefetch instructions. However, hardware prefetching works for a limited set of commonly used data access patterns. -Software memory prefetching complements prefetching done by hardware. 
Developers can specify which memory locations are needed ahead of time via dedicated hardware instruction (see [@sec:memPrefetch]). Compilers can also automatically add prefetch instructions into the code to request data before it is required. Prefetch techniques need to balance between demand and prefetch requests to guard against prefetch traffic slowing down demand traffic. +Software memory prefetching complements prefetching done by hardware. Developers can specify which memory locations are needed ahead of time via dedicated hardware instruction (see [@sec:memPrefetch]). Compilers can also automatically add prefetch instructions into the code to request data before it is required. Prefetch techniques need to balance between demand and prefetch requests to guard against prefetch traffic slowing down demand traffic. ### Main Memory {#sec:UarchMainmemory} -Main memory is the next level of the hierarchy, downstream from the caches. Requests to load and store data are initiated by the Memory Controller Unit (MCU). In the past, this circuit was located in the north bridge chip on the motherboard. But nowadays, most processors have this component embedded, so the CPU has a dedicated memory bus connecting it to the main memory. +Main memory is the next level of the hierarchy, downstream from the caches. Requests to load and store data are initiated by the Memory Controller Unit (MCU). In the past, this circuit was located in the northbridge chip on the motherboard. But nowadays, most processors have this component embedded, so the CPU has a dedicated memory bus connecting it to the main memory. Main memory uses DRAM (Dynamic Random Access Memory) technology that supports large capacities at reasonable cost points. When comparing DRAM modules, people usually look at memory density and memory speed, along with its price of course. Memory density defines how much memory the module has measured in GB. Obviously, the more available memory the better as it is a precious resource used by the OS and applications. @@ -111,7 +111,7 @@ It is worth mentioning that DRAM chips require their memory cells to be refreshe A DRAM module is organized as a set of DRAM chips. Memory *rank* is a term that describes how many sets of DRAM chips exist on a module. For example, a single-rank (1R) memory module contains one set of DRAM chips. A dual-rank (2R) memory module has two sets of DRAM chips, therefore doubling the capacity of a single-rank module. Likewise, there are quad-rank (4R) and octa-rank (8R) memory modules available for purchase. -Each rank consists of multiple DRAM chips. Memory *width* defines how wide the bus of each DRAM chip is. And since each rank is 64-bits wide (or 72-bits wide for ECC RAM), it also defines the number of DRAM chips present within the rank. Memory width can be one of three values: `x4`, `x8` or `x16`, which define how wide the bus that goes to each chip. As an example, Figure @fig:Dram_ranks shows the organization of a 2Rx16 dual-rank DRAM DDR4 module, with a total of 2GB capacity. There are four chips in each rank, with a 16-bit wide bus. Combined, the four chips provide 64-bit output. The two ranks are selected one at a time through a rank-select signal. +Each rank consists of multiple DRAM chips. Memory *width* defines how wide the bus of each DRAM chip is. And since each rank is 64 bits wide (or 72 bits wide for ECC RAM), it also defines the number of DRAM chips present within the rank. 
Memory width can be one of three values: `x4`, `x8` or `x16`, which define the width of the bus that goes to each chip. As an example, Figure @fig:Dram_ranks shows the organization of a 2Rx16 dual-rank DRAM DDR4 module, with a total of 2GB capacity. There are four chips in each rank, with a 16-bit wide bus. Combined, the four chips provide 64-bit output. The two ranks are selected one at a time through a rank-select signal. ![Organization of a 2Rx16 dual-rank DRAM DDR4 module with a total capacity of 2GB.](../../img/uarch/DRAM_ranks.png){#fig:Dram_ranks width=90%} @@ -119,7 +119,7 @@ There is no direct answer as to whether the performance of single-rank or dual-r Going further, we can install multiple DRAM modules in a system to not only increase memory capacity but also memory bandwidth. Setups with multiple memory channels are used to scale up the communication speed between the memory controller and the DRAM. -A system with a single memory channel has a 64-bit wide data bus between the DRAM and memory controller. The multi-channel architectures increase the width of the memory bus, allowing DRAM modules to be accessed simultaneously. For example, the dual-channel architecture expands the width of the memory data bus from 64 bits to 128 bits, doubling the available bandwidth, see Figure @fig:Dram_channels. Notice, that each memory module, is still a 64-bit device, but we connect them differently. It is very typical nowadays for server machines to have four and eight memory channels. +A system with a single memory channel has a 64-bit wide data bus between the DRAM and memory controller. The multi-channel architectures increase the width of the memory bus, allowing DRAM modules to be accessed simultaneously. For example, the dual-channel architecture expands the width of the memory data bus from 64 bits to 128 bits, doubling the available bandwidth (see Figure @fig:Dram_channels). Notice that each memory module is still a 64-bit device, but we connect them differently. It is very typical nowadays for server machines to have four or eight memory channels. ![Organization of a dual-channel DRAM setup.](../../img/uarch/DRAM_channels.png){#fig:Dram_channels width=60%} @@ -130,7 +130,7 @@ $$ \textrm{Max. Memory Bandwidth} = \textrm{Data Rate } \times \textrm{ Bytes per cycle } $$ -For example, for a single-channel DDR4 configuration, with the data rate of 2400 MT/s and 64 bits (8 bytes) can be transferred each memory cycle, the maximum bandwidth equals `2400 * 8 = 19.2 GB/s`. Dual-channel or dual memory controller setups double the bandwidth to 38.4 GB/s. Remember though, those numbers are theoretical maximums that assume that a data transfer will occur at each memory clock cycle, which in fact never happens in practice. So, when measuring actual memory speed, you will always see a value lower than the maximum theoretical transfer bandwidth. +For example, for a single-channel DDR4 configuration with a data rate of 2400 MT/s and 64 bits (8 bytes) per transfer, the maximum bandwidth equals `2400 * 8 = 19.2 GB/s`. Dual-channel or dual memory controller setups double the bandwidth to 38.4 GB/s. Remember though, those numbers are theoretical maximums that assume that a data transfer will occur at each memory clock cycle, which in fact never happens in practice. So, when measuring actual memory speed, you will always see a value lower than the maximum theoretical transfer bandwidth. 
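
The peak-bandwidth arithmetic above generalizes to any channel count. Below is a small, hedged C sketch of that calculation; the helper name `peak_bw_gbs` and the configurations in `main` are illustrative examples rather than measured systems, and it assumes 8 bytes transferred per channel per memory transfer.

```c
#include <stdio.h>

// Theoretical peak = data rate (MT/s) * bytes per transfer * channels.
static double peak_bw_gbs(double mega_transfers_per_s, double bytes_per_transfer, int channels) {
  return mega_transfers_per_s * bytes_per_transfer * channels / 1000.0;  // MB/s -> GB/s
}

int main(void) {
  printf("DDR4-2400, 1 channel:  %.1f GB/s\n", peak_bw_gbs(2400, 8, 1));  // 19.2
  printf("DDR4-2400, 2 channels: %.1f GB/s\n", peak_bw_gbs(2400, 8, 2));  // 38.4
  printf("DDR5-4800, 8 channels: %.1f GB/s\n", peak_bw_gbs(4800, 8, 8));  // 307.2
  return 0;
}
```

As the text notes, measured bandwidth always comes in below these theoretical numbers.
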
To enable multi-channel configuration, you need to have a CPU and motherboard that support such an architecture and install an even number of identical memory modules in the correct memory slots on the motherboard. The quickest way to check the setup on Windows is by running a hardware identification utility like `CPU-Z` or `HwInfo`; on Linux, you can use the `dmidecode` command. Alternatively, you can run memory bandwidth benchmarks like Intel MLC or Stream. @@ -150,4 +150,4 @@ GDDR was primarily designed for graphics and nowadays it is used on virtually ev HBM is a new type of CPU/GPU memory that vertically stacks memory chips, also called 3D stacking. Similar to GDDR, HBM drastically shortens the distance data needs to travel to reach a processor. The main difference from DDR and GDDR is that the HBM memory bus is very wide: 1024 bits for each HBM stack. This enables HBM to achieve ultra-high bandwidth. The latest HBM3 standard supports up to 665 GB/s bandwidth per package. It also operates at a low frequency of 500 MHz and has a memory density of up to 48 GB per package. -A system with HBM onboard will be a good choice if you want to maximize data transfer throughput. However, at the time of writing, this technology is quite expensive. As GDDR is predominantly used in graphics cards, HBM may be a good option to accelerate certain workloads that run on a CPU. In fact, the first x86 general-purpose server chips with integrated HBM are now available. \ No newline at end of file +A system with HBM onboard will be a good choice if you want to maximize data transfer throughput. However, at the time of writing, this technology is quite expensive. As GDDR is predominantly used in graphics cards, HBM may be a good option to accelerate certain workloads that run on a CPU. In fact, the first x86 general-purpose server chips with integrated HBM are now available. diff --git a/chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md b/chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md index bc6fbb65d6..71869a266e 100644 --- a/chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md +++ b/chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md @@ -26,16 +26,16 @@ A search in a hierarchical page table could be expensive, requiring traversing t To lower memory access latency, the L1 cache lookup can be partially overlapped with the DTLB lookup thanks to a constraint on the cache associativity and size that allows the L1 set selection without the physical address.[^1] However, higher level caches (L2 and L3) - which are also typically Physically Indexed and Physically Tagged (PIPT) but cannot benefit from this optimization - therefore require the address translation before the cache lookup. -The TLB hierarchy keeps translations for a relatively large memory space. Still, misses in the TLB can be very costly. To speed up the handling of TLB misses, CPUs have a mechanism called a *hardware page walker*. Such a unit can perform a page walk directly in hardware by issuing the required instructions to traverse the page table, all without interrupting the kernel. This is the reason why the format of the page table is dictated by the CPU, to which OS’es have to comply. High-end processors have several hardware page walkers that can handle multiple TLB misses simultaneously. However, even with all the acceleration offered by modern CPUs, TLB misses still cause performance bottlenecks for many applications. +The TLB hierarchy keeps translations for a relatively large memory space. 
Still, misses in the TLB can be very costly. To speed up the handling of TLB misses, CPUs have a mechanism called a *hardware page walker*. Such a unit can perform a page walk directly in hardware by issuing the required instructions to traverse the page table, all without involving the OS kernel. This is why the format of the page table is dictated by the CPU, and operating systems must comply with it. High-end processors have several hardware page walkers that can handle multiple TLB misses simultaneously. However, even with all the acceleration offered by modern CPUs, TLB misses still cause performance bottlenecks for many applications.
### Huge Pages {#sec:ArchHugePages}
-Having a small page size makes it possible to manage the available memory more efficiently and reduce fragmentation. The drawback though is that it requires having more page table entries to cover the same memory region. Consider two page sizes: 4KB, which is a default on x86, and 2MB *huge page* size. For an application that operates on 10MB data, we need 2560 entries in the first case, and just 5 entries if we would map the address space onto huge pages. Those are named *Huge Pages* on Linux, *Super Pages* on FreeBSD, and *Large Pages* on Windows, but they all mean the same thing. Throughout the rest of this book, we will refer to them as Huge Pages.
+Having a small page size makes it possible to manage the available memory more efficiently and reduce fragmentation. The drawback, though, is that it requires more page table entries to cover the same memory region. Consider two page sizes: 4KB, which is the default on x86, and a 2MB *huge page* size. For an application that operates on 10MB of data, we need 2560 entries in the first case, but just 5 entries if we map the address space onto huge pages. Those are named *Huge Pages* on Linux, *Super Pages* on FreeBSD, and *Large Pages* on Windows, but they all mean the same thing. Throughout the rest of this book, we will refer to them as Huge Pages.
An example of an address that points to data within a huge page is shown in Figure @fig:HugePageVirtualAddress. Just like with a default page size, the exact address format when using huge pages is dictated by the hardware, but luckily we as programmers usually don't have to worry about it.
![Virtual address that points within a 2MB page.](../../img/uarch/HugePageVirtualAddress.png){#fig:HugePageVirtualAddress width=90%}
-Using huge pages drastically reduces the pressure on the TLB hierarchy since fewer TLB entries are required. It greatly increases the chance of a TLB hit. We will discuss how to use huge pages to reduce the frequency of TLB misses in [@sec:secDTLB] and [@sec:FeTLB]. The downsides of using huge pages are memory fragmentation and, in some cases, non-deterministic page allocation latency because it is harder for the operating system to manage large blocks of memory and to ensure effective utilization of available memory. To satisfy a 2MB huge page allocation request at runtime, an OS needs to find a contiguous chunk of 2MB. If this cannot be found, the OS needs to reorganize the pages, resulting in a longer allocation latency.
+Using huge pages drastically reduces the pressure on the TLB hierarchy since fewer TLB entries are required. It greatly increases the chance of a TLB hit. We will discuss how to use huge pages to reduce the frequency of TLB misses in [@sec:secDTLB] and [@sec:FeTLB].
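As a small preview of those later sections, the sketch below shows one way to explicitly request a 2MB huge page on Linux: an anonymous `mmap` with the `MAP_HUGETLB` flag. It is a minimal illustration that assumes huge pages have already been reserved (for example, via `/proc/sys/vm/nr_hugepages`); if the pool is empty, the call simply fails.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const std::size_t size = 2 * 1024 * 1024;  // one 2MB huge page
  // MAP_HUGETLB asks the Linux kernel to back the mapping with huge pages
  // from the pre-reserved pool; without a reservation, mmap returns MAP_FAILED.
  void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) {
    std::perror("mmap(MAP_HUGETLB)");
    return 1;
  }
  static_cast<char*>(p)[0] = 42;  // touch the page so it is actually backed
  munmap(p, size);
  return 0;
}
```

Linux can also back regular allocations with huge pages transparently (Transparent Huge Pages), which requires no code changes; both approaches come with the allocation-latency caveats described next.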
The downsides of using huge pages are memory fragmentation and, in some cases, nondeterministic page allocation latency because it is harder for the operating system to manage large blocks of memory and to ensure effective utilization of available memory. To satisfy a 2MB huge page allocation request at runtime, an OS needs to find a contiguous chunk of 2MB. If this cannot be found, the OS needs to reorganize the pages, resulting in a longer allocation latency. -[^1]: Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical - [https://stackoverflow.com/a/59288185](https://stackoverflow.com/a/59288185). \ No newline at end of file +[^1]: Minimum associativity for a PIPT L1 cache to also be VIPT, accessing a set without translating the index to physical - [https://stackoverflow.com/a/59288185](https://stackoverflow.com/a/59288185). diff --git a/chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md b/chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md index d83c5c2458..c2bf631b19 100644 --- a/chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md +++ b/chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md @@ -2,29 +2,29 @@ ## Modern CPU Design -To see how all the concepts we talked about in this chapter are used in practice, let's take a look at the implementation of Intel’s 12th-generation core, Goldencove, which became available in 2021. This core is used as the P-core inside the Alderlake and Sapphire Rapids platforms. Figure @fig:Goldencove_diag shows the block diagram of the Goldencove core. Notice, that this section only describes a single core, not the entire processor. So, we will skip the discussion about frequencies, core counts, L3 caches, core interconnects, memory latency and bandwidth, and other things. +To see how all the concepts we talked about in this chapter are used in practice, let's take a look at the implementation of Intel’s 12th-generation core, Goldencove, which became available in 2021. This core is used as the P-core inside the Alderlake and Sapphire Rapids platforms. Figure @fig:Goldencove_diag shows the block diagram of the Goldencove core. Notice that this section only describes a single core, not the entire processor. So, we will skip the discussion about frequencies, core counts, L3 caches, core interconnects, memory latency and bandwidth, and other things. ![Block diagram of a CPU Core in the Intel GoldenCove Microarchitecture.](../../img/uarch/goldencove_block_diagram.png){#fig:Goldencove_diag width=100%} -The core is split into an in-order front-end that fetches and decodes x86 instructions into $\mu$ops and a 6-wide superscalar, out-of-order backend. The Goldencove core supports 2-way SMT. It has a 32KB first-level instruction cache (L1 I-cache), and a 48KB first-level data cache (L1 D-cache). The L1 caches are backed up by a unified 1.25MB (2MB in server chips) L2 cache. The L1 and L2 caches are private to each core. At the end of this section, we also take a look at the TLB hierarchy. +The core is split into an in-order frontend that fetches and decodes x86 instructions into $\mu$ops and a 6-wide superscalar, out-of-order backend. The Goldencove core supports 2-way SMT. It has a 32KB first-level instruction cache (L1 I-cache), and a 48KB first-level data cache (L1 D-cache). The L1 caches are backed up by a unified 1.25MB (2MB in server chips) L2 cache. The L1 and L2 caches are private to each core. At the end of this section, we also take a look at the TLB hierarchy. 
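The cache sizes above are specific to Goldencove; to compare them with the machine you are running on, one quick option on Linux with glibc is the `sysconf` interface, as in the sketch below. These `_SC_LEVEL*` queries are a glibc extension and may report 0 on some systems, in which case `lscpu` or `/sys/devices/system/cpu/cpu0/cache/` are alternatives.

```cpp
#include <unistd.h>
#include <cstdio>

int main() {
  // glibc-specific queries; values are in bytes (0 or -1 if unknown).
  long l1i  = sysconf(_SC_LEVEL1_ICACHE_SIZE);
  long l1d  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
  long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);
  long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  std::printf("L1I: %ld KB, L1D: %ld KB, L2: %ld KB, cache line: %ld B\n",
              l1i / 1024, l1d / 1024, l2 / 1024, line);
  return 0;
}
```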
-### CPU Front-End {#sec:uarchFE}
+### CPU Frontend {#sec:uarchFE}
-The CPU Front-End consists of several functional units that fetch and decode instructions from memory. Its main purpose is to feed prepared instructions to the CPU Back-End, which is responsible for the actual execution of instructions.
+The CPU Frontend consists of several functional units that fetch and decode instructions from memory. Its main purpose is to feed prepared instructions to the CPU Back-End, which is responsible for the actual execution of instructions.
-Technically, instruction fetch is the first stage to execute an instruction. But once a program reaches a steady state, the branch predictor unit (BPU) steers the work of the CPU Front-End. That is indiciated with the arrow that goes from the BPU to the instruction cache. The BPU predicts the direction of all branch instructions and steers the next instruction fetch based on this prediction.
+Technically, instruction fetch is the first stage to execute an instruction. But once a program reaches a steady state, the branch predictor unit (BPU) steers the work of the CPU Frontend. That is indicated by the arrow that goes from the BPU to the instruction cache. The BPU predicts the target of all branch instructions and steers the next instruction fetch based on this prediction.
-The heart of the BPU is a branch target buffer (BTB) with 12K entries containing information about branches and their targets. This information is used by the prediction algorithms. Every cycle, the BPU generates the next fetch address and passes it to the CPU Front-End.
+The heart of the BPU is a branch target buffer (BTB) with 12K entries containing information about branches and their targets. This information is used by the prediction algorithms. Every cycle, the BPU generates the next fetch address and passes it to the CPU Frontend.
-The CPU Front-End fetches 32 bytes per cycle of x86 instructions from the L1 I-cache. This is shared among the two threads, so each thread gets 32 bytes every other cycle. These are complex, variable-length x86 instructions. First, the pre-decode stage determines and marks the boundaries of the variable instructions by inspecting the instruction. In x86, the instruction length can range from 1-byte to 15-byte instructions. This stage also identifies branch instructions. The pre-decode stage moves up to 6 instructions (also referred to as *macroinstructions*) to the Instruction Queue that is split between the two threads. The instruction queue also supports a macro-op fusion unit that detects that two macroinstructions can be fused into a single micro-operation ($\mu$op). This optimization saves bandwidth in the rest of the pipeline.
+The CPU Frontend fetches 32 bytes per cycle of x86 instructions from the L1 I-cache. This is shared among the two threads, so each thread gets 32 bytes every other cycle. These are complex, variable-length x86 instructions. First, the pre-decode stage determines and marks the boundaries of the variable-length instructions by inspecting the chunk. In x86, the instruction length can range from 1 to 15 bytes. This stage also identifies branch instructions. The pre-decode stage moves up to 6 instructions (also referred to as *macroinstructions*) to the Instruction Queue that is split between the two threads. The instruction queue also supports a macro-op fusion unit that detects when two macroinstructions can be fused into a single micro-operation ($\mu$op). This optimization saves bandwidth in the rest of the pipeline.
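To give a concrete picture of what macro-op fusion typically catches, consider the simple loop below. Compilers usually close such a loop with a compare-and-branch pair (e.g., `cmp`/`jne`), and on Intel cores that pair is normally fused into a single $\mu$op, so each iteration consumes one less decode and issue slot. The exact fusion rules are microarchitecture-specific, and the assembly in the comments is only a sketch of what a compiler might emit, not a guaranteed output.

```cpp
#include <cstdio>

// Summing an array: the loop back-edge is a classic macro-fusion candidate.
// A compiler might emit a loop body along these lines:
//   add  rax, qword ptr [rdi + rcx*8]   ; accumulate
//   inc  rcx                            ; advance the index
//   cmp  rcx, rsi                       ; \ usually fused into a single
//   jne  .loop                          ; / compare-and-branch uop
long sum(const long* a, long n) {
  long s = 0;
  for (long i = 0; i < n; ++i)
    s += a[i];
  return s;
}

int main() {
  long data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  std::printf("%ld\n", sum(data, 8));
  return 0;
}
```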
Later, up to six pre-decoded instructions are sent from the Instruction Queue to the decoder unit every cycle. The two SMT threads alternate every cycle to access this interface. The 6-way decoder converts the complex macro-Ops into fixed-length $\mu$ops. Decoded $\mu$ops are queued into the Instruction Decode Queue (IDQ), labeled as "$\mu$op Queue" on the diagram.
-A major performance-boosting feature of the Front-End is the $\mu$op Cache. Also, you could often see people call it Decoded Stream Buffer (DSB). The motivation is to cache the macro-ops to $\mu$ops conversion in a separate structure that works in parallel with the L1 I-cache. When the BPU generates a new address to fetch, the $\mu$op Cache is also checked to see if the $\mu$ops translations are already available. Frequently occurring macro-ops will hit in the $\mu$op Cache, and the pipeline will avoid repeating the expensive pre-decode and decode operations for the 32-byte bundle. The $\mu$op Cache can provide eight $\mu$ops per cycle and can hold up to 4K entries.
+A major performance-boosting feature of the Frontend is the $\mu$op Cache, also often called the Decoded Stream Buffer (DSB). The motivation is to cache the macro-op to $\mu$op conversion in a separate structure that works in parallel with the L1 I-cache. When the BPU generates a new address to fetch, the $\mu$op Cache is also checked to see if the $\mu$op translations are already available. Frequently occurring macro-ops will hit in the $\mu$op Cache, and the pipeline will avoid repeating the expensive pre-decode and decode operations for the 32-byte bundle. The $\mu$op Cache can provide eight $\mu$ops per cycle and can hold up to 4K entries.
Some very complicated instructions may require more $\mu$ops than decoders can handle. $\mu$ops for such instructions are served from the Microcode Sequencer (MSROM). Examples of such instructions include hardware operation support for string manipulation, encryption, synchronization, and others. Also, MSROM keeps the microcode operations to handle exceptional situations like branch misprediction (which requires a pipeline flush), floating-point assist (e.g., when an instruction operates with a denormalized floating-point value), and others. MSROM can push up to 4 $\mu$ops per cycle into the IDQ.
-The Instruction Decode Queue (IDQ) provides the interface between the in-order frontend and the out-of-order backend. The IDQ queues up the $\mu$ops in order and can hold 144 $\mu$ops per logical processor in single thread mode, or 72 $\mu$ops per thread when SMT is active. This is where the in-order CPU Front-End finishes and the out-of-order CPU Back-End starts.
+The Instruction Decode Queue (IDQ) provides the interface between the in-order frontend and the out-of-order backend. The IDQ queues up the $\mu$ops in order and can hold 144 $\mu$ops per logical processor in single-thread mode, or 72 $\mu$ops per thread when SMT is active. This is where the in-order CPU Frontend finishes and the out-of-order CPU Back-End starts.
### CPU Back-End {#sec:uarchBE}
@@ -36,16 +36,16 @@ The heart of the OOO engine is the 512-entry ReOrder Buffer (ROB). It serves a f
Second, the ROB allocates execution resources. When an instruction enters the ROB, a new entry is allocated and resources are assigned to it, mainly an execution unit and the destination physical register. The ROB can allocate up to 6 $\mu$ops per cycle.
-Third, the ROB tracks the speculative execution. When an instruction has finished its execution, its status gets updated and it stays there until the previous instructions also finished. It's done that way because instructions must retire in program order. Once an instruction retires, its ROB entry is deallocated and the results of the instruction become visible. The retiring stage is wider than the allocation: the ROB can retire 8 instructions per cycle.
+Third, the ROB tracks speculative execution. When an instruction has finished its execution, its status gets updated and it stays in the ROB until all previous instructions finish. It's done that way because instructions must retire in program order. Once an instruction retires, its ROB entry is deallocated and the results of the instruction become visible. The retiring stage is wider than the allocation: the ROB can retire 8 instructions per cycle.
There are certain operations that processors handle in a specific manner, often called idioms, which require no or less costly execution. Processors recognize such cases and allow them to run faster than regular instructions. Here are some such cases:
-* **Zeroing**: to assign zero to a register, compilers often use `XOR / PXOR / XORPS / XORPD` instructions, e.g., `XOR RAX, RAX`, which are preferred by compilers instead of the equivalent `MOV RAX, 0x0` instruction as the XOR encoding uses fewer encoding bytes. Such zeroing idioms are not executed as any other regular instruction and are resolved in the CPU front-end, which saves execution resources. The instruction later retires as usual.
+* **Zeroing**: to assign zero to a register, compilers often use `XOR / PXOR / XORPS / XORPD` instructions, e.g., `XOR RAX, RAX`, which are preferred over the equivalent `MOV RAX, 0x0` instruction because the XOR encoding uses fewer bytes. Such zeroing idioms are not executed like regular instructions; they are resolved in the CPU frontend, which saves execution resources. The instruction later retires as usual.
* **Move elimination**: similar to the previous one, register-to-register `mov` operations, e.g., `MOV RAX, RBX`, are executed with zero cycle delay.
* **NOP instruction**: `NOP` is often used for padding or alignment purposes. It simply gets marked as completed without allocating it to the reservation station.
* **Other bypasses**: CPU architects also optimized certain arithmetic operations. For example, multiplying any number by one will always yield the same number. The same goes for dividing any number by one. Multiplying any number by zero always yields zero, etc. Some CPUs can recognize such cases at runtime and execute them with shorter latency than regular multiplication or divide.
-The "Scheduler / Reservation Station" (RS) is the structure that tracks the availability of all resources for a given $\mu$op and dispatches the $\mu$op to an *execution port* once it is ready. Execution port is a pathway that connects the scheduler to its execution units. Each execution port may be connected to multiple execution units. When an instruction enters the RS, the scheduler starts tracking its data dependencies. Once all the source operands become available, the RS attempts to dispatch the $\mu$op to a free execution port. The RS has fewer entries[^4] than the ROB. It can dispatch up to 6 $\mu$ops per cycle.
+The "Scheduler / Reservation Station" (RS) is the structure that tracks the availability of all resources for a given $\mu$op and dispatches the $\mu$op to an *execution port* once it is ready.
An execution port is a pathway that connects the scheduler to its execution units. Each execution port may be connected to multiple execution units. When an instruction enters the RS, the scheduler starts tracking its data dependencies. Once all the source operands become available, the RS attempts to dispatch the $\mu$op to a free execution port. The RS has fewer entries[^4] than the ROB. It can dispatch up to 6 $\mu$ops per cycle. We repeated a part of the diagram that depicts the GoldenCove execution engine and Load-Store unit in Figure @fig:Goldencove_BE_LSU. There are 12 execution ports: @@ -58,7 +58,7 @@ We repeated a part of the diagram that depicts the GoldenCove execution engine a Instructions that require memory operations are handled by the Load-Store unit (ports 2, 3, 11, 4, 9, 7, and 8) that we will discuss in the next section. If an operation does not involve loading or storing data, then it will be dispatched to the execution engine (ports 0, 1, 5, 6, and 10). -For example, an `Integer Shift` operation can go only to either port 0 or 6, while a `Floating-Point Divide` operation can only be dispatched to port 0. In a situation when a scheduler has to dispatch two operations that require the same execution port, one of them will have to be dispatched in the next cycle. +For example, an `Integer Shift` operation can go only to either port 0 or 6, while a `Floating-Point Divide` operation can only be dispatched to port 0. In a situation when a scheduler has to dispatch two operations that require the same execution port, one of them will have to be delayed. The VEC/FP stack does floating-point scalar and *all* packed (SIMD) operations. For instance, ports 0, 1, and 5 can handle ALU operations of the following types: packed integer, packed floating-point, and scalar floating-point. Integer and Vector/FP register stacks are located separately. Operations that move values from the INT stack to VEC/FP and vice-versa (e.g., convert, extract or insert) incur additional penalties. @@ -66,25 +66,25 @@ The VEC/FP stack does floating-point scalar and *all* packed (SIMD) operations. The Load-Store Unit (LSU) is responsible for operations with memory. The Goldencove core can issue up to three loads (three 256-bit or two 512-bit) by using ports 2, 3, and 11. AGU stands for Address Generation Unit, which is required to access a memory location. It can also issue up to two stores (two 256-bit or one 512-bit) per cycle via ports 4, 9, 7, and 8. STD stands for Store Data. -Notice, that AGU is required for both load and store operations to perform dynamic address calculation. For example, in the instruction `vmovss DWORD PTR [rsi+0x4],xmm0`, the AGU will be responsible for calculating `rsi+0x4` address, which will be used to store data from xmm0. +Notice that the AGU is required for both load and store operations to perform dynamic address calculation. For example, in the instruction `vmovss DWORD PTR [rsi+0x4],xmm0`, the AGU will be responsible for calculating `rsi+0x4`, which will be used to store data from xmm0. -Once a load or a store leaves the scheduler, LSU is responsible for accessing the data. Load operations save the fetched value in a register. Store operations transfer value from a register to a location in memory. LSU has a Load Buffer (also known as Load Queue) and a Store Buffer (also known as Store Queue), their sizes are not disclosed.[^2] Both Load Buffer and Store Buffer receive operations at dispatch from the scheduler. 
+Once a load or a store leaves the scheduler, the LSU is responsible for accessing the data. Load operations save the fetched value in a register. Store operations transfer a value from a register to a location in memory. The LSU has a Load Buffer (also known as Load Queue) and a Store Buffer (also known as Store Queue); their sizes are not disclosed.[^2] Both Load Buffer and Store Buffer receive operations at dispatch from the scheduler.
When a memory load request comes, the LSU queries the L1 cache using a virtual address and looks up the physical address translation in the TLB. Those two operations are initiated simultaneously. The size of the L1 D-cache is 48KB. If both operations result in a hit, the load delivers data to the integer or floating-point register and leaves the Load Buffer. Similarly, a store would write the data to the data cache and exit the Store Buffer.
-In case of an L1 miss, the hardware initiates a query of the (private) L2 cache tags. The L2 cache comes in two variants: 1.25MB for client and 2MB for server processors. While the L2 cache is being queried, a 64-byte wide fill buffer (FB) entry is allocated, which will keep the cache line once it arrives. The Goldencove core has 16 fill buffers. As a way to lower the latency, a speculative query is sent to the L3 cache in parallel with the L2 cache lookup.
+In case of an L1 miss, the hardware initiates a query of the (private) L2 cache tags. While the L2 cache is being queried, a 64-byte wide fill buffer (FB) entry is allocated, which will keep the cache line once it arrives. The Goldencove core has 16 fill buffers. As a way to lower the latency, a speculative query is sent to the L3 cache in parallel with the L2 cache lookup.
If two loads access the same cache line, they will hit the same FB. Such two loads will be "glued" together and only one memory request will be initiated. LSU dynamically reorders operations, supporting both loads bypassing older loads and loads bypassing older non-conflicting stores. Also, LSU supports store-to-load forwarding when there is an older store containing all of the load's bytes, and the store's data has been produced and is available in the store queue.
In case the L2 miss is confirmed, the load continues to wait for the results of the L3 cache, which incurs much higher latency. From that point, the request leaves the core and enters the *uncore*, the term you may frequently see in profiling tools. The outstanding misses from the core are tracked in the Super Queue (SQ, not shown on the diagram), which can track up to 48 uncore requests. In a scenario of L3 miss, the processor begins to set up a memory access. Further details are beyond the scope of this chapter.
-When a store happens, in a general case, to modify a memory location, the processor needs to load the full cache line, change it, and then write it back to memory. If the address to write is not in the cache, it goes through a very similar mechanism as with loads to bring that data in. The store cannot be complete until the data is written to the cache hierarchy.
+In the general case, when a store modifies a memory location, the processor needs to load the full cache line, change it, and then write it back to memory. If the address to write is not in the cache, it goes through a very similar mechanism as with loads to bring that data in. The store cannot complete until the data is written to the cache hierarchy.
Of course, there are a few optimizations done for store operations as well.
First, if we're dealing with a store or multiple adjacent stores (also known as *streaming stores*) that modify an entire cache line, there is no need to read the data first as all of the bytes will be clobbered anyway. So, the processor will try to combine writes to fill an entire cache line. If this succeeds no memory read operation is needed at all. Second, write combining enables multiple stores to be assembled and written further out in the cache hierarchy as a unit. So, if multiple stores modify the same cache line, only one memory write will be issued to the memory subsystem. All these optimizations are done inside the Store Buffer. A store instruction copies the data that will be written from a register into the Store Buffer. From there it may be written to the L1 cache or it may be combined with other stores to the same cache line. The Store Buffer capacity is limited, so it can hold requests for partial writing to a cache line only for some time. However, while the data sits in the Store Buffer waiting to be written, other load instructions can read the data straight from the store buffers (store-to-load forwarding). -Finally, if we happen to read data before overwriting it, the cache line typically stays in the cache, displacing some other line. This behavior can be altered with the help of a *non-temporal* store, a special CPU instruction that will not keep the modified line in the cache. It is useful in situations when we know we will not need the data once we have changed it. Non-temporal stores help to utilize cache space more efficiently by not evicting other data that might be needed soon. +Finally, if we happen to read data before overwriting it, the cache line typically stays in the cache. This behavior can be altered with the help of a *non-temporal* store, a special CPU instruction that will not keep the modified line in the cache. It is useful in situations when we know we will not need the data once we have changed it. Non-temporal stores help to utilize cache space more efficiently by not evicting other data that might be needed soon. During a typical program execution, there could be dozens of memory accesses in flight. In most high-performance processors, the order of load and store operations is not necessarily required to be the same as the program order, which is known as a _weakly ordered memory model_. For optimization purposes, the processor can reorder memory read and write operations. Consider a situation when a load runs into a cache miss and has to wait until the data comes from memory. The processor allows subsequent loads to proceed ahead of the load that is waiting for the data. This allows later loads to finish before the earlier load and doesn't unnecessarily block the execution. Such load/store reordering enables the memory units to process multiple memory accesses in parallel, which translates directly into higher performance. @@ -115,21 +115,21 @@ It's worth mentioning that forwarding from a store to a load occurs in real code ### TLB Hierarchy -Recall from [@sec:TLBs] that translations from virtual to physical addresses are cached in the TLB. Golden Cove's TLB hierarchy is presented in Figure @fig:GLC_TLB. Similar to a regular data cache, it has two levels, where level 1 has separate instances for instructions (ITLB) and data (DTLB). L1 ITLB has 256 entries for regular 4K pages and covers the memory space of 256 * 4KB equals 1MB, while L1 DTLB has 96 entries that cover 384 KB. 
+Recall from [@sec:TLBs] that translations from virtual to physical addresses are cached in the TLB. Golden Cove's TLB hierarchy is presented in Figure @fig:GLC_TLB. Similar to a regular data cache, it has two levels, where level 1 has separate instances for instructions (ITLB) and data (DTLB). L1 ITLB has 256 entries for regular 4K pages and covers 1MB of memory, while L1 DTLB has 96 entries that cover 384 KB. ![TLB hierarchy of Intel's GoldenCove microarchitecture.](../../img/uarch/GLC_TLB_hierarchy.png){#fig:GLC_TLB width=60%} -The second level of the hierarchy (STLB) caches translations for both instructions and data. It is a larger storage for serving requests that missed in the L1 TLBs. L2 STLB can accommodate 2048 most recent data and instruction page address translations, which covers a total of 8MB of memory space. There are fewer entries available for 2MB huge pages: L1 ITLB has 32 entries, L1 DTLB has 32 entries, and L2 STLB can only use 1024 entries that are also shared with regular 4KB pages. +The second level of the hierarchy (STLB) caches translations for both instructions and data. It is a larger storage for serving requests that missed in the L1 TLBs. L2 STLB can accommodate 2048 recent data and instruction page address translations, which covers a total of 8MB of memory space. There are fewer entries available for 2MB huge pages: L1 ITLB has 32 entries, L1 DTLB has 32 entries, and L2 STLB can only use 1024 entries that are also shared with regular 4KB pages. -In case a translation was not found in the TLB hierarchy, it has to be retrieved from the DRAM by "walking" the kernel page tables. There is a mechanism for speeding up such scenarios, called hardware page walker. Recall that the page table is built as a radix tree of sub-tables, with each entry of the sub-table holding a pointer to the next level of the tree. +In case a translation was not found in the TLB hierarchy, it has to be retrieved from the DRAM by "walking" the kernel page tables. Recall that the page table is built as a radix tree of subtables, with each entry of the subtable holding a pointer to the next level of the tree. -The key element to speed up the page walk procedure is a set of Paging-Structure Caches[^3] that caches the hot entries in the page table structure. For the 4-level page table, we have the least significant twelve bits (11:0) for page offset (not translated), and bits 47:12 for the page number. While each entry in a TLB is an individual complete translation, Paging-Structure Caches cover only the upper 3 levels (bits 47:21). The idea is to reduce the number of loads required to execute in case of a TLB miss. For example, without such caches, we would have to execute 4 loads, which would add latency to the instruction completion. But with the help of the Paging-Structure Caches, if we find a translation for levels 1 and 2 of the address (bits 47:30), we only have to do the remaining 2 loads. +The key element to speed up the page walk procedure is a set of Paging-Structure Caches[^3] that cache the hot entries in the page table structure. For the 4-level page table, we have the least significant twelve bits (11:0) for page offset (not translated), and bits 47:12 for the page number. While each entry in a TLB is an individual complete translation, Paging-Structure Caches cover only the upper 3 levels (bits 47:21). The idea is to reduce the number of loads required to execute in case of a TLB miss. 
For example, without such caches, we would have to execute 4 loads, which would add latency to the instruction completion. But with the help of the Paging-Structure Caches, if we find a translation for levels 1 and 2 of the address (bits 47:30), we only have to do the remaining 2 loads. The Goldencove microarchitecture has four dedicated page walkers, which allows it to process 4 page walks simultaneously. In the event of a TLB miss, these hardware units will issue the required loads into the memory subsystem and populate the TLB hierarchy with new entries. The page-table loads generated by the page walkers can hit in L1, L2, or L3 caches (details are not disclosed). Finally, page walkers can anticipate a future TLB miss and speculatively do a page walk to update TLB entries before a miss actually happens. -The Goldencove specification doesn't disclose how resources are shared between two SMT threads. But in general, caches, TLBs and execution units are fully shared to improve the dynamic utilization of those resources. On the other hand, buffers for staging instructions between major pipe stages are either replicated or partitioned. These buffers include IDQ, ROB, RAT, RS, Load Buffer and Store Buffer. PRF is also replicated. +The Goldencove specification doesn't disclose how resources are shared between two SMT threads. But in general, caches, TLBs and execution units are fully shared to improve the dynamic utilization of those resources. On the other hand, buffers for staging instructions between major pipe stages are either replicated or partitioned. These buffers include IDQ, ROB, RAT, RS, Load Buffer and the Store Buffer. PRF is also replicated. [^1]: There are around 300 physical general-purpose registers (GPRs) and a similar number of vector registers. The actual number of registers is not disclosed. [^2]: Load Buffer and Store Buffer sizes are not disclosed, but people have measured 192 and 114 entries respectively. [^3]: AMD's equivalent is called Page Walk Caches. -[^4]: People have measured ~200 entries in the RS, however the actual number of registers is not disclosed. \ No newline at end of file +[^4]: People have measured ~200 entries in the RS, however the actual number of entries is not disclosed. diff --git a/chapters/3-CPU-Microarchitecture/3-9 PMU.md b/chapters/3-CPU-Microarchitecture/3-9 PMU.md index e9d5c63cf9..a225bfb706 100644 --- a/chapters/3-CPU-Microarchitecture/3-9 PMU.md +++ b/chapters/3-CPU-Microarchitecture/3-9 PMU.md @@ -2,7 +2,7 @@ ## Performance Monitoring Unit {#sec:PMU} -Every modern CPU provides facilities to monitor performance, which are combined into the Performance Monitoring Unit (PMU). This unit incorporates features that help developers to analyze the performance of their applications. An example of a PMU in a modern Intel CPU is provided in Figure @fig:PMU. Most modern PMUs have a set of Performance Monitoring Counters (PMC) that can be used to collect various performance events that happen during the execution of a program. Later in [@sec:counting], we will discuss how PMCs can be used for performance analysis. Also, the PMU has other features that enhance performance analysis, like LBR, PEBS, and PT, a topic to which [@sec:PmuChapter] is devoted. +Every modern CPU provides facilities to monitor performance, which are combined into the Performance Monitoring Unit (PMU). This unit incorporates features that help developers analyze the performance of their applications. An example of a PMU in a modern Intel CPU is provided in Figure @fig:PMU. 
Most modern PMUs have a set of Performance Monitoring Counters (PMC) that can be used to collect various performance events that happen during the execution of a program. Later in [@sec:counting], we will discuss how PMCs can be used for performance analysis. Also, the PMU has other features that enhance performance analysis, like LBR, PEBS, and PT, topics to which [@sec:PmuChapter] is devoted.
![Performance Monitoring Unit of a modern Intel CPU.](../../img/uarch/PMU.png){#fig:PMU width=100%}
@@ -30,7 +30,7 @@ If we imagine a simplified view of a processor, it may look something like what
![Simplified view of a CPU with a performance monitoring counter.](../../img/uarch/PMC.png){#fig:PMC width=60%}
-Typically, PMCs are 48-bit wide, which enables analysis tools to run for a long time without interrupting a program's execution.[^2] Performance counter is a hardware register implemented as a Model-Specific Register (MSR). That means the number of counters and their width can vary from model to model, and you cannot rely on the same number of counters in your CPU. You should always query that first, using tools like `cpuid`, for example. PMCs are accessible via the `RDMSR` and `WRMSR` instructions, which can only be executed from kernel space. Luckily, you only have to care about this if you're a developer of a performance analysis tool, like Linux perf or Intel VTune profiler. Those tools handle all the complexity of programming PMCs.
+Typically, PMCs are 48-bit wide, which enables analysis tools to run for a long time without interrupting a program's execution.[^2] A performance counter is a hardware register implemented as a Model-Specific Register (MSR). That means the number of counters and their width can vary from model to model, and you cannot rely on the same number of counters in your CPU. You should always query that first, using tools like `cpuid`, for example. PMCs are accessible via the `RDMSR` and `WRMSR` instructions, which can only be executed from kernel space. Luckily, you only have to care about this if you're a developer of a performance analysis tool, like Linux `perf` or the Intel VTune profiler. Those tools handle all the complexity of programming PMCs.
When engineers analyze their applications, it is common for them to collect the number of executed instructions and elapsed cycles. That is the reason why some PMUs have dedicated PMCs for collecting such events. Fixed counters always measure the same thing inside the CPU core. With programmable counters, it's up to the user to choose what they want to measure.
@@ -39,8 +39,8 @@ For example, in the Intel Skylake architecture (PMU version 4, see [@lst:QueryPM
It's not unusual for the PMU to provide more than one hundred events available for monitoring. Figure @fig:PMU shows just a small subset of the performance monitoring events available for monitoring on a modern Intel CPU. It's not hard to notice that the number of available PMCs is much smaller than the number of performance events. It's not possible to count all the events at the same time, but analysis tools solve this problem by multiplexing between groups of performance events during the execution of a program (see [@sec:secMultiplex]).
- For Intel CPUs, the complete list of performance events can be found in [@IntelOptimizationManual, Volume 3B, Chapter 20] or at [perfmon-events.intel.com](https://perfmon-events.intel.com/).
-- ADM doesn't publish a list of performance monitoring events for every AMD processor. Curious readers may find some information in the Linux perf source [code](https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c)[^3]. Also, you can list performance events available for monitoring using the AMD uProf command line tool. General information about AMD performance counters can be found in [@AMDProgrammingManual, 13.2 Performance Monitoring Counters].
+- AMD doesn't publish a list of performance monitoring events for every AMD processor. Curious readers may find some information in the Linux `perf` source [code](https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c)[^3]. Also, you can list performance events available for monitoring using the AMD uProf command line tool. General information about AMD performance counters can be found in [@AMDProgrammingManual, 13.2 Performance Monitoring Counters].
- For ARM chips, performance events are not so well defined. Vendors implement cores following an ARM architecture, but performance events vary widely, both in what they mean and what events are supported. For the ARM Neoverse V1 processor, that ARM designs themselves, the list of performance events can be found in [@ARMNeoverseV1].
[^2]: When the value of PMCs overflows, the execution of a program must be interrupted. The profiling tool then should save the fact of an overflow. We will discuss it in more detail in [@sec:sec_PerfApproaches].
-[^3]: Linux source code for AMD cores - [https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c](https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c)
\ No newline at end of file
+[^3]: Linux source code for AMD cores - [https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c](https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c)
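To connect this description to working code, below is a minimal Linux-only sketch that counts retired instructions and core cycles for a region of code via the `perf_event_open` system call, the same kernel interface on which Linux `perf` is built. It is an illustration rather than a production-quality harness: error handling is minimal, counter multiplexing and scaling are ignored, and depending on the `perf_event_paranoid` setting it may require elevated privileges.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// glibc has no wrapper for perf_event_open, so invoke the syscall directly.
static int open_counter(uint32_t type, uint64_t config) {
  perf_event_attr attr{};              // zero-initialize all fields
  attr.size = sizeof(attr);
  attr.type = type;
  attr.config = config;
  attr.disabled = 1;                   // start disabled; enable explicitly below
  attr.exclude_kernel = 1;             // count user-space activity only
  attr.exclude_hv = 1;
  return (int)syscall(SYS_perf_event_open, &attr, 0 /*this thread*/,
                      -1 /*any cpu*/, -1 /*no group*/, 0);
}

int main() {
  int fd_ins = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
  int fd_cyc = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
  if (fd_ins < 0 || fd_cyc < 0) { std::perror("perf_event_open"); return 1; }

  ioctl(fd_ins, PERF_EVENT_IOC_ENABLE, 0);
  ioctl(fd_cyc, PERF_EVENT_IOC_ENABLE, 0);

  volatile uint64_t sum = 0;           // the region we want to measure
  for (uint64_t i = 0; i < 10000000; ++i) sum += i;

  ioctl(fd_ins, PERF_EVENT_IOC_DISABLE, 0);
  ioctl(fd_cyc, PERF_EVENT_IOC_DISABLE, 0);

  uint64_t ins = 0, cyc = 0;
  read(fd_ins, &ins, sizeof(ins));
  read(fd_cyc, &cyc, sizeof(cyc));
  std::printf("instructions: %lu  cycles: %lu  IPC: %.2f\n",
              (unsigned long)ins, (unsigned long)cyc,
              cyc ? (double)ins / (double)cyc : 0.0);
  return 0;
}
```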