From 90ccf686e45c84e6e5b46f7ec0d80bec083f4e3e Mon Sep 17 00:00:00 2001
From: Denis Bakhvalov
Date: Sat, 14 Sep 2024 14:50:19 -0400
Subject: [PATCH] Update 3-3 Exploiting ILP.md

---
 chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md b/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md
index cba64356fc..04109953ee 100644
--- a/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md
+++ b/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md
@@ -10,7 +10,7 @@ An instruction is called *retired* after it is finally executed, and its results
 
 ![The concept of out-of-order execution.](../../img/uarch/OOO.png){#fig:OOO width=90%}
 
-Figure @fig:OOO illustrates the concept of out-of-order execution. Let's assume that instruction `x+1` cannot be executed during cycles 4 and 5 due to some conflict. An in-order CPU would stall all subsequent instructions from entering the EXE pipeline stage, so instruction `x+2` would begin executing only at cycle 7. In a CPU with OOO execution, an instruction can begin executing as long as it does not have any conflicts (e.g., its inputs are available, execution unit is not occupied, etc.). As you can see on the diagram, instruction `x+2` started executing before instruction `x+1`. All instructions still retire in order, i.e., the instructions complete the WB stage in the program order.
+Figure @fig:OOO illustrates the concept of out-of-order execution. Let's assume that instruction `x+1` cannot be executed during cycles 4 and 5 due to some conflict. An in-order CPU would stall all subsequent instructions from entering the EXE pipeline stage, so instruction `x+2` would begin executing only at cycle 7. In a CPU with OOO execution, an instruction can begin executing as long as it does not have any conflicts (e.g., its inputs are available, the execution unit is not occupied, etc.). As you can see on the diagram, instruction `x+2` started executing before instruction `x+1`. All instructions still retire in order, i.e., the instructions complete the WB stage in the program order.
 
 The process of reordering instructions is often called instruction *scheduling*. The goal of scheduling is to issue instructions in such a way as to minimize pipeline hazards and maximize the utilization of CPU resources. Instruction scheduling can be done at compile time (static scheduling), or at runtime (dynamic scheduling). Let's unpack both options.
 
@@ -34,11 +34,11 @@ From the ROB, instructions are inserted in the RS, which has much fewer entries.
 
 ### Superscalar Engines
 
-Most modern CPUs are superscalar, i.e. they can issue more than one instruction in a given cycle. Issue width is the maximum number of instructions that can be issued during the same cycle. The typical issue width of a mainstream CPU in 2024 ranges from 6 to 9. To ensure the right balance, such superscalar engines also have more than one execution unit and/or pipelined execution units. CPUs also combine superscalar capability with deep pipelines and out-of-order execution to extract the maximum ILP for a given piece of software.
+Most modern CPUs are superscalar, i.e., they can issue more than one instruction in a given cycle. Issue width is the maximum number of instructions that can be issued during the same cycle. The typical issue width of a mainstream CPU in 2024 ranges from 6 to 9. To ensure the right balance, such superscalar engines also have more than one execution unit and/or pipelined execution units. CPUs also combine superscalar capability with deep pipelines and out-of-order execution to extract the maximum ILP for a given piece of software.
 
 ![A pipeline diagram of a code executing in a 2-way superscalar CPU.](../../img/uarch/SuperScalar.png){#fig:SuperScalar width=65%}
 
-Figure @fig:SuperScalar shows a pipeline diagram of a CPU that supports 2-wide issue. Notice that two instructions can be processed in each stage of the pipeline every cycle. For example, both instructions `x` and `x+1` started their execution during cycle 3. This could be two instructions of the same type (e.g., two additions) or two different instructions (e.g., an addition and a branch). Superscalar processors replicate execution resources to keep instructions in the pipeline flowing through without conflicts. For instance, to support decoding of two instructions simultaneously, we need to have 2 independent decoders.
+Figure @fig:SuperScalar shows a pipeline diagram of a CPU that supports 2-wide issue. Notice that two instructions can be processed in each stage of the pipeline every cycle. For example, both instructions `x` and `x+1` started their execution during cycle 3. This could be two instructions of the same type (e.g., two additions) or two different instructions (e.g., an addition and a branch). Superscalar processors replicate execution resources to keep instructions in the pipeline flowing through without conflicts. For instance, to support the decoding of two instructions simultaneously, we need to have 2 independent decoders.
 
 ### Speculative Execution {#sec:SpeculativeExec}
 
@@ -82,7 +82,7 @@ Most prediction algorithms are based on previous outcomes of the branch. The cor
 
 Unconditional branches do not require prediction; we just need to look up the target address in the BTB. Every cycle the BPU needs to generate the next address from which to fetch instructions to avoid pipeline stalls. We could have extracted the address just from the instruction encoding itself, but then we have to wait until the decode stage is over, which will introduce a bubble in the pipeline and make things slower. So, the next fetch address has to be determined at the time when the branch is fetched.
 
-For conditional branches, we first need predict whether the branch will be taken or not. If it is not taken, then we fall through and there is no need to look up the target. Otherwise, we look up the target address in the BTB. Conditional branches usually account for the biggest portion of total branches and are the main source of misprediction penalties in production software. For indirect branches, we need to select one of the possible targets, but the prediction algorithm can be very similar to conditional branches.
+For conditional branches, we first need to predict whether the branch will be taken or not. If it is not taken, then we fall through and there is no need to look up the target. Otherwise, we look up the target address in the BTB. Conditional branches usually account for the biggest portion of total branches and are the main source of misprediction penalties in production software. For indirect branches, we need to select one of the possible targets, but the prediction algorithm can be very similar to conditional branches.
 
 All prediction mechanisms try to exploit two important principles, which are similar to what we will discuss with caches later:
 