diff --git a/chapters/3-CPU-Microarchitecture/3-0 CPU microarchitecture.md b/chapters/3-CPU-Microarchitecture/3-0 CPU microarchitecture.md index 569e287991..5bc809597c 100644 --- a/chapters/3-CPU-Microarchitecture/3-0 CPU microarchitecture.md +++ b/chapters/3-CPU-Microarchitecture/3-0 CPU microarchitecture.md @@ -2,4 +2,4 @@ # CPU Microarchitecture {#sec:uarch} -This chapter provides a brief summary of the critical CPU microarchitecture features that have a direct impact on software performance. The goal of this chapter is not to cover all the details and trade-offs of CPU architectures, which are already covered extensively in the literature [@Hennessy]. Instead, this chapter provides a quick recap of the CPU hardware features that are present in modern processors. +This chapter provides a brief summary of the critical CPU microarchitecture features that have a direct impact on software performance. The goal of this chapter is not to cover all the details and trade-offs of CPU architectures, which are already covered extensively in the literature [@Hennessy]. Instead, this chapter provides a recap of the CPU hardware features that are present in modern processors. diff --git a/chapters/3-CPU-Microarchitecture/3-1 ISA.md b/chapters/3-CPU-Microarchitecture/3-1 ISA.md index d873587924..152d242e3f 100644 --- a/chapters/3-CPU-Microarchitecture/3-1 ISA.md +++ b/chapters/3-CPU-Microarchitecture/3-1 ISA.md @@ -1,9 +1,9 @@ ## Instruction Set Architecture -The instruction set is the vocabulary used by software to communicate with the hardware. The instruction set architecture (ISA) defines the contract between the software and the hardware. Intel x86,[^1] ARM v8 and RISC-V are examples of current-day ISAs that are widely deployed. All of these are 64-bit architectures, i.e., all address computations use 64 bits. ISA developers and CPU architects typically ensure that software or firmware conforming to the specification will execute on any processor built using the specification. Widely deployed ISA franchises also typically ensure backward compatibility such that code written for the GenX version of a processor will continue to execute on GenX+i. +The instruction set architecture (ISA) is the contract between the software and the hardware, which defines the rules of communication. Intel x86-64,[^1] ARM v8 and RISC-V are examples of current-day ISAs that are widely deployed. All of these are 64-bit architectures, i.e., all address computations use 64 bits. ISA developers and CPU architects typically ensure that software or firmware conforming to the specification will execute on any processor built using the specification. Widely deployed ISAs also typically ensure backward compatibility such that code written for the GenX version of a processor will continue to execute on GenX+i. Most modern architectures can be classified as general-purpose register-based, load-store architectures, such as RISC-V and ARM where the operands are explicitly specified, and memory is accessed only using load and store instructions. The X86 ISA is a register-memory architecture, where operations can be performed on registers, as well as memory operands. In addition to providing the basic functions in an ISA such as load, store, control and scalar arithmetic operations using integers and floating-point, the widely deployed architectures continue to enhance their ISA to support new computing paradigms. 
These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE, RISC-V "V" vector extension) and matrix/tensor instructions (Intel AMX, ARM SME). Software mapped to use these advanced instructions typically provides orders of magnitude improvement in performance. -Modern CPUs support 32-bit and 64-bit precision for arithmetic operations. With the fast-evolving field of machine learning and AI, the industry has a renewed interest in alternative numeric formats for variables to drive significant performance improvements. Research has shown that machine learning models perform just as well, using fewer bits to represent the variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added support for lower precision data types such as 8-bit integers (int8, e.g., Intel VNNI), 16-bit floating-point (fp16, bf16) in the ISA, in addition to the traditional 32-bit and 64-bit formats for arithmetic operations. +Modern CPUs support 32-bit and 64-bit precision for floating-point and integer arithmetic operations. With the fast-evolving field of machine learning and AI, the industry has a renewed interest in alternative numeric formats for variables to drive significant performance improvements. Research has shown that machine learning models perform just as well when using fewer bits to represent variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added ISA support for lower precision data types, such as 8-bit integers (int8) and 16-bit floating-point (fp16, bf16), in addition to the traditional 32-bit and 64-bit formats for arithmetic operations. -[^1]: In the book we write x86, but mean x86-64, which is a 64-bit version of the x86 instruction set, first announced in 1999. \ No newline at end of file +[^1]: In the book we sometimes write x86 for brevity, but we mean x86-64, which is a 64-bit version of the x86 instruction set, first announced in 1999. \ No newline at end of file diff --git a/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md b/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md index 324b3e8202..6780729b9b 100644 --- a/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md +++ b/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md @@ -12,11 +12,11 @@ Pipelining is the foundational technique used to make CPUs fast wherein multiple Figure @fig:Pipelining shows an ideal pipeline view of the 5-stage pipeline CPU. In cycle 1, instruction x enters the IF stage of the pipeline. In the next cycle, as instruction x moves to the ID stage, the next instruction in the program enters the IF stage, and so on. Once the pipeline is full, as in cycle 5 above, all pipeline stages of the CPU are busy working on different instructions. Without pipelining, instruction `x+1` couldn't start its execution until after instruction `x` had finished its work. -Modern high-performance CPUs have multiple pipeline stages, often ranging from 10 to 20 or more, depending on the architecture and design goals. This involves a much more complicated design than a simple 5-stage pipeline introduced earlier. For example, the decode stage may be split into several new stages, we may add new stages before the execute stage to buffer decoded instructions and so on. +Modern high-performance CPUs have multiple pipeline stages, often ranging from 10 to 20 or more, depending on the architecture and design goals. This involves a much more complicated design than a simple 5-stage pipeline introduced earlier.
For example, the decode stage may be split into several stages, additional stages may be added before the execute stage to buffer decoded instructions, and so on. -The throughput of a pipelined CPU is defined as the number of instructions that complete and exit the pipeline per unit of time. The latency for any given instruction is the total time through all the stages of the pipeline. Since all the stages of the pipeline are linked together, each stage must be ready to move to the next instruction in lockstep. The time required to move an instruction from one stage to the next defines the basic machine cycle or clock for the CPU. The value chosen for the clock for a given pipeline is defined by the slowest stage of the pipeline. CPU hardware designers strive to balance the amount of work that can be done in a stage as this directly defines the frequency of operation of the CPU. Increasing the frequency improves performance and typically involves balancing and re-pipelining to eliminate bottlenecks caused by the slowest pipeline stages. +The *throughput* of a pipelined CPU is defined as the number of instructions that complete and exit the pipeline per unit of time. The *latency* for any given instruction is the total time through all the stages of the pipeline. Since all the stages of the pipeline are linked together, each stage must be ready to move to the next instruction in lockstep. The time required to move an instruction from one stage to the next defines the basic machine *cycle* or clock for the CPU. The value chosen for the clock for a given pipeline is defined by the slowest stage of the pipeline. CPU hardware designers strive to balance the amount of work that can be done in a stage as this directly affects the frequency of operation of the CPU. -In an ideal pipeline that is perfectly balanced and doesn’t incur any stalls, the time per instruction in the pipelined machine is given by +In an ideal pipeline that is perfectly balanced and doesn’t incur any stalls, the time per instruction in the pipelined machine is calculated as $$ \textrm{Time per instruction on pipelined machine} = \frac{\textrm{Time per instr. on nonpipelined machine}}{\textrm{Number of pipe stages}} $$ @@ -24,6 +24,8 @@ In real implementations, pipelining introduces several constraints that limit th \lstset{linewidth=10cm} * **Structural hazards**: are caused by resource conflicts. To a large extent, they could be eliminated by replicating the hardware resources, such as using multi-ported registers or memories. However, eliminating all such hazards could potentially become quite expensive in terms of silicon area and power. An example of a structural hazard is sketched right after this list of hazard types. * **Data hazards**: are caused by data dependencies in the program and are classified into three types: @@ -37,10 +39,10 @@ In real implementations, pipelining introduces several constraints that limit th There is a RAW dependency for register R1. If we take the value directly after addition `R0 ADD 1` is done (from the `EXE` pipeline stage), we don't need to wait until the `WB` stage finishes, and the value will be written to the register file. Bypassing helps to save a few cycles. The longer the pipeline, the more effective bypassing becomes. - A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read.
A WAR hazard is not a true dependency and is eliminated by a technique called [register renaming](https://en.wikipedia.org/wiki/Register_renaming).[^1] It is a technique that abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (architectural) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of the [architectural state](https://en.wikipedia.org/wiki/Architectural_state),[^3] solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example: + A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read. A WAR hazard is not a true dependency and is eliminated by a technique called *register renaming*, which abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (*architectural*) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of the *architectural state*, solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example: ``` - ; machine code ; after register renaming + ; machine code, WAR hazard ; after register renaming ; (architectural registers) ; (physical registers) R1 = R0 ADD 1 => R101 = R100 ADD 1 R0 = R2 ADD 2 R103 = R102 ADD 2 @@ -51,7 +53,7 @@ In real implementations, pipelining introduces several constraints that limit th A *write-after-write* (WAW) hazard requires a dependent write to execute after a write. It occurs when instruction `x+1` writes a source before instruction `x` writes to the source, resulting in the wrong order of writes. WAW hazards are also eliminated by register renaming, allowing both writes to execute in any order while preserving the correct final result. Below is an example of eliminating WAW hazards. ``` - ; machine code ; after register renaming + ; machine code, WAW hazard ; after register renaming (architectural registers) (physical registers) R2 = R1 => R102 = R101 R2 = 0 R103 = 0 @@ -61,7 +63,4 @@ In real implementations, pipelining introduces several constraints that limit th * **Control hazards**: are caused due to changes in the program flow. They arise from pipelining branches and other instructions that change the program flow. The branch condition that determines the direction of the branch (taken vs. not taken) is resolved in the execute pipeline stage. As a result, the fetch of the next instruction cannot be pipelined unless the control hazard is eliminated. Techniques such as dynamic branch prediction and speculative execution described in the next section are used to overcome control hazards. -\lstset{linewidth=\textwidth} - -[^1]: Register renaming - [https://en.wikipedia.org/wiki/Register_renaming](https://en.wikipedia.org/wiki/Register_renaming). -[^3]: Architectural state - [https://en.wikipedia.org/wiki/Architectural_state](https://en.wikipedia.org/wiki/Architectural_state).
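+
+Here is the sketch of a structural hazard promised earlier. It reuses the pseudocode notation from the WAR/WAW examples above and assumes a hypothetical CPU with a single, non-pipelined divider unit; the exact stall duration depends on the actual microarchitecture:
+
+```
+; machine code, structural hazard
+R1 = R2 DIV R3   ; occupies the only divider for several cycles
+R4 = R5 DIV R6   ; stalls: the divider is still busy with the previous instruction
+R7 = R0 ADD 1    ; independent, but in a simple in-order pipeline it waits behind the stalled DIV
+```
+
+Note that the two `DIV` instructions have no data dependency; they conflict only because they compete for the same hardware resource. Adding a second divider, or pipelining the existing one, removes the hazard, which is exactly the silicon area and power trade-off mentioned in the description of structural hazards.
+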
\ No newline at end of file +\lstset{linewidth=\textwidth} \ No newline at end of file diff --git a/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md b/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md index a4e1c057da..7e2fa73630 100644 --- a/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md +++ b/chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md @@ -6,13 +6,15 @@ Most instructions in a program lend themselves to be pipelined and executed in p The pipeline example in Figure @fig:Pipelining shows all instructions moving through the different stages of the pipeline in order, i.e., in the same order as they appear in the program, also known as *program order*. Most modern CPUs support *out-of-order* (OOO) execution, where sequential instructions can enter the execution stage in any arbitrary order only limited by their dependencies and resource availability. CPUs with OOO execution must still give the same result as if all instructions were executed in the program order. -An instruction is called *retired* after it is finally executed, and its results are correct and visible in the [architectural state](https://en.wikipedia.org/wiki/Architectural_state). To ensure correctness, CPUs must retire all instructions in the program order. OOO execution is primarily used to avoid underutilization of CPU resources due to stalls caused by dependencies, especially in superscalar engines described in the next section. +An instruction is called *retired* after it is finally executed, and its results are correct and visible in the architectural state. To ensure correctness, CPUs must retire all instructions in the program order. OOO execution is primarily used to avoid underutilization of CPU resources due to stalls caused by dependencies, especially in superscalar engines, which we will discuss shortly. ![The concept of out-of-order execution.](../../img/uarch/OOO.png){#fig:OOO width=80%} -Figure @fig:OOO details the concept underlying out-of-order execution with an example. Let's assume that instruction `x+1` cannot execute in cycles 4 and 5 due to some conflict. An in-order CPU would stall all subsequent instructions from entering the EXE pipeline stage. In a CPU with OOO execution, a subsequent instruction that does not have any conflicts (e.g., instruction `x+2`) can enter and complete its execution. All instructions still retire in order, i.e., the instructions complete the WB stage in the program order. +Figure @fig:OOO details the concept underlying out-of-order execution with an example. Let's assume that instruction `x+1` cannot execute in cycles 4 and 5 due to some conflict. An in-order CPU would stall all subsequent instructions from entering the EXE pipeline stage. In a CPU with OOO execution, a subsequent instruction that does not have any conflicts (e.g., instruction `x+2`) can enter and complete its execution. All instructions still retire in order, i.e., the instructions complete the WB stage in the program order. -Scheduling of these instructions can be done at compile time (static scheduling), or at runtime (dynamic scheduling). Let's unpack both. +This is where the performance benefit comes from: instead of idling while instruction `x+1` is stuck, the CPU does useful work for instruction `x+2`, keeping its execution units busy, so the whole sequence finishes sooner even though results are still committed in program order. + +Scheduling of these instructions can be done at compile time (static scheduling), or at runtime (dynamic scheduling). Let's unpack both options. #### Static scheduling @@ -24,13 +26,11 @@ The Intel Itanium never managed to become a success for a few reasons. One of th To overcome the problem with static scheduling, modern processors use dynamic scheduling.
The two most important algorithms for dynamic scheduling are [Scoreboarding](https://en.wikipedia.org/wiki/Scoreboarding),[^4] and the [Tomasulo algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm).[^5] -Scoreboarding was first implemented in the CDC6600 in the 1960s. Its main drawback is that it not only preserves true dependencies (RAW), but also false dependencies (WAW and WAR), and therefore it provides suboptimal ILP. False dependencies are caused by the small number of architectural registers, which is typically between 16 and 32 in modern ISAs. That is why all modern processors have adopted the Tomasulo algorithm for dynamic scheduling. The Tomasulo algorithm was invented in the 1960s by Robert Tomasulo and first implemented in the IBM360 model 91. - -To eliminate false dependencies, the Tomasulo algorithm makes use of register renaming, which we discussed in the previous section. Because of that, performance is greatly improved compared to scoreboarding. However, a sequence of instructions that carry a RAW dependency, also known as a *dependency chain*, is still problematic for OOO execution, because there is no increase in the ILP after register renaming since all the RAW dependencies are preserved. Dependency chains are often found in loops (loop carried dependency) where the current loop iteration depends on the results produced on the previous iteration. It gets even worse if many instructions depend on a load value that has not been found in caches. In that case, execution would stall waiting for that data to be fetched from memory. +Scoreboarding was first implemented in the CDC6600 processor in the 1960s. Its main drawback is that it not only preserves true dependencies (RAW), but also false dependencies (WAW and WAR), and therefore it provides suboptimal ILP. False dependencies are caused by the small number of architectural registers, which is typically between 16 and 32 in modern ISAs. That is why all modern processors have adopted the Tomasulo algorithm for dynamic scheduling. The Tomasulo algorithm was invented in the 1960s by Robert Tomasulo and first implemented in the IBM360 model 91. -Modern processors implement dynamic scheduling techniques that are derivatives of Tomasulo's original algorithm and include the Reorder Buffer (ROB) and the Reservation Station (RS). The ROB is a circular buffer that keeps track of the state of each instruction, and in modern processors, it has several hundred entries. Typically, the size of the ROB determines how far ahead the hardware can look for scheduling such independent instructions. Instructions are inserted in the ROB in program order, can execute out of order, and retire in program order. Register renaming is done when instructions are placed in the ROB. +To eliminate false dependencies, the Tomasulo algorithm makes use of register renaming, which we discussed in the previous section. Because of that, performance is greatly improved compared to scoreboarding. However, a sequence of instructions that carry a RAW dependency, also known as a *dependency chain*, is still problematic for OOO execution, because there is no increase in the ILP after register renaming since all the RAW dependencies are preserved. Dependency chains are often found in loops (loop carried dependency) where the current loop iteration depends on the results produced on the previous iteration. -Another feature provided by the ROB is *precise exceptions*. Instructions can run into exceptions e.g. division by zero. 
Because instructions can execute out of order, a later instruction might be executed before an earlier instruction which causes an exception. In such a scenario, the architectural state of the later instruction should not become visible. When an instruction runs into an exception, this event is recorded in the ROB entry for that instruction. When that instruction finally retires, the exception state is detected and this triggers flushing all later instructions from the ROB and then the exception handler is called. So it becomes clear where the exception occurred and no architectural state after the exception will become visible. +Modern processors implement dynamic scheduling techniques that are derivatives of Tomasulo's original algorithm and include the Reorder Buffer (ROB) and the Reservation Station (RS). The ROB is a circular buffer that keeps track of the state of each instruction, and in modern processors, it has several hundred entries. Typically, the size of the ROB determines how far ahead the hardware can look for independent instructions to schedule. Instructions are inserted in the ROB in program order, can execute out of order, and retire in program order. Register renaming is done when instructions are placed in the ROB. From the ROB, instructions are inserted in the RS, which has much fewer entries. Once instructions are in the RS, they wait for their input operands to become available. When inputs are available, instructions can be issued to the appropriate execution unit. So instructions can be executed in any order once their operands become available and are not tied to the program order any longer. Modern processors are becoming wider (can execute many instructions in one cycle) and deeper (larger ROB, RS, and other buffers), which demonstrates that there is a lot of potential to uncover more ILP in production applications. @@ -38,15 +38,17 @@ From the ROB, instructions are inserted in the RS, which has much fewer entries. ### Superscalar Engines Most modern CPUs are superscalar i.e., they can issue more than one instruction in a given cycle. Issue width is the maximum number of instructions that can be issued during the same cycle. The typical issue width of modern popular CPUs in 2024 ranges from 6 to 9. To ensure the right balance, such superscalar engines also have more than one execution unit and/or pipelined execution units. CPUs also combine superscalar capability with deep pipelines and out-of-order execution to extract the maximum ILP for a given piece of software. -![A pipeline diagram for a simple 2-way superscalar CPU.](../../img/uarch/SuperScalar.png){#fig:SuperScalar width=55%} +![A pipeline diagram of code executing in a 2-way superscalar CPU.](../../img/uarch/SuperScalar.png){#fig:SuperScalar width=55%} -Figure @fig:SuperScalar shows an example CPU that supports 2-wide issue width, i.e., in each cycle, two instructions are processed in each stage of the pipeline. Superscalar CPUs typically support multiple, independent execution units to keep the instructions in the pipeline flowing through without conflicts. In addition to pipelining, replicating execution units further increases the performance of a machine. +Figure @fig:SuperScalar shows a CPU that supports 2-wide issue, i.e., in each cycle, two instructions are processed in each stage of the pipeline. Superscalar CPUs typically support multiple, independent execution units to keep the instructions in the pipeline flowing through without conflicts; a sketch of two such independent instructions is given below.
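+
+Below is a minimal sketch of the kind of instruction pair shown in Figure @fig:SuperScalar. It uses the same pseudocode notation as the hazard examples and assumes the hypothetical CPU has two suitable execution units available in that cycle:
+
+```
+; two independent instructions, no shared registers or resources
+R1 = R2 ADD R3   ; issued to execution unit 1
+R4 = R5 MUL R6   ; issued to execution unit 2 in the same cycle
+```
+
+Because the two instructions are independent, a 2-way superscalar CPU can move them through every pipeline stage together. If the second instruction used `R1` as an input, it could not execute in the same cycle, and in a simple design the second issue slot of that cycle would go unused.
+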
In addition to pipelining, replicating execution units further increases the performance of a machine. ### Speculative Execution {#sec:SpeculativeExec} -As noted in the previous section, control hazards can cause significant performance loss in a pipeline if instructions are stalled until the branch condition is resolved. One technique to avoid this performance loss is hardware branch prediction logic to predict the likely direction of branches and allow executing instructions from the predicted path (speculative execution). +As noted in the previous section, control hazards can cause significant performance loss in a pipeline if instructions are stalled until the branch condition is resolved. One technique to avoid this performance loss is hardware *branch prediction*. Using this technique, a CPU predicts the likely direction of branches and allows executing instructions from the predicted path (known as *speculative execution*). -Let's consider the example in @lst:Speculative. For a processor to understand which function it should execute next, it should know whether the condition `a < b` is false or true. Without knowing that, the CPU waits until the result of the branch instruction is determined, as shown in Figure @fig:NoSpeculation. +Let's consider the example in @lst:Speculative. For a processor to understand which function it should execute next, it should know whether the condition `a < b` is false or true. Without knowing that, the CPU waits until the result of the branch instruction is determined, as shown in Figure @fig:NoSpeculation. Listing: Speculative execution @@ -61,12 +63,16 @@ if (a < b) else ![No speculation](../../img/uarch/Speculative1.png){#fig:NoSpeculation width=60%} -![Speculative execution](../../img/uarch/Speculative2.png){#fig:SpeculativeExec width=60%} +![Speculative execution. Speculative work is marked with *.](../../img/uarch/Speculative2.png){#fig:SpeculativeExec width=60%} The concept of speculative execution. -With speculative execution, the CPU guesses an outcome of the branch and initiates processing instructions from the chosen path. Suppose a processor predicted that condition `a < b` will be evaluated as true. It proceeded without waiting for the branch outcome and speculatively called function `foo` (see Figure @fig:SpeculativeExec, speculative work is marked with `*`). State changes to the machine cannot be committed until the condition is resolved to ensure that the architectural state of the machine is never impacted by speculatively executing instructions. In the example above, the branch instruction compares two scalar values, which is fast. But in reality, a branch instruction can be dependent on a value loaded from memory, which can take hundreds of cycles. If the prediction turns out to be correct, it saves a lot of cycles. However, sometimes the prediction is incorrect, and the function `bar` should be called instead. In such a case, the results from the speculative execution must be squashed and thrown away. This is called the branch misprediction penalty, which we will discuss in [@sec:BbMisp]. +With speculative execution, the CPU guesses an outcome of the branch and initiates processing instructions from the chosen path. Suppose a processor predicted that condition `a < b` will be evaluated as true. It proceeded without waiting for the branch outcome and speculatively called function `foo` (see Figure @fig:SpeculativeExec).
State changes to the machine cannot be committed until the condition is resolved to ensure that the architectural state of the machine is never impacted by speculatively executing instructions. + +In the example above, the branch instruction compares two scalar values, which is fast. But in reality, a branch instruction can be dependent on a value loaded from memory, which can take hundreds of cycles. For instance, if `a` is a value that must first be loaded from memory and the load misses in all levels of caches, the CPU can speculatively start executing `foo` long before the comparison `a < b` is able to complete. If the prediction turns out to be correct, it saves a lot of cycles. However, sometimes the prediction is incorrect, and the function `bar` should be called instead. In such a case, the results from the speculative execution must be squashed and thrown away. This is called the branch misprediction penalty, which we will discuss in [@sec:BbMisp]. An instruction that is executed speculatively is marked as such in the ROB. Once it is not speculative any longer, it can retire in program order. Here is where the architectural state is committed, and architectural registers are updated. Because the results of the speculative instructions are not committed, it is easy to roll back when a misprediction happens. @@ -76,11 +82,11 @@ As we just have seen, correct predictions greatly improve execution as they allo * **Unconditional jumps and direct calls**: they are the easiest to predict as they are always taken and go in the same direction every time. * **Conditional branches**: they have two potential outcomes: taken or not taken. Taken branches can go forward or backward. Forward conditional branches are usually generated for `if-else` statements, which have a high chance of not being taken, as they frequently represent error-checking code. Backward conditional jumps are frequently seen in loops and are used to go to the next iteration of a loop; such branches are usually taken. -* **Indirect calls and jumps**: they have many targets. An indirect jump or indirect call can be generated for a `switch` statement, a function pointer, or a `virtual` function. A return from a function deserves attention because it has many potential targets as well. +* **Indirect calls and jumps**: they have many targets. An indirect jump or indirect call can be generated for a `switch` statement, a function pointer, or a `virtual` function call. A return from a function deserves attention because it has many potential targets as well. Most prediction algorithms are based on previous outcomes of the branch. The core of the branch prediction unit (BPU) is a branch target buffer (BTB), which caches the target addresses for every branch. Prediction algorithms consult the BTB every cycle to generate the next address from which to fetch instructions. The CPU uses that new address to fetch the next block of instructions. If no branches are identified in the current fetch block, the next address to fetch will be the next sequential aligned fetch block (fall through). -Unconditional branches do not require prediction; we just need to look up the target address in the BTB. Remember, every cycle the BPU needs to generate the next address from which to fetch instructions to avoid pipeline stalls. We could have extracted the address just from the instruction encoding itself, but then we have to wait until the decode stage is over, which will introduce a bubble in the pipeline and make things slower. So, the next fetch address has to be determined at the time when the branch is fetched.
+Unconditional branches do not require prediction; we just need to look up the target address in the BTB. Every cycle the BPU needs to generate the next address from which to fetch instructions to avoid pipeline stalls. We could have extracted the address just from the instruction encoding itself, but then we have to wait until the decode stage is over, which will introduce a bubble in the pipeline and make things slower. So, the next fetch address has to be determined at the time when the branch is fetched. For conditional branches, we first need to predict whether the branch will be taken or not. If it is not taken, then we fall through and there is no need to look up the target. Otherwise, we look up the target address in the BTB. Conditional branches usually account for the biggest portion of total branches and are the main source of misprediction penalties in production software. For indirect branches, we need to select one of the possible targets, but the prediction algorithm can be very similar to conditional branches. diff --git a/chapters/3-CPU-Microarchitecture/3-4 SIMD.md b/chapters/3-CPU-Microarchitecture/3-4 SIMD.md index f3f3d81ecf..e8407a6edf 100644 --- a/chapters/3-CPU-Microarchitecture/3-4 SIMD.md +++ b/chapters/3-CPU-Microarchitecture/3-4 SIMD.md @@ -1,8 +1,8 @@ ## SIMD Multiprocessors {#sec:SIMD} -Another variant of multiprocessing that is widely used for many workloads is referred to as Single Instruction Multiple Data (SIMD). As the name indicates, in SIMD processors, a single instruction operates on many data elements in a single cycle using many independent functional units. Operations on vectors and matrices lend themselves well to SIMD architectures as every element of a vector or matrix can be processed using the same instruction. A SIMD architecture enables more efficient processing of a large amount of data and works best for data-parallel applications that involve vector operations. +Another technique to facilitate parallel processing is called Single Instruction Multiple Data (SIMD), which is used in nearly all high-performance processors. As the name indicates, in a SIMD processor, a single instruction operates on many data elements in a single cycle using many independent functional units. Operations on vectors and matrices lend themselves well to SIMD architectures as every element of a vector or matrix can be processed using the same instruction. A SIMD architecture enables more efficient processing of a large amount of data and works best for data-parallel applications that involve vector operations. -Figure @fig:SIMD shows scalar and SIMD execution modes for the code in @lst:SIMD. In a traditional SISD (Single Instruction, Single Data) mode, the addition operation is separately applied to each element of arrays `a` and `b`. However, in SIMD mode, addition is applied to multiple elements at the same time. If we target a CPU architecture that has execution units capable of performing operations on 256-bit vectors, we can process four double-precision elements with a single instruction. This leads to issuing 4x fewer instructions and can potentially gain a 4x speedup over four scalar computations. But in practice, performance benefits are not so straightforward for various reasons. +Figure @fig:SIMD shows scalar and SIMD execution modes for the code in @lst:SIMD. In a traditional Single Instruction Single Data (SISD) mode, also known as *scalar* mode, the addition operation is separately applied to each element of arrays `a` and `b`. 
However, in SIMD mode, addition is applied to multiple elements at the same time. If we target a CPU architecture that has execution units capable of performing operations on 256-bit vectors, we can process four double-precision elements with a single instruction. This leads to issuing 4x fewer instructions and can potentially gain a 4x speedup over four scalar computations. Listing: SIMD execution @@ -19,15 +19,15 @@ For regular SISD instructions, processors utilize general-purpose registers. Sim A vector execution unit is logically divided into *lanes*. In the context of SIMD, a lane refers to a distinct data pathway within the SIMD execution unit and processes one element of the vector. In our example, each lane processes 64-bit elements (double-precision), so there will be 4 lanes in a 256-bit register. -Most of the popular CPU architectures feature vector instructions, including x86, PowerPC, Arm, and RISC-V. In 1996 Intel released MMX, a SIMD instruction set, that was designed for multimedia applications. Following MMX, Intel introduced new instruction sets with added capabilities and increased vector size: SSE, AVX, AVX2, AVX-512. Arm has optionally supported the 128-bit NEON instruction set in various versions of its architecture. In version 8 (aarch64), this support was made mandatory, and new instructions were added. +Most of the popular CPU architectures feature vector instructions, including x86, PowerPC, ARM, and RISC-V. In 1996 Intel released MMX, a SIMD instruction set, that was designed for multimedia applications. Following MMX, Intel introduced new instruction sets with added capabilities and increased vector size: SSE, AVX, AVX2, AVX-512. Arm has optionally supported the 128-bit NEON instruction set in various versions of its architecture. In version 8 (aarch64), this support was made mandatory, and new instructions were added. As the new instruction sets became available, work began to make them usable to software engineers. The software changes required to exploit SIMD instructions are known as *code vectorization*. Initially, SIMD instructions were programmed in assembly. Later, special compiler intrinsics, which are small functions providing a one-to-one mapping to SIMD instructions, were introduced. Today all the major compilers support autovectorization for the popular processors, i.e., they can generate SIMD instructions straight from high-level code written in C/C++, Java, Rust and other languages. -To enable code to run on systems that support different vector lengths, Arm introduced the SVE instruction set. Its defining characteristic is the concept of *scalable vectors*: their length is unknown at compile time. With SVE, there is no need to port software to every possible vector length. Users don't have to recompile the source code of their applications to leverage wider vectors when they become available in newer CPU generations. Another example of scalable vectors is the RISC-V V extension (RVV), which was ratified in late 2021. Some implementations support quite wide (2048-bit) vectors, and up to eight can be grouped together to yield `16,384`-bit vectors, which greatly reduces the number of instructions executed. At each loop iteration, user code typically does `ptr += number_of_lanes`, where `number_of_lanes` is not known at compile time. ARM SVE provides special instructions for such length-dependent operations, while RVV enables a programmer to query/set the `number_of_lanes`. 
+To enable code to run on systems that support different vector lengths, Arm introduced the SVE instruction set. Its defining characteristic is the concept of *scalable vectors*: their length is unknown at compile time. With SVE, there is no need to port software to every possible vector length. Users don't have to recompile the source code of their applications to leverage wider vectors when they become available in newer CPU generations. Another example of scalable vectors is the RISC-V V extension (RVV), which was ratified in late 2021. Some implementations support quite wide (2048-bit) vectors, and up to eight can be grouped together to yield 16384-bit vectors, which greatly reduces the number of instructions executed. At each loop iteration, user code typically does `ptr += number_of_lanes`, where `number_of_lanes` is not known at compile time. ARM SVE provides special instructions for such length-dependent operations, while RVV enables a programmer to query/set the `number_of_lanes`. Going back to the example in @lst:SIMD, if `N` equals 5, and we have a 256-bit vector, we cannot process all the elements in a single iteration. We can process the first four elements using a single SIMD instruction, but the 5th element needs to be processed individually. This is known as the *loop remainder*. Loop remainder is a portion of a loop that must process fewer elements than the vector width, requiring additional scalar code to handle the leftover elements. Scalable vector ISA extensions do not have this problem, as they can process any number of elements in a single instruction. Another solution to the loop remainder problem is to use *masking*, which allows selectively enable or disable SIMD lanes based on a condition. -Also, CPUs increasingly accelerate the matrix multiplications often used in machine learning. Intel's AMX extension, supported in Sapphire Rapids, multiplies 8-bit matrices of shape 16x64 and 64x16, accumulating into a 32-bit 16x16 matrix. By contrast, the unrelated but identically named AMX extension in Apple CPUs, as well as ARM's SME extension, computes outer products of a row and column, respectively stored in special 512-bit registers or scalable vectors. +Also, CPUs increasingly accelerate the matrix multiplications often used in machine learning. Intel's AMX extension, supported in server processors since 2023, multiplies 8-bit matrices of shape 16x64 and 64x16, accumulating into a 32-bit 16x16 matrix. By contrast, the unrelated but identically named AMX extension in Apple CPUs, as well as ARM's SME extension, computes outer products of a row and column, respectively stored in special 512-bit registers or scalable vectors. Initially, SIMD was driven by multimedia applications and scientific computations, but later found uses in many other domains. Over time, the set of operations supported in SIMD instruction sets has steadily increased. In addition to straightforward arithmetic as shown in Figure @fig:SIMD, newer use cases of SIMD include: diff --git a/chapters/3-CPU-Microarchitecture/3-5 Exploiting TLP.md b/chapters/3-CPU-Microarchitecture/3-5 Exploiting TLP.md index 5ef036d57b..abddc98b6d 100644 --- a/chapters/3-CPU-Microarchitecture/3-5 Exploiting TLP.md +++ b/chapters/3-CPU-Microarchitecture/3-5 Exploiting TLP.md @@ -1,22 +1,22 @@ ## Exploiting Thread-Level Parallelism -Techniques described previously rely on the available parallelism in a program to speed up execution. 
In addition to that, CPUs support techniques to exploit parallelism across processes and/or threads executing on the CPU. Next, we will discuss three techniques to exploit Thread-Level Parallelism (TLP): multicore systems, simultaneous multithreading and hybrid architectures. Such techniques make it possible to eke out the most efficiency from the available hardware resources and to improve the throughput of the system. +Techniques described previously rely on the available parallelism in a program to speed up execution. In addition to that, CPUs support techniques to exploit parallelism across processes and/or threads executing on a CPU. Next, we will discuss three techniques to exploit Thread-Level Parallelism (TLP): multicore systems, simultaneous multithreading and hybrid architectures. Such techniques make it possible to eke out the most efficiency from the available hardware resources and to improve the throughput of the system. ### Multicore Systems As processor architects began to reach the practical limitations of semiconductor design and fabrication, the GHz race slowed down and designers had to focus on other innovations to improve CPU performance. One of the key directions was the multicore design that attempted to increase core counts for each generation. The idea was to replicate multiple processor cores on a single chip and let them serve different programs at the same time. For example, one of the cores could run a web browser, another core could render a video, and yet another could play music, all at the same time. For a server machine, requests from different clients could be processed on separate cores, which could greatly increase the throughput of such a system. -The first consumer-focused dual-core processor was the Intel Core 2 Duo, released in 2005, which was followed by the AMD Athlon X2 architecture released later that same year. Multicore systems caused many software components to be redesigned and affected the way we write code. These days nearly all processors in consumer-facing devices are multicore CPUs. At the time of writing this book, high-end laptops contain more than ten physical cores and server processors contain almost 100 cores. +The first consumer-focused dual-core x86 processors, the Intel Pentium D and the AMD Athlon 64 X2, were released in 2005; Intel's Core 2 Duo followed in 2006. Multicore systems caused many software components to be redesigned and affected the way we write code. These days nearly all processors in consumer-facing devices are multicore CPUs. At the time of writing this book, high-end laptops contain more than ten physical cores and server processors contain more than 100 cores on a single socket. It may sound very impressive, but we cannot add cores infinitely. First of all, each core generates heat when it's working and safely dissipating that heat from the cores through the processor package remains a challenge. This means that when more cores are running, heat can quickly exceed cooling capability. In such a situation, multicore processors will reduce clock speeds. This is one of the reasons you can see server chips with a large number of cores having much lower frequencies than processors that go into laptops and desktops. -Cores in a multicore system are connected to each other and to shared resources, such as last-level cache and memory controllers. Such a communication channel is called an *interconnect*, which frequently has either a ring or a mesh topology.
Another challenge for CPU designers is to keep the machine balanced as the core counts get higher. When you replicate cores, some resources remain shared, for example, memory buses and last-level cache. This results in diminishing returns to performance as cores are added, unless you also address the throughput of other shared resources, e.g., interconnect bandwidth, last-level cache size and bandwidth, and memory bandwidth. Shared resources frequently become the source of performance issues in a multicore system. +Cores in a multicore system are connected to each other and to shared resources, such as last-level cache and memory controllers. Such a communication channel is called an *interconnect*, which frequently has either a ring or a mesh topology. Another challenge for CPU designers is to keep the machine balanced as the core count grows. When you replicate cores, some resources remain shared, for example, memory buses and last-level cache. This results in diminishing returns to performance as cores are added, unless you also address the throughput of other shared resources, e.g., interconnect bandwidth, last-level cache size and bandwidth, and memory bandwidth. Shared resources frequently become the source of performance issues in a multicore system. ### Simultaneous Multithreading A more sophisticated approach to improve multithreaded performance is Simultaneous Multithreading (SMT). Very frequently people use the term *Hyperthreading* to describe the same thing. The goal of the technique is to fully utilize the available width of the CPU pipeline. SMT allows multiple software threads to run simultaneously on the same physical core using shared resources. More precisely, instructions from multiple software threads execute concurrently in the same cycle. Those don't have to be threads from the same process; they can be completely different programs that happened to be scheduled on the same physical core. -An example of execution on a non-SMT and an SMT2 processor is shown in Figure @fig:SMT. In both cases, the width of the processor pipeline is four, and each slot represents an opportunity to issue a new instruction. 100% machine utilization is when there are no unused slots, which never happens in real workloads. It's easy to see that for the non-SMT case, there are many unused slots, so the available resources are not utilized well. This may happen for a variety of reasons; one common reason is a cache miss. At cycle 3, thread 1 cannot make forward progress because it is waiting for data to arrive. SMT processors take this opportunity to schedule useful work from another thread. The goal here is to occupy unused slots by another thread to hide memory latency and improve hardware utilization and multithreaded performance. +An example of execution on a non-SMT and a 2-way SMT (SMT2) processor is shown in Figure @fig:SMT. In both cases, the width of the processor pipeline is four, and each slot represents an opportunity to issue a new instruction. 100% machine utilization is when there are no unused slots, which never happens in real workloads. It's easy to see that for the non-SMT case, there are many unused slots, so the available resources are not utilized well. This may happen for a variety of reasons. For instance, at cycle 3, thread 1 cannot make forward progress because all instructions are waiting for their inputs to become available. Non-SMT processors would simply stall, while SMT-enabled processors take this opportunity to schedule useful work from another thread. 
The goal here is to occupy unused slots by another thread to improve hardware utilization and multithreaded performance. ![Execution on a 4-wide non-SMT and a 4-wide SMT2 processor.](../../img/uarch/SMT.png){#fig:SMT width=90%} @@ -36,7 +36,7 @@ There is also a security concern with certain simultaneous multithreading implem ### Hybrid Architectures -Computer architects also developed a hybrid CPU design in which two (or more) types of cores are put in the same processor. Typically, more powerful cores are coupled with relatively slower cores to address different goals. In such a system, big cores are used for latency-sensitive tasks and small cores provide reduced power consumption. But also, both types of cores can be utilized at the same time to improve multithreaded performance. All cores have access to the same memory, so workloads can migrate from big to small cores and back on the fly. The intention is to create a multicore processor that can adapt better to dynamic computing needs and use less power. For example, video games have parts of single-core burst performance as well as parts where they can scale to many cores. +Computer architects also developed a hybrid CPU design in which two (or more) types of cores are put in the same processor. Typically, more powerful cores are coupled with relatively slower cores to address different goals. In such a system, big cores are used for latency-sensitive tasks and small cores provide reduced power consumption. But also, both types of cores can be utilized at the same time to improve multithreaded performance. All cores have access to the same memory, so workloads can migrate from big to small cores and back on the fly. The intention is to create a multicore processor that can adapt better to dynamic computing needs and use less power. For example, video games have portions of single-threaded bursty execution as well as portions of work that can be scaled to many cores. The first mainstream hybrid architecture was ARM's big.LITTLE, which was introduced in October 2011. Other vendors followed this approach. Apple introduced its M1 chip in 2020 which has four high-performance "Firestorm" and four energy-efficient "Icestorm" cores. Intel introduced its Alderlake hybrid architecture in 2021 with eight P- and eight E-cores in the top configuration. diff --git a/chapters/3-CPU-Microarchitecture/3-6 Memory Hierarchy.md b/chapters/3-CPU-Microarchitecture/3-6 Memory Hierarchy.md index 2a9e099cd0..11d2fdeed1 100644 --- a/chapters/3-CPU-Microarchitecture/3-6 Memory Hierarchy.md +++ b/chapters/3-CPU-Microarchitecture/3-6 Memory Hierarchy.md @@ -1,6 +1,6 @@ ## Memory Hierarchy {#sec:MemHierar} -To effectively utilize all the hardware resources provisioned in the CPU, the machine needs to be fed with the right data at the right time. Understanding the memory hierarchy is critically important to delivering the performance capabilities of a CPU. Most programs exhibit the property of locality: they don’t access all code or data uniformly. A CPU memory hierarchy is built on two fundamental properties: +To effectively utilize all the hardware resources provisioned in a CPU, the machine needs to be fed with the right data at the right time. Understanding the memory hierarchy is critically important to delivering the performance capabilities of a CPU. Most programs exhibit the property of locality: they don’t access all code or data uniformly. 
A CPU memory hierarchy is built on two fundamental properties: * **Temporal locality**: when a given memory location is accessed, the same location will likely be accessed again soon. Ideally, we want this information to be in the cache next time we need it. * **Spatial locality**: when a given memory location is accessed, nearby locations will likely be accessed soon. This refers to placing related data close to each other. When a program reads a single byte from memory, typically, a larger chunk of memory (a cache line) is fetched because very often, the program will require that data soon. @@ -9,13 +9,13 @@ This section provides a summary of the key attributes of memory hierarchy system ### Cache Hierarchy {#sec:CacheHierarchy} -A cache is the first level of the memory hierarchy for any request (for code or data) issued from the CPU pipeline. Ideally, the pipeline performs best with an infinite cache with the smallest access latency. In reality, the access time for any cache increases as a function of the size. Therefore, the cache is organized as a hierarchy of small, fast storage blocks closest to the execution units, backed up by larger, slower blocks. A particular level of the cache hierarchy can be used exclusively for code (instruction cache, i-cache) or for data (data cache, d-cache), or shared between code and data (unified cache). Furthermore, some levels of the hierarchy can be private to a particular core, while other levels can be shared among cores. +A cache is the first level of the memory hierarchy for any request (for code or data) issued from the CPU pipeline. Ideally, the pipeline performs best with an infinite cache with the smallest access latency. In reality, the access time for any cache increases as a function of the size. Therefore, the cache is organized as a hierarchy of small, fast storage blocks closest to the execution units, backed up by larger, slower blocks. A particular level of the cache hierarchy can be used exclusively for code (instruction cache, I-cache) or for data (data cache, D-cache), or shared between code and data (unified cache). Furthermore, some levels of the hierarchy can be private to a particular core, while other levels can be shared among cores. -Caches are organized as blocks with a defined block size (**cache line**). The typical cache line size in modern CPUs is 64 bytes. However, the notable exception here is the L2 cache in Apple processors (such as M1, M2 and later), which operates on 128B cache lines. Caches closest to the execution pipeline typically range in size from 32 KB to 128 KB. Mid-level caches tend to have 1MB and above. Last-level caches in modern CPUs can be tens or even hundreds of megabytes. +Caches are organized as blocks with a defined size, also known as *cache lines*. The typical cache line size in modern CPUs is 64 bytes. However, the notable exception here is the L2 cache in Apple processors (such as M1, M2 and later), which operates on 128B cache lines. Caches closest to the execution pipeline typically range in size from 32 KB to 128 KB. Mid-level caches tend to have 1MB and above. Last-level caches in modern CPUs can be tens or even hundreds of megabytes. #### Placement of Data within the Cache. -The address for a request is used to access the cache. In direct-mapped caches, a given block address can appear only in one location in the cache and is defined by a mapping function shown below. +The address for a request is used to access the cache. 
In *direct-mapped* caches, a given block address can appear only in one location in the cache and is defined by a mapping function shown below. $$ \textrm{Number of Blocks in the Cache} = \frac{\textrm{Cache Size}}{\textrm{Cache Block Size}} $$ @@ -23,9 +23,9 @@ $$ \textrm{Direct mapped location} = \textrm{(block address) mod (Number of Blocks in the Cache )} $$ -In a fully associative cache, a given block can be placed in any location in the cache. +In a *fully associative* cache, a given block can be placed in any location in the cache. -An intermediate option between direct mapping and fully associative mapping is a set-associative mapping. In such a cache, the blocks are organized as sets, typically each set containing 2, 4, 8 or 16 blocks. A given address is first mapped to a set. Within a set, the address can be placed anywhere, among the blocks in that set. A cache with m blocks per set is described as an m-way set-associative cache. The formulas for a set-associative cache are: +An intermediate option between direct mapping and fully associative mapping is a *set-associative* mapping. In such a cache, the blocks are organized as sets, typically each set containing 2, 4, 8 or 16 blocks. A given address is first mapped to a set. Within a set, the address can be placed anywhere, among the blocks in that set. A cache with m blocks per set is described as an m-way set-associative cache. The formulas for a set-associative cache are: $$ \textrm{Number of Sets in the Cache} = \frac{\textrm{Number of Blocks in the Cache}}{\textrm{Number of Blocks per Set (associativity)}} $$ @@ -67,17 +67,17 @@ Out of these options, most designs typically choose to implement a write-back ca #### Other Cache Optimization Techniques. -For a programmer, understanding the behavior of the cache hierarchy is critical to extracting performance from any application. From the perspective of the pipeline, the latency to access any request is given by the following formula that can be applied recursively to all the levels of the cache hierarchy up to the main memory: +For a programmer, understanding the behavior of the cache hierarchy is critical to extracting performance from any application. From the perspective of the CPU pipeline, the latency to access any request is given by the following formula that can be applied recursively to all the levels of the cache hierarchy up to the main memory: $$ \textrm{Average Access Latency} = \textrm{Hit Time } + \textrm{ Miss Rate } \times \textrm{ Miss Penalty} $$ -Hardware designers take on the challenge of reducing the hit time and miss penalty through many novel micro-architecture techniques. Fundamentally, cache misses stall the pipeline and hurt performance. The miss rate for any cache is highly dependent on the cache architecture (block size, associativity) and the software running on the machine. As a result, optimizing the miss rate becomes a hardware-software co-design effort. As discussed earlier, CPUs provide optimal hardware organization for caches. Additional techniques that can be implemented both in hardware and software to minimize cache miss rates are described below. +Hardware designers take on the challenge of reducing the hit time and miss penalty through many novel micro-architecture techniques. Fundamentally, cache misses stall the pipeline and hurt performance. The miss rate for any cache is highly dependent on the cache architecture (block size, associativity) and the software running on the machine. #### Hardware and Software Prefetching. 
#### Hardware and Software Prefetching. {#sec:HwPrefetch}

-One method to avoid a cache miss and the subsequent stall is to prefetch instructions as well as data into different levels of the cache hierarchy prior to when the pipeline demands. The assumption is the time to handle the miss penalty can be mostly hidden if the prefetch request is issued sufficiently ahead in the pipeline. Most CPUs provide implicit hardware-based prefetching that is complemented by explicit software prefetching that programmers can control.
+One method to avoid cache misses and subsequent stalls is to prefetch data into caches prior to when the pipeline demands it. The assumption is that the time to handle the miss penalty can be mostly hidden if the prefetch request is issued sufficiently ahead in the pipeline. Most CPUs provide implicit hardware-based prefetching that is complemented by explicit software prefetching that programmers can control.

-Hardware prefetchers observe the behavior of a running application and initiate prefetching on repetitive patterns of cache misses. Hardware prefetching can automatically adapt to the dynamic behavior of the application, such as varying data sets, and does not require support from an optimizing compiler or profiling support. Also, the hardware prefetching works without the overhead of additional address generation and prefetch instructions. However, hardware prefetching is limited to learning and prefetching for a limited set of cache-miss patterns.
+Hardware prefetchers observe the behavior of a running application and initiate prefetching on repetitive patterns of cache misses. Hardware prefetching can automatically adapt to the dynamic behavior of an application, such as varying data sets, and does not require support from an optimizing compiler. Also, hardware prefetching works without the overhead of additional address generation and prefetch instructions. However, hardware prefetching works only for a limited set of commonly used data access patterns.

Software memory prefetching complements prefetching done by hardware. Developers can specify which memory locations are needed ahead of time via a dedicated hardware instruction (see [@sec:memPrefetch]). Compilers can also automatically add prefetch instructions into the code to request data before it is required. Prefetch techniques need to balance between demand and prefetch requests to guard against prefetch traffic slowing down demand traffic.
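Explicit memory prefetching is discussed in detail in [@sec:memPrefetch]. As a small preview, the sketch below uses the `__builtin_prefetch` intrinsic available in GCC and Clang (x86 compilers also expose `_mm_prefetch`). The function, the data layout, and the prefetch distance are made up for this illustration and are not taken from the original text.

```cpp
#include <cstddef>

// Sums data[indices[i]] for i in [0, n). The indirection makes the access
// pattern irregular, which hardware prefetchers typically cannot predict.
long long gatherSum(const int* data, const int* indices, std::size_t n) {
  // Hypothetical prefetch distance; it must be tuned for a given machine.
  constexpr std::size_t dist = 16;
  long long sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (i + dist < n)
      // Ask the CPU to start fetching a future element: 0 = read access,
      // 3 = keep the data in all levels of the cache hierarchy.
      __builtin_prefetch(&data[indices[i + dist]], 0, 3);
    sum += data[indices[i]];
  }
  return sum;
}
```

Whether such hints pay off is highly workload- and machine-dependent; the prefetch distance in particular has to be tuned so that the data arrives just before it is needed.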
@@ -85,13 +85,13 @@ Software memory prefetching complements prefetching done by hardware. Developers

Main memory is the next level of the hierarchy, downstream from the caches. Requests to load and store data are initiated by the Memory Controller Unit (MCU). In the past, this circuit was located in the north bridge chip on the motherboard. But nowadays, most processors have this component embedded, so the CPU has a dedicated memory bus connecting it to the main memory.

-Main memory uses DRAM (Dynamic Random Access Memory), technology that supports large capacities at reasonable cost points. When comparing DRAM modules, people usually look at memory density and memory speed, along with its price of course.
+Main memory uses DRAM (Dynamic Random Access Memory) technology that supports large capacities at reasonable cost points. When comparing DRAM modules, people usually look at memory density and memory speed, along with their price, of course.

Memory density defines how much memory the module has, measured in GB. Obviously, the more memory available, the better, as it is a precious resource used by the OS and applications.

The performance of the main memory is described by latency and bandwidth. Memory latency is the time elapsed between the memory access request being issued and when the data is available to use by the CPU. Memory bandwidth defines how many bytes can be fetched per unit of time, and is usually measured in gigabytes per second.

#### DDR

-DDR (Double Data Rate) DRAM technology is the predominant DRAM technology supported by most CPUs. Historically, DRAM bandwidths have improved every generation while the DRAM latencies have stayed the same or even increased. Table @tbl:mem_rate shows the top data rate, peak bandwidth, and the corresponding reading latency for the last three generations of DDR technologies. The data rate is measured in millions of transfers per second (MT/s). The latencies shown in this table correspond to the latency in the DRAM device itself. Typically, the latencies as seen from the CPU pipeline (cache miss on a load to use) are higher (in the 50ns-150ns range) due to additional latencies and queuing delays incurred in the cache controllers, memory controllers, and on-die interconnects. You can see an example of measuring observed memory latency and bandwidth in [@sec:MemLatBw].
+DDR (Double Data Rate) is the predominant DRAM technology supported by most CPUs. Historically, DRAM bandwidths have improved every generation while the DRAM latencies have stayed the same or increased. Table @tbl:mem_rate shows the top data rate, peak bandwidth, and the corresponding reading latency for the last three generations of DDR technologies. The data rate is measured in millions of transfers per second (MT/s). The latencies shown in this table correspond to the latency in the DRAM device itself. Typically, the latencies as seen from the CPU pipeline (cache miss on a load to use) are higher (in the 50ns-150ns range) due to additional latencies and queuing delays incurred in the cache controllers, memory controllers, and on-die interconnects. You can see an example of measuring observed memory latency and bandwidth in [@sec:MemLatBw].

-----------------------------------------------------------------
DDR        Year    Highest Data     Peak Bandwidth    In-device Read

@@ -113,9 +113,9 @@ A DRAM module is organized as a set of DRAM chips. Memory *rank* is a term that

Each rank consists of multiple DRAM chips. Memory *width* defines how wide the bus of each DRAM chip is. And since each rank is 64-bits wide (or 72-bits wide for ECC RAM), it also defines the number of DRAM chips present within the rank. Memory width can be one of three values: `x4`, `x8` or `x16`, which define the width of the bus that goes to each chip. As an example, Figure @fig:Dram_ranks shows the organization of a 2Rx16 dual-rank DRAM DDR4 module, with a total of 2GB capacity. There are four chips in each rank, with a 16-bit wide bus. Combined, the four chips provide 64-bit output. The two ranks are selected one at a time through a rank-select signal.

-![Organization of 2Rx16 dual-rank DRAM DDR4 module, total 2GB capacity.](../../img/uarch/DRAM_ranks.png){#fig:Dram_ranks width=90%}
+![Organization of a 2Rx16 dual-rank DRAM DDR4 module with a total capacity of 2GB.](../../img/uarch/DRAM_ranks.png){#fig:Dram_ranks width=90%}
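To make the numbers in the figure concrete, here is the arithmetic behind a 2Rx16 module; the per-chip capacity simply follows from the 2GB total stated above:

$$
\textrm{Chips per Rank} = \frac{64 \textrm{ bits (rank width)}}{16 \textrm{ bits (chip width)}} = 4
$$

$$
\textrm{Capacity per Chip} = \frac{2 \textrm{ GB}}{2 \textrm{ ranks} \times 4 \textrm{ chips}} = 256 \textrm{ MB}
$$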
-There is no direct answer as to whether the performance of single-rank or dual-rank is better as it depends on the type of application. Switching from one rank to another through rank select signal needs additional clock cycles, which may increase the access latency. On the other hand, if a rank is not accessed, it can go through its refresh cycles in parallel while other ranks are busy. As soon as the previous rank completes data transmission, the next rank can immediately start its transmission. Also, single-rank modules produce less heat and are less likely to fail.
+There is no direct answer as to whether the performance of single-rank or dual-rank is better, as it depends on the type of application. Single-rank modules generally produce less heat and are less likely to fail. Also, multi-rank modules require a rank-select signal to switch from one rank to another, which needs additional clock cycles and may increase the access latency. On the other hand, if a rank is not accessed, it can go through its refresh cycles in parallel while other ranks are busy. As soon as the previous rank completes data transmission, the next rank can immediately start its transmission.

Going further, we can install multiple DRAM modules in a system to increase not only memory capacity but also memory bandwidth. Setups with multiple memory channels are used to scale up the communication speed between the memory controller and the DRAM.

@@ -130,9 +130,9 @@ $$
\textrm{Max. Memory Bandwidth} = \textrm{Data Rate } \times \textrm{ Bytes per cycle }
$$

-For example, for a single-channel DDR4 configuration, the data rate is `2400 MT/s` and 64 bits (8 bytes) can be transferred each memory cycle, thus the maximum bandwidth equals `2400 * 8 = 19.2 GB/s`. Dual-channel or dual memory controller setups double the bandwidth to `38.4 GB/s`. Remember though, those numbers are theoretical maximums that assume that a data transfer will occur at each memory clock cycle, which in fact never happens in practice. So, when measuring actual memory speed, you will always see a value lower than the maximum theoretical transfer bandwidth.
+For example, for a single-channel DDR4 configuration with a data rate of 2400 MT/s, where 64 bits (8 bytes) can be transferred each memory cycle, the maximum bandwidth equals `2400 * 8 = 19.2 GB/s`. Dual-channel or dual memory controller setups double the bandwidth to 38.4 GB/s. Remember, though, that those numbers are theoretical maximums that assume a data transfer occurs at each memory clock cycle, which never happens in practice. So, when measuring actual memory speed, you will always see a value lower than the maximum theoretical transfer bandwidth.

-To enable multi-channel configuration, you need to have a CPU and motherboard that support such an architecture and install an even number of identical memory modules in the correct memory slots on the motherboard. The quickest way to check the setup on Windows is by running a hardware identification utility like `CPU-Z` or `HwInfo`; on Linux, you can use the `dmidecode` command. Alternatively, you can run memory bandwidth benchmarks like Intel MLC or `Stream`.
+To enable a multi-channel configuration, you need to have a CPU and motherboard that support such an architecture and install an even number of identical memory modules in the correct memory slots on the motherboard. The quickest way to check the setup on Windows is by running a hardware identification utility like `CPU-Z` or `HwInfo`; on Linux, you can use the `dmidecode` command. Alternatively, you can run memory bandwidth benchmarks like Intel MLC or Stream.
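If you do not have those tools at hand, even a crude single-threaded copy kernel can show whether you are getting anywhere near the expected bandwidth. The C++ sketch below is only a rough approximation of what benchmarks like Stream do (no multiple threads, no repeated runs, and it counts only the explicit read and write traffic); the array size and names are chosen arbitrarily for the illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
  // Two arrays of 512 MB each, far larger than any CPU cache.
  constexpr std::size_t N = 1ull << 26; // 64M doubles
  std::vector<double> a(N, 1.0), b(N, 2.0);

  auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < N; ++i)
    a[i] = b[i];                        // copy: one read + one write per element
  auto end = std::chrono::steady_clock::now();

  double seconds = std::chrono::duration<double>(end - start).count();
  double gbytes  = 2.0 * N * sizeof(double) / 1e9; // explicit traffic only
  // Print an element of the destination so the copy cannot be optimized away.
  std::printf("Copy bandwidth: %.1f GB/s (check: %.1f)\n", gbytes / seconds, a[N / 2]);
  return 0;
}
```

Compile it with optimizations enabled (e.g., `-O2`) and expect a number noticeably below the theoretical peak computed above; a single thread and the extra write-allocate traffic both keep the measured figure conservative.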
To make use of multiple memory channels in a system, there is a technique called interleaving. It spreads adjacent addresses within a page across multiple memory devices. An example of 2-way interleaving for sequential memory accesses is shown in Figure @fig:Dram_channel_interleaving. As before, we have a dual-channel memory configuration (channels A and B) with two independent memory controllers. Modern processors interleave per four cache lines (256 bytes), i.e., the first four adjacent cache lines go to channel A, and then the next set of four cache lines go to channel B.

@@ -150,4 +150,4 @@ GDDR was primarily designed for graphics and nowadays it is used on virtually ev

HBM is a new type of CPU/GPU memory that vertically stacks memory chips, also called 3D stacking. Similar to GDDR, HBM drastically shortens the distance data needs to travel to reach a processor. The main difference from DDR and GDDR is that the HBM memory bus is very wide: 1024 bits for each HBM stack. This enables HBM to achieve ultra-high bandwidth. The latest HBM3 standard supports up to 665 GB/s bandwidth per package. It also operates at a low frequency of 500 MHz and has a memory density of up to 48 GB per package.

-A system with HBM onboard will be a good choice if you're looking to get as much memory bandwidth as you can get. However, at the time of writing, this technology is quite expensive. As GDDR is predominantly used in graphics cards, HBM may be a good option to accelerate certain workloads that run on a CPU. In fact, the first x86 general-purpose server chips with integrated HBM are now available.
\ No newline at end of file
+A system with HBM onboard will be a good choice if you want to maximize data transfer throughput. However, at the time of writing, this technology is quite expensive. As GDDR is predominantly used in graphics cards, HBM may be a good option to accelerate certain workloads that run on a CPU. In fact, the first x86 general-purpose server chips with integrated HBM are now available.
\ No newline at end of file

diff --git a/chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md b/chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md
index b0eab1841b..bc6fbb65d6 100644
--- a/chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md
+++ b/chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md
@@ -2,9 +2,9 @@

Virtual memory is the mechanism to share the physical memory attached to a CPU with all the processes executing on the CPU. Virtual memory provides a protection mechanism that prevents other processes from accessing the memory allocated to a given process. Virtual memory also provides relocation, which is the ability to load a program anywhere in physical memory without changing the addresses in the program.

-In a CPU that supports virtual memory, programs use virtual addresses for their accesses. But while user code operates on virtual addresses, retrieving data from memory requires physical addresses. Also, to effectively manage the scarce physical memory, it is divided into pages. Thus, applications operate on a set of pages that an operating system has provided.
+In a CPU that supports virtual memory, programs use virtual addresses for their accesses. But while user code operates on virtual addresses, retrieving data from memory requires physical addresses. Also, to effectively manage the scarce physical memory, it is divided into *pages*. Thus, applications operate on a set of pages that an operating system has provided.
-Address translation is required for accessing data as well as code (instructions). The mechanism for a system with a page size of 4KB is shown in Figure @fig:VirtualMem. The virtual address is split into two parts. The virtual page number (52 most significant bits) is used to index into the page table to produce a mapping between the virtual page number and the corresponding physical page. To offset within a 4KB page we need 12 bits; as already stated, the other 52 bits of a 64-bit pointer are used for the address of the page itself. Notice that the offset within a page (12 least significant bits) does not require translation, and it is used "as-is" to access the physical memory location.
+Virtual-to-physical address translation is required for accessing data as well as code (instructions). The translation mechanism for a system with a page size of 4KB is shown in Figure @fig:VirtualMem. The virtual address is split into two parts. The virtual page number (52 most significant bits) is used to index into the page table to produce a mapping between the virtual page number and the corresponding physical page. The 12 least significant bits are used as the offset within a 4KB page. These bits do not require translation and are used "as-is" to access the physical memory location.

![Virtual-to-physical address translation for 4KB pages.](../../img/uarch/VirtualMem.png){#fig:VirtualMem width=80%}

@@ -12,13 +12,13 @@ The page table can be either single-level or nested. Figure @fig:L2PageTables sh

![Example of a 2-level page table.](../../img/uarch/L2PageTables.png){#fig:L2PageTables width=90%}

-A nested page table is a radix tree that keeps physical page addresses along with some metadata. To find a translation for such a 2-level page table, we first use bits 32..47 as an index into the Level-1 page table also known as the *page table directory*. Every descriptor in the directory points to one of the 2^16^ blocks of Level-2 tables. Once we find the appropriate L2 block, we use bits 12..31 to find the physical page address. Concatenating it with the page offset (bits 0..11) gives us the physical address, which can be used to retrieve the data from the DRAM.
+A nested page table is a radix tree that keeps physical page addresses along with some metadata. To find a translation within a 2-level page table, we first use bits 32..47 as an index into the Level-1 page table, also known as the *page table directory*. Every descriptor in the directory points to one of the 2^16^ blocks of Level-2 tables. Once we find the appropriate L2 block, we use bits 12..31 to find the physical page address. Concatenating it with the page offset (bits 0..11) gives us the physical address, which can be used to retrieve the data from DRAM.

-The exact format of the page table is dictated by the CPU for reasons we will discuss a few paragraphs later. Thus the variations of page table organization are limited by what a CPU supports. Nowadays it is common to see 4- and 5-level page tables. Modern CPUs support both 4-level page tables with 48-bit pointers (256 TB of total memory) and 5-level page tables with 57-bit pointers (128 PB of total memory).
+The exact format of the page table is dictated by the CPU for reasons we will discuss a few paragraphs later. Thus, the variations of page table organization are limited by what a CPU supports. Modern CPUs support both 4-level page tables with 48-bit pointers (256 TB of total memory) and 5-level page tables with 57-bit pointers (128 PB of total memory).
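To make the bit arithmetic of the 2-level scheme concrete, below is a simplified C++ sketch. The structures and names are hypothetical, and real page-table descriptors also carry permission bits and other metadata that are ignored here.

```cpp
#include <cstdint>

// Field boundaries follow the 2-level example in the text (4KB pages):
// bits 0..11 offset, bits 12..31 Level-2 index, bits 32..47 Level-1 index.
struct VirtAddr {
  uint64_t offset;   // bits 0..11, used "as-is"
  uint64_t l2Index;  // bits 12..31
  uint64_t l1Index;  // bits 32..47
};

VirtAddr decompose(uint64_t vaddr) {
  return VirtAddr{ vaddr        & 0xFFF,      // 12 bits
                  (vaddr >> 12) & 0xFFFFF,    // 20 bits
                  (vaddr >> 32) & 0xFFFF };   // 16 bits
}

// pageDirectory has 2^16 entries; each points to a Level-2 table whose
// entries hold page-aligned physical page addresses.
uint64_t translate(const uint64_t* const* pageDirectory, uint64_t vaddr) {
  VirtAddr v = decompose(vaddr);
  const uint64_t* l2Table = pageDirectory[v.l1Index];
  uint64_t physPage = l2Table[v.l2Index];
  return physPage + v.offset; // concatenate the page address with the offset
}
```

On mainstream CPUs this walk is performed by dedicated hardware rather than by software, but the index arithmetic is the same.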
Breaking the page table into multiple levels doesn't change the total addressable memory. However, a nested approach does not require storing the entire page table as a contiguous array and does not allocate blocks that have no descriptors. This saves memory space but adds overhead when traversing the page table.

-Failure to provide a physical address mapping is called a *page fault*. It occurs if a requested page is invalid or is not currently in main memory. The two most common reasons are: 1) the OS committed to allocating a page but hasn't yet backed it with a physical page, and 2) an accessed page was swapped out to disk and is not currently stored in RAM.
+Failure to provide a physical address mapping is called a *page fault*. It occurs if a requested page is invalid or is not currently in the main memory. The two most common reasons are: 1) the OS committed to allocating a page but hasn't yet backed it with a physical page, and 2) an accessed page was swapped out to disk and is not currently stored in RAM.

### Translation Lookaside Buffer (TLB) {#sec:TLBs}