[Chapter3] Cosmetics. part1
dendibakh committed Sep 6, 2024
1 parent 0aaeb4b commit ef0c14a
Showing 8 changed files with 67 additions and 62 deletions.
@@ -2,4 +2,4 @@

# CPU Microarchitecture {#sec:uarch}

This chapter provides a brief summary of the critical CPU microarchitecture features that have a direct impact on software performance. The goal of this chapter is not to cover all the details and trade-offs of CPU architectures, which are already covered extensively in the literature [@Hennessy]. Instead, this chapter provides a recap of the CPU hardware features that are present in modern processors.
6 changes: 3 additions & 3 deletions chapters/3-CPU-Microarchitecture/3-1 ISA.md
@@ -1,9 +1,9 @@
## Instruction Set Architecture

The instruction set architecture (ISA) is the contract between the software and the hardware, which defines the rules of communication. Intel x86-64,[^1] ARM v8, and RISC-V are examples of current-day ISAs that are widely deployed. All of these are 64-bit architectures, i.e., all address computations use 64 bits. ISA developers and CPU architects typically ensure that software or firmware conforming to the specification will execute on any processor built using the specification. Widely deployed ISAs also typically ensure backward compatibility such that code written for the GenX version of a processor will continue to execute on GenX+i.

Most modern architectures can be classified as general-purpose register-based, load-store architectures, such as RISC-V and ARM, where operands are explicitly specified and memory is accessed only using load and store instructions. The x86 ISA is a register-memory architecture, where operations can be performed both on registers and on memory operands. In addition to providing basic functions in an ISA such as load, store, control, and scalar arithmetic operations using integers and floating-point values, the widely deployed architectures continue to enhance their ISAs to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE, RISC-V "V" vector extension) and matrix/tensor instructions (Intel AMX, ARM SME). Software mapped to use these advanced instructions typically gains orders-of-magnitude performance improvements.

Modern CPUs support 32-bit and 64-bit precision for floating-point and integer arithmetic operations. With the fast-evolving field of machine learning and AI, the industry has a renewed interest in alternative numeric formats for variables to drive significant performance improvements. Research has shown that machine learning models perform just as well using fewer bits to represent variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added support for lower-precision data types, such as 8-bit integers (int8) and 16-bit floating-point (fp16, bf16), in the ISA, in addition to the traditional 32-bit and 64-bit formats for arithmetic operations.
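As a rough illustration of why lower-precision formats are cheap to support, bf16 is simply the upper 16 bits of an IEEE-754 float32 (same sign and 8-bit exponent, only 7 mantissa bits). The sketch below (not from the book, Python used only for illustration) shows the truncation and what precision survives:

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Truncate a float32 to bf16: keep the high 16 bits (sign, 8 exponent
    bits, top 7 mantissa bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bf16_bits(b: int) -> float:
    """Re-expand bf16 bits to float32 by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

pi = 3.14159265
approx = from_bf16_bits(to_bf16_bits(pi))
print(approx)  # close to pi, but only ~2-3 significant decimal digits survive
```

Because bf16 keeps the float32 exponent range, conversions are trivial in hardware, which is one reason it became popular for machine learning workloads.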

[^1]: In the book we sometimes write x86 for brevity, but we mean x86-64, the 64-bit version of the x86 instruction set, first announced in 1999.
19 changes: 9 additions & 10 deletions chapters/3-CPU-Microarchitecture/3-2 Pipelining.md
@@ -12,18 +12,20 @@ Pipelining is the foundational technique used to make CPUs fast wherein multiple

Figure @fig:Pipelining shows an ideal pipeline view of the 5-stage pipeline CPU. In cycle 1, instruction x enters the IF stage of the pipeline. In the next cycle, as instruction x moves to the ID stage, the next instruction in the program enters the IF stage, and so on. Once the pipeline is full, as in cycle 5 above, all pipeline stages of the CPU are busy working on different instructions. Without pipelining, instruction `x+1` couldn't start its execution until after instruction `x` had finished its work.

Modern high-performance CPUs have multiple pipeline stages, often ranging from 10 to 20 or more, depending on the architecture and design goals. This involves a much more complicated design than the simple 5-stage pipeline introduced earlier. For example, the decode stage may be split into several new stages. We may also add new stages before the execute stage to buffer decoded instructions, and so on.

The *throughput* of a pipelined CPU is defined as the number of instructions that complete and exit the pipeline per unit of time. The *latency* for any given instruction is the total time through all the stages of the pipeline. Since all the stages of the pipeline are linked together, each stage must be ready to move to the next instruction in lockstep. The time required to move an instruction from one stage to the next defines the basic machine *cycle* or clock for the CPU. The value chosen for the clock for a given pipeline is defined by the slowest stage of the pipeline. CPU hardware designers strive to balance the amount of work that can be done in a stage as this directly affects the frequency of operation of the CPU.
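To make the "slowest stage defines the clock" point concrete, here is a small sketch with hypothetical per-stage delays (the numbers are made up for illustration, not taken from any real design):

```python
# Hypothetical delays (ns) for the 5 stages: IF, ID, EXE, MEM, WB.
stage_delays_ns = [0.9, 1.0, 0.7, 1.1, 0.8]

# The slowest stage sets the cycle time for the whole machine.
cycle_ns = max(stage_delays_ns)

# Latency of one instruction: it passes through every stage, one cycle each.
latency_ns = cycle_ns * len(stage_delays_ns)

# Throughput once the pipeline is full: one instruction per cycle.
throughput_per_ns = 1 / cycle_ns

print(cycle_ns, latency_ns)
```

Note that shaving the 1.1 ns stage down to 1.0 ns would speed up every instruction, while making the 0.7 ns stage faster would change nothing; this is why designers balance work across stages.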

In an ideal pipeline that is perfectly balanced and doesn’t incur any stalls, the time per instruction in the pipelined machine is calculated as
$$
\textrm{Time per instruction on pipelined machine} = \frac{\textrm{Time per instr. on nonpipelined machine}}{\textrm{Number of pipe stages}}
$$
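Plugging hypothetical numbers into this formula: a nonpipelined machine that takes 5 ns per instruction, split into a 5-stage pipeline, approaches 1 ns per instruction once the pipeline is full (the fill cost is paid only once):

```python
STAGES = 5
time_per_instr_nonpipelined_ns = 5.0  # hypothetical, for illustration

# Ideal pipelined time per instruction:
ideal_ns = time_per_instr_nonpipelined_ns / STAGES  # 1.0 ns

# Completing N instructions costs (STAGES - 1) cycles to fill the
# pipeline, after which one instruction retires every cycle:
def cycles_to_complete(n, stages=STAGES):
    return stages - 1 + n

print(ideal_ns, cycles_to_complete(100))  # 1.0 104
```

So 100 instructions finish in 104 cycles instead of 500, and the speedup asymptotically approaches the number of pipeline stages.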
In real implementations, pipelining introduces several constraints that limit the ideal model shown above. Pipeline hazards prevent the ideal pipeline behavior, resulting in stalls. The three classes of hazards are structural hazards, data hazards, and control hazards. Luckily for the programmer, in modern CPUs, all classes of hazards are handled by the hardware.

\lstset{linewidth=10cm}


* **Structural hazards**: are caused by resource conflicts, i.e., when two instructions in different pipeline stages need the same hardware resource in the same cycle. For example, with a single memory port shared by instruction fetch and data access, a load accessing memory conflicts with fetching the next instruction, so one of them must stall. To a large extent, such hazards could be eliminated by replicating the hardware resources, such as using multi-ported registers or memories. However, eliminating all such hazards could potentially become quite expensive in terms of silicon area and power.

* **Data hazards**: are caused by data dependencies in the program and are classified into three types:
@@ -37,10 +39,10 @@ In real implementations, pipelining introduces several constraints that limit th

There is a RAW dependency for register R1. If we take the value directly after the addition `R0 ADD 1` is done (from the `EXE` pipeline stage), we don't need to wait until the `WB` stage finishes and the value is written to the register file. Bypassing saves a few cycles. The longer the pipeline, the more effective bypassing becomes.

A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read. A WAR hazard is not a true dependency and is eliminated by a technique called *register renaming*. It is a technique that abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (*architectural*) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of the *architectural state*, solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example:

```
; machine code, WAR hazard ; after register renaming
; (architectural registers) ; (physical registers)
R1 = R0 ADD 1 => R101 = R100 ADD 1
R0 = R2 ADD 2 R103 = R102 ADD 2
```

@@ -51,7 +53,7 @@ In real implementations, pipelining introduces several constraints that limit th
A *write-after-write* (WAW) hazard requires a dependent write to execute after a write. It occurs when instruction `x+1` writes to a register before instruction `x` writes to the same register, resulting in the wrong order of writes. WAW hazards are also eliminated by register renaming, allowing both writes to execute in any order while preserving the correct final result. Below is an example of eliminating WAW hazards.

```
; machine code, WAW hazard ; after register renaming
; (architectural registers)  ; (physical registers)
R2 = R1 => R102 = R101
R2 = 0 R103 = 0
```

@@ -61,7 +63,4 @@ In real implementations, pipelining introduces several constraints that limit th

* **Control hazards**: are caused by changes in the program flow. They arise from pipelining branches and other instructions that change the program flow. The branch condition that determines the direction of the branch (taken vs. not taken) is resolved in the execute pipeline stage. As a result, the fetch of the next instruction cannot be pipelined unless the control hazard is eliminated. Techniques such as dynamic branch prediction and speculative execution, described in the next section, are used to overcome control hazards.
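The register-renaming idea from the WAR/WAW examples above can be mimicked with a small simulation. This is an illustrative sketch only (the instruction encoding and physical register numbering are made up, not the book's notation): every write allocates a fresh physical register, which removes WAR and WAW hazards by construction while preserving true RAW dependencies.

```python
def rename(program):
    """Rename architectural registers to physical registers.

    Each instruction is (dest, sources). A write always allocates a new
    physical register, so no later write can clobber a value that an
    earlier instruction still needs to read (no WAR/WAW hazards).
    """
    mapping = {}      # architectural register -> current physical register
    next_phys = 100
    renamed = []
    for dest, sources in program:
        # Reads use the current mapping; a true RAW dependency survives.
        phys_sources = [mapping.get(s, s) for s in sources]
        # The write gets a brand-new physical register.
        mapping[dest] = f"R{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], phys_sources))
    return renamed

# WAR hazard from the text: R1 = R0 ADD 1; R0 = R2 ADD 2
war = [("R1", ["R0"]), ("R0", ["R2"])]
print(rename(war))  # the second write no longer touches the R0 read by instr 1

# WAW hazard from the text: R2 = R1; R2 = 0
waw = [("R2", ["R1"]), ("R2", [])]
print(rename(waw))  # the two writes go to different physical registers
```

A real CPU does this in hardware with a register alias table and a free list of physical registers, but the mapping discipline is the same.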

\lstset{linewidth=\textwidth}