Working on chapter 4 case study
dbakhval authored and dbakhval committed Feb 22, 2023
1 parent 47e939e commit dc9c5c1
Showing 4 changed files with 104 additions and 110 deletions.
@@ -0,0 +1,98 @@
## Case Study {#sec:PerfMetricsCaseStudy}

Putting together everything we discussed so far in this chapter, we ran four benchmarks from different domains and calculated their performance metrics. First, let's introduce the benchmarks.

1. Blender 3.4 - an open-source 3D creation and modeling software project. This test measures Blender's Cycles rendering performance with the BMW27 blend file. All HW threads are used. URL: https://download.blender.org/release/. Command line: `./blender -b bmw27_cpu.blend -noaudio --enable-autoexec -o output.test -x 1 -F JPEG -f 1`.
2. Stockfish 15 - an advanced open-source chess engine. This test runs Stockfish's built-in benchmark. A single HW thread is used. URL: https://stockfishchess.org/. Command line: `./stockfish bench 128 1 24 default depth`.
3. Clang 15 selfbuild - this test uses Clang 15 to build the Clang 15 compiler from sources. All HW threads are used. URL: https://www.llvm.org/. Command line: `ninja -j16 clang`.
4. CloverLeaf 2018 - a Lagrangian-Eulerian hydrodynamics benchmark. All HW threads are used. This test uses the `clover_bm.in` input file (Problem 5). URL: http://uk-mac.github.io/CloverLeaf/. Command line: `./clover_leaf`.

Machine characteristics:

* 12th Gen Intel(R) Core(TM) i7-1260P (Alder Lake) CPU @ 2.10GHz (4.70GHz Turbo), 4P+8E cores, 18MB L3-cache
* 16 GB RAM, DDR4 @ 2400 MT/s
* 256GB NVMe PCIe M.2 SSD
* 64-bit Ubuntu 22.04.1 LTS (Jammy Jellyfish)

To collect performance metrics, we use the `toplev.py` script, which is part of [pmu-tools](https://github.com/andikleen/pmu-tools)[^1] written by Andi Kleen:

```bash
$ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v -- <app with args>
```
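
For example, here is the complete invocation for the Stockfish benchmark (the `pmu-tools` path is specific to our machine):

```bash
$ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v -- ./stockfish bench 128 1 24 default depth
```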

--------------------------------------------------------------------------
Metric Core Blender Stockfish Clang15- CloverLeaf
Name Type selfbuild
-------------- ------- ------------ ------------ ------------ ------------
Instructions P-core 6.02E+12 6.59E+11 2.40E+13 1.06E+12

Core Cycles P-core 4.31E+12 3.65E+11 3.78E+13 5.25E+12

IPC P-core 1.40 1.80 0.64 0.20

CPI P-core 0.72 0.55 1.57 4.96

Instructions E-core 4.97E+12 0 1.43E+13 1.11E+12

Core Cycles E-core 3.73E+12 0 3.19E+13 4.28E+12

IPC E-core 1.33 0 0.45 0.26

CPI E-core 0.75 0 2.23 3.85

L1MPKI P-core 3.88 21.38 6.01 13.44

L2MPKI P-core 0.15 1.67 1.09 3.58

L3MPKI P-core 0.04 0.14 0.56 3.43

Branch Mispred E-core 0.02 0.08 0.03 0.01
Ratio

Code STLB MPKI P-core 0 0.01 0.35 0.01

Load STLB MPKI P-core 0.08 0.04 0.51 0.03

Store STLB P-core 0 0.01 0.06 0.1
MPKI

Load Miss Real P-core 12.92 10.37 76.7 253.89
Latency

ILP P-core 3.67 3.65 2.93 2.53

MLP P-core 1.61 2.62 1.57 2.78

DRAM BW Use All 1.58 1.42 10.67 24.57

IpCall All 176.8 153.5 40.9 2,729

IpBranch All 9.8 10.1 5.1 18.8

IpLoad All 3.2 3.3 3.6 2.7

IpStore All 7.2 7.7 5.9 22.0

IpMispredict All 610.4 214.7 177.7 2,416

IpFLOP All 1.1 1.82E+06 286,348 1.8

IpArith All 4.5 7.96E+06 268,637 2.1

IpArith All 22.9 4.07E+09 280,583 2.60E+09
Scalar SP

IpArith All 438.2 1.22E+07 4.65E+06 2.2
Scalar DP

IpArith AVX128 All 6.9 0.0 1.09E+10 1.62E+09

IpArith AVX256 All 30.3 0.0 0.0 39.6

IpSWPF All 90.2 2,565 105,933 172,348
--------------------------------------------------------------------------

Table: Performance metrics of the four case-study benchmarks. {#tbl:perf_metrics_case_study}
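
As a quick sanity check, the IPC and CPI rows in the table follow directly from the Instructions and Core Cycles rows. For example, for Blender on P-cores (the powers of ten cancel, and the table rounds to two digits):

```bash
$ echo "scale=3; 6.02 / 4.31" | bc   # IPC = Instructions / Core Cycles
1.396
$ echo "scale=3; 4.31 / 6.02" | bc   # CPI = Core Cycles / Instructions
.715
```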


[^1]: pmu-tools - [https://github.com/andikleen/pmu-tools](https://github.com/andikleen/pmu-tools).
6 changes: 3 additions & 3 deletions chapters/4-Terminology-And-Metrics/4-9 Performance Metrics.md
@@ -1,6 +1,6 @@
## Performance Metrics {#sec:PerfMetrics}

In addition to the performance events that we discussed earlier in this chapter, performance engineers frequently use metrics, which are built on top of raw events. Table {@tbl:secondary_metrics} shows a list of metrics for Intel Alderlake platform along with descriptions and formulas. The list is not exhaustive, but it shows the most important metrics. Complete list of metrics for Intel CPUs and their formulas can be found in [TMA_metrics.xlsx](https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx)[^1]. The last section in this chapter shows how performance metrics can be used in practice.
In addition to the performance events that we discussed earlier in this chapter, performance engineers frequently use metrics, which are built on top of raw events. Table {@tbl:perf_metrics} shows a list of metrics for Intel's 12th-gen Golden Cove architecture along with descriptions and formulas. The list is not exhaustive, but it shows the most important metrics. A complete list of metrics for Intel CPUs and their formulas can be found in [TMA_metrics.xlsx](https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx)[^1]. The last section in this chapter shows how performance metrics can be used in practice.

--------------------------------------------------------------------------
Metric Description Formula
@@ -100,9 +100,9 @@ SWPF prefetch instruction SW_PREFETCH_ACCESS.T0:u0xF
(of any type)
--------------------------------------------------------------------------

Table: A list (not exhaustive) of secondary metrics along with descriptions and formulas for Intel Alderlake platforms. {#tbl:secondary_metrics}
Table: A list (not exhaustive) of secondary metrics along with descriptions and formulas for the Intel Golden Cove architecture. {#tbl:perf_metrics}

A few notes on those metrics. First, ILP and MLP metrics do not represent theoretical maximums for an application, rather they measure ILP and MLP on a particular machine. On an ideal machine with infinite resources numbers will be higher. Second, all metrics besides "DRAM BW Use" are fractions. We can apply fairly straightforward reasoning to each of them to tell whether a particular metric is high or low. But to make sense of "DRAM BW Use" metric, we would like to know if a program saturates the memory bandwidth or not. We will discuss how to find out peak DRAM BW in the next section.
A few notes on those metrics. First, the ILP and MLP metrics do not represent theoretical maximums for an application; rather, they measure ILP and MLP on a given machine. On an ideal machine with infinite resources, these numbers would be higher. Second, all metrics besides "DRAM BW Use" and "Load Miss Real Latency" are fractions; we can apply fairly straightforward reasoning to each of them to tell whether a specific metric is high or low. But to make sense of the "DRAM BW Use" and "Load Miss Real Latency" metrics, we need to put them in context. For the former, we would like to know whether a program saturates the memory bandwidth. The latter gives you an idea of the average cost of a cache miss, which is useless by itself unless you know the latencies of the cache hierarchy. We will discuss how to find out cache latencies and peak memory bandwidth in the next section.
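
As a back-of-the-envelope illustration, here is a sketch of how "DRAM BW Use" can be put in context, assuming the memory of the case-study machine (DDR4 @ 2400 MT/s) runs in a dual-channel configuration (the channel count is our assumption):

```bash
# Theoretical peak DRAM BW = transfer rate * 8 bytes per transfer * channels.
# Dual-channel DDR4-2400 (assumed configuration):
$ echo "scale=1; 2400 * 8 * 2 / 1000" | bc
38.4
```

Against such a 38.4 GB/s peak, CloverLeaf's 24.57 GB/s in the case study would mean the program consumes a large fraction of the available memory bandwidth.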

The formulas in the table give an intuition for how performance metrics are calculated, so you can build similar metrics on another platform as long as the underlying performance events are available there. Some tools can report performance metrics automatically. If not, you can always calculate those metrics manually since you know the formulas and the corresponding performance events that must be collected.
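
For example, here is a minimal sketch of computing IPC and CPI manually on Linux with the generic `instructions` and `cycles` events (`./a.out` is a placeholder for your workload):

```bash
# Collect raw counts in CSV form (-x,); perf stat writes to stderr.
$ perf stat -x, -e instructions,cycles -- ./a.out 2> counts.csv
# Field 1 is the counter value, field 3 is the event name.
$ awk -F, '$3=="instructions" {i=$1} $3=="cycles" {c=$1}
           END {printf "IPC=%.2f CPI=%.2f\n", i/c, c/i}' counts.csv
```

Note that a plain `perf stat` run already prints an `insn per cycle` ratio in its default output; the `awk` step merely makes the formula explicit.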

104 changes: 0 additions & 104 deletions chapters/4-Terminology-And-Metrics/4-9 Secondary Metrics.md

This file was deleted.

@@ -18,7 +18,7 @@ In a real-world application, performance could be limited by multiple factors. E

The top two levels of TMA metrics are expressed as a percentage of all pipeline slots (see [@sec:PipelineSlot]) that were available during the execution of the program; for example, a machine that can issue four micro-operations per cycle has four pipeline slots available each cycle. This allows TMA to give an accurate representation of CPU microarchitecture utilization, taking into account the full bandwidth of the processor.

After we identified the performance bottleneck in the program, we would be interested to know where exactly in the code it is happening. The second stage of TMA is locating the source of the problem down to the exact line of code and assembly instruction. Analysis methodology provides exact PMC that one should use for each category of the performance problem. Then the developer can use this PMC to find the area in the source code that contributes to the most critical performance bottleneck identified by the first stage. This correspondence can be found in [TMA metrics](https://download.01.org/perfmon/TMA_Metrics.xlsx)[^2] table in "Locate-with" column. For example, to locate the bottleneck associated with a high `DRAM_Bound` metric in an application running on the Intel Skylake processor, one should sample on `MEM_LOAD_RETIRED.L3_MISS_PS` performance event.
After we identify the performance bottleneck in the program, we are interested to know where exactly in the code it happens. The second stage of TMA is locating the source of the problem down to the exact line of code and assembly instruction. The analysis methodology provides the exact PMC that one should use for each category of performance problem. The developer can then use this PMC to find the area in the source code that contributes to the most critical performance bottleneck identified by the first stage. This correspondence can be found in the "Locate-with" column of the [TMA metrics](https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx)[^2] table. For example, to locate the bottleneck associated with a high `DRAM_Bound` metric in an application running on an Intel Skylake processor, one should sample on the `MEM_LOAD_RETIRED.L3_MISS_PS` performance event.

### TMA in Intel® VTune™ Profiler

@@ -130,7 +130,7 @@ According to the definition of `CYCLE_ACTIVITY.STALLS_L3_MISS`, it counts cycles

As the second step in the TMA process, we locate the place in the code where the bottleneck occurs most frequently. To do so, one should sample the workload using a performance event that corresponds to the type of bottleneck identified during Step 1.

A recommended way to find such an event is to run `toplev` tool with the `--show-sample` option that will suggest the `perf record` command line that can be used to locate the issue. For the purpose of understanding the mechanics of TMA, we also present the manual way to find an event associated with a particular performance bottleneck. Correspondence between performance bottlenecks and performance events that should be used for locating the place in the code where such bottlenecks take place can be done with the help of [TMA metrics](https://download.01.org/perfmon/TMA_Metrics.xlsx)[^2] table introduced earlier in the chapter. The `Locate-with` column denotes a performance event that should be used to locate the exact place in the code where the issue occurs. For the purpose of our example, in order to find memory accesses that contribute to such a high value of the `DRAM_Bound` metric (miss in the L3 cache), we should sample on `MEM_LOAD_RETIRED.L3_MISS_PS` precise event as shown in the listing below:
A recommended way to find such an event is to run the `toplev` tool with the `--show-sample` option, which will suggest the `perf record` command line that can be used to locate the issue. For the purpose of understanding the mechanics of TMA, we also present the manual way to find an event associated with a particular performance bottleneck. The correspondence between performance bottlenecks and the performance events that should be used to locate them in the code can be established with the help of the TMA metrics table introduced earlier in the chapter. The `Locate-with` column denotes a performance event that should be used to locate the exact place in the code where the issue occurs. For our example, in order to find memory accesses that contribute to such a high value of the `DRAM_Bound` metric (miss in the L3 cache), we should sample on the `MEM_LOAD_RETIRED.L3_MISS_PS` precise event, as shown in the listing below:

```bash
$ perf record -e cpu/event=0xd1,umask=0x20,name=MEM_LOAD_RETIRED.L3_MISS/ppp ./a.out
@@ -232,7 +232,7 @@ At the time of this writing, the first level of TMA metrics is also available on
- Toplev manual, URL: [https://github.com/andikleen/pmu-tools/wiki/toplev-manual](https://github.com/andikleen/pmu-tools/wiki/toplev-manual).
- Understanding How General Exploration Works in Intel® VTune™ Profiler, URL: [https://software.intel.com/en-us/articles/understanding-how-general-exploration-works-in-intel-vtune-amplifier-xe](https://software.intel.com/en-us/articles/understanding-how-general-exploration-works-in-intel-vtune-amplifier-xe).
[^2]: TMA metrics - [https://download.01.org/perfmon/TMA_Metrics.xlsx](https://download.01.org/perfmon/TMA_Metrics.xlsx).
[^2]: TMA metrics - [https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx](https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx).
[^3]: VTune microarchitecture analysis - [https://software.intel.com/en-us/vtune-help-general-exploration-analysis](https://software.intel.com/en-us/vtune-help-general-exploration-analysis). In pre-2019 versions of Intel® VTune Profiler, it was called “General Exploration” analysis.
[^4]: 7zip benchmark - [https://github.com/llvm-mirror/test-suite/tree/master/MultiSource/Benchmarks/7zip](https://github.com/llvm-mirror/test-suite/tree/master/MultiSource/Benchmarks/7zip).
[^5]: Linux `perf stat` manual page - [http://man7.org/linux/man-pages/man1/perf-stat.1.html#STAT_REPORT](http://man7.org/linux/man-pages/man1/perf-stat.1.html#STAT_REPORT).
