[Chapter4] Small fixes.
dendibakh committed Sep 14, 2024
1 parent 1ac2b43 commit 1c7aede
Showing 2 changed files with 4 additions and 6 deletions.
@@ -6,7 +6,7 @@
2. What is the difference between retired and executed instructions?
3. When you increase the frequency, does IPC go up, down, or stay the same?
4. Take a look at the `DRAM BW Use` formula in Table {@tbl:perf_metrics}. Why do you think there is a constant `64`?
-5. Measure the bandwidth and latency of the cache hierarchy and memory on the machine you use for development/benchmarking using MLC, Stream or other tools.
+5. Measure the bandwidth and latency of the cache hierarchy and memory on the machine you use for development/benchmarking using Intel MLC, Stream or other tools.
6. Run the application that you're working with on a daily basis. Collect performance metrics. Does anything surprise you?
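To help reason about question 4, here is a minimal sketch of how a `DRAM BW Use` metric is typically computed from DRAM access counters. The function name and the sample counter values are hypothetical, and it assumes the constant `64` relates to the size of one transfer in bytes:

```python
# Hypothetical sketch of a DRAM bandwidth-use calculation.
# Assumption: each DRAM access moves one 64-byte chunk (the constant in the formula).
TRANSFER_BYTES = 64

def dram_bw_use_gb_s(dram_reads: int, dram_writes: int, seconds: float) -> float:
    """Bytes moved = accesses * 64; divide by 1e9 for GB and by elapsed time for GB/s."""
    return (dram_reads + dram_writes) * TRANSFER_BYTES / 1e9 / seconds

# Illustrative numbers: 1e9 reads and 0.5e9 writes over 2 seconds -> 48.0 GB/s
print(dram_bw_use_gb_s(1_000_000_000, 500_000_000, 2.0))
```

Working through why each access contributes exactly 64 bytes is the point of the exercise.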

**Capacity Planning Exercise**: Imagine you are the owner of the four applications we benchmarked in the case study. The management of your company has asked you to build a small computing farm for each of those applications, with the primary goal being maximum performance (throughput). The spending budget you were given is tight but enough to buy 1 mid-level server system (Mac Studio, Supermicro/Dell/HPE server rack, etc.) or 1 high-end desktop (with overclocked CPU, liquid cooling, top GPU, fast DRAM) to run each workload, so 4 machines in total. All four could be different systems. Alternatively, you can use the money to buy 3-4 low-end systems; the choice is yours. The management wants to keep it under $10,000 per application, but they are flexible (10--20%) if you can justify the expense. Assume that Stockfish remains single-threaded. Look at the performance characteristics of the four applications once again and write down which computer parts (CPU, memory, discrete GPU if needed) you would buy for each of those workloads. Which parameters will you prioritize? Where will you spend on the most expensive part? Where can you save money? Describe it in as much detail as possible, and search the web for exact components and their prices. Account for all the components of the system: motherboard, disk drive, cooling solution, power delivery unit, rack/case/tower, etc. What additional performance experiments would you run to guide your decision?
8 changes: 3 additions & 5 deletions chapters/4-Terminology-And-Metrics/4-7 Cache miss.md
@@ -1,8 +1,6 @@


## Cache Miss

-As discussed in [@sec:MemHierar], any memory request missing in a particular level of cache must be serviced by higher-level caches or DRAM. This implies a significant increase in the latency of such memory access. The typical latency of memory subsystem components is shown in Table {@tbl:mem_latency}. There is also an [interactive view](https://colin-scott.github.io/personal_website/research/interactive_latency.html)[^1] that visualizes the latency of different operations in modern systems. Performance greatly suffers, especially when a memory request misses in the Last Level Cache (LLC) and goes all the way down to the main memory. Intel® [Memory Latency Checker](https://www.intel.com/software/mlc)[^2] (MLC) is a tool used to measure memory latencies and bandwidth and how they change with increasing load on the system. MLC is useful for establishing a baseline for the system under test and for performance analysis. We will use this tool when we talk about memory latency and bandwidth in [@sec:MemLatBw].
+As discussed in [@sec:MemHierar], any memory request missing in a particular level of cache must be serviced by higher-level caches or DRAM. This implies a significant increase in the latency of such memory access. The typical latency of memory subsystem components is shown in Table {@tbl:mem_latency}. There is also an [interactive view](https://colin-scott.github.io/personal_website/research/interactive_latency.html)[^1] that visualizes the latency of different operations in modern systems. Performance greatly suffers when a memory request misses in the Last Level Cache (LLC) and goes all the way down to the main memory.

-------------------------------------------------
Memory Hierarchy Component Latency (cycle/time)
@@ -34,7 +32,7 @@ $ perf stat -e mem_load_retired.fb_hit,mem_load_retired.l1_miss,
546230 mem_inst_retired.all_loads
```

-Above is the breakdown of all loads for the L1 data cache and fill buffers. A load might either hit the already allocated fill buffer (`fb_hit`), or hit the L1 cache (`l1_hit`), or miss both (`l1_miss`), thus `all_loads = fb_hit + l1_hit + l1_miss`. We can see that only 3.5% of all loads miss in the L1 cache, thus the *L1 hit rate* is 96.5%.
+Above is the breakdown of all loads for the L1 data cache and fill buffers. A load might either hit the already allocated fill buffer (`fb_hit`), or hit the L1 cache (`l1_hit`), or miss both (`l1_miss`), thus `all_loads = fb_hit + l1_hit + l1_miss`.[^2] We can see that only 3.5% of all loads miss in the L1 cache, thus the *L1 hit rate* is 96.5%.
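The hit-rate arithmetic can be sketched as follows. Only `all_loads` is fully visible in the output above, so the individual counter values below are hypothetical, chosen to be consistent with the stated ~3.5% miss rate:

```python
# Illustrative L1 hit-rate calculation from the three perf counters.
# Counter values are hypothetical (only all_loads appears in full in the output above).
fb_hit, l1_hit, l1_miss = 8_000, 519_000, 19_000

all_loads = fb_hit + l1_hit + l1_miss        # loads are partitioned across the three events
l1_hit_rate = (fb_hit + l1_hit) / all_loads  # loads serviced without an L1 miss

print(f"L1 hit rate: {l1_hit_rate:.1%}")     # ~96.5% for these sample counts
```

A fill-buffer hit counts toward the hit rate here because the requested line is already on its way into L1 when the load arrives.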

We can further break down L1 data misses and analyze L2 cache behavior by running:

@@ -49,4 +47,4 @@ $ perf stat -e mem_load_retired.l1_miss,
From this example, we can see that 37% of loads that missed in the L1 D-cache also missed in the L2 cache, thus the *L2 hit rate* is 63%. A breakdown for the L3 cache can be made similarly.
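The same calculation carries down one level. In this sketch the counter values are hypothetical, picked to reproduce the 37%/63% split described in the text:

```python
# Illustrative L2 hit-rate calculation, mirroring the L1 example.
# Counter values are hypothetical, chosen to match the ~37% L2 miss rate above.
l1_miss = 19_000   # demand loads that missed the L1 D-cache
l2_miss = 7_030    # of those, loads that also missed the L2 cache

l2_hit_rate = 1 - l2_miss / l1_miss

print(f"L2 hit rate: {l2_hit_rate:.0%}")  # ~63% for these sample counts
```

Note that the denominator is L1 misses, not all loads: each cache level's hit rate is conventionally measured against the traffic that actually reaches it.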

[^1]: Interactive latency - [https://colin-scott.github.io/personal_website/research/interactive_latency.html](https://colin-scott.github.io/personal_website/research/interactive_latency.html)
-[^2]: Memory Latency Checker - [https://www.intel.com/software/mlc](https://www.intel.com/software/mlc)
+[^2]: Careful readers may notice a discrepancy in the numbers: `fb_hit + l1_hit + l1_miss = 545,820`, which doesn't exactly match `all_loads`. We did not investigate this since the numbers are very close.
