[Chapter 8] Cosmetic fixes in memory prefetching
dendibakh committed Apr 5, 2024
1 parent e960ab1 commit e9e5cf8
Showing 3 changed files with 13 additions and 11 deletions.
24 changes: 13 additions & 11 deletions chapters/8-Optimizing-Memory-Accesses/8-6 Memory Prefetching.md

## Explicit Memory Prefetching {#sec:memPrefetch}

By now, you should know that memory accesses that are not resolved from caches are often very expensive. Modern CPUs try very hard to lower the penalty of cache misses by predicting which memory locations a program will access in the future and prefetching them ahead of time. If the requested memory location is not in the cache at the time the program demands it, we will suffer the cache miss penalty since we have to go to DRAM and fetch the data anyway. But if we manage to bring that memory location into the cache in time, or if the request was predicted and the data is already underway, then the penalty of a cache miss will be much lower.

Modern CPUs have two mechanisms for tackling that problem: hardware prefetching and OOO execution. HW prefetchers help to hide the memory access latency by initiating prefetch requests on repetitive memory access patterns, while the OOO engine looks N instructions into the future and issues loads early to enable smooth execution of future instructions that will demand this data.

HW prefetchers fail when data access patterns are too complicated to predict, and there is nothing SW developers can do about it since we cannot control the behavior of this unit. On the other hand, the OOO engine does not try to predict the memory locations that will be needed in the future, as HW prefetching does. So, the only measure of success for it is how much latency it was able to hide by scheduling the loads in advance.

Consider a small snippet of code in [@lst:MemPrefetch1], where `arr` is an array of one million integers. The index `idx`, which is assigned a random value, is immediately used to access a location in `arr`, which almost certainly misses in caches since it is random. It is impossible for a HW prefetcher to predict this access pattern because every time the load goes to a completely new place in memory. The interval from the time the address of a memory location is known (returned from the function `random_distribution`) until the value of that memory location is demanded (the call to `doSomeExtensiveComputation`) is called the *prefetching window*. In this example, the OOO engine doesn't have the opportunity to issue the load early since the prefetching window is very small. As a result, the latency of the memory access `arr[idx]` stands on the critical path while executing the loop, as shown in Figure @fig:SWmemprefetch1. You can see that the program waits for the value to come back (hatched rectangle) without making forward progress.

You're probably thinking: "but the next iteration of the loop should start executing speculatively in parallel". That's true, and indeed, it is reflected in Figure @fig:SWmemprefetch1. The `doSomeExtensiveComputation` function requires a lot of work, and when execution gets closer to the end of the first iteration, a CPU speculatively starts executing instructions from the second iteration. This creates a positive overlap in the execution between iterations. In fact, we presented an optimistic scenario where the processor was able to generate the next random number and issue a load in parallel with the previous iteration of the loop. However, the CPU wasn't able to fully hide the latency of the load, because it cannot look that far ahead of the current execution to issue the load early enough. Maybe future processors will have more powerful OOO engines, but for now, there are cases where a programmer's intervention is needed.

Listing: Random number feeds a subsequent load.

~~~~ {#lst:MemPrefetch1 .cpp}
for (int i = 0; i < N; ++i) {
  size_t idx = random_distribution(generator);
  int x = arr[idx]; // cache miss
  doSomeExtensiveComputation(x);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
![Execution timeline that shows the load latency standing on a critical path.](../../img/memory-access-opts/SWmemprefetch1.png){#fig:SWmemprefetch1 width=80%}
Luckily, it's not a dead end, as there is a way to speed up this code by fully overlapping the load with the execution of `doSomeExtensiveComputation`, which will hide the latency of a cache miss. We can achieve this with techniques called *software pipelining* and *explicit memory prefetching*. An implementation of this idea is shown in [@lst:MemPrefetch2]: we pipeline the generation of random numbers and start prefetching the memory location for the next iteration in parallel with `doSomeExtensiveComputation`.
Listing: Utilizing Explicit Software Memory Prefetching hints.
~~~~ {#lst:MemPrefetch2 .cpp}
size_t idx = random_distribution(generator);
for (int i = 0; i < N; ++i) {
  int x = arr[idx];
  idx = random_distribution(generator);
  // prefetch the element for the next iteration
  __builtin_prefetch(&arr[idx]);
  doSomeExtensiveComputation(x);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A graphical illustration of this transformation is shown in Figure @fig:SWmemprefetch2. We utilized software pipelining to generate the random number for the next iteration. In other words, on iteration `M`, we produce a random number that will be consumed on iteration `M+1`. This enables us to issue the memory request early since we already know the next index in the array. This transformation makes our prefetching window much larger and fully hides the latency of a cache miss. On iteration `M+1`, the actual load has a very high chance to hit caches, because the data was prefetched on iteration `M`.

![Hiding the cache miss latency by overlapping it with other execution.](../../img/memory-access-opts/SWmemprefetch2.png){#fig:SWmemprefetch2 width=80%}
Notice the usage of [`__builtin_prefetch`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html),[^4] a special hint that developers can use to explicitly request a CPU to prefetch a certain memory location. Another option is to use compiler intrinsics: for example, on x86 platforms there is the `_mm_prefetch` intrinsic; see the Intel Intrinsics Guide for more details. In either case, the compiler will generate the `PREFETCH` instruction for x86 and the `pld` instruction for ARM. For some platforms, the compiler may skip inserting an instruction, so it is a good idea to check the generated machine code.
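As a rough illustration, here is a minimal sketch that shows both forms of the hint side by side; the function name and the choice of the `_MM_HINT_T0` locality hint are arbitrary for this example.

~~~~ {.cpp}
#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch, x86 only

void prefetchElement(const int* arr, size_t idx) {
  // Portable GCC/Clang builtin; compiles to PREFETCH on x86 and pld on ARM.
  __builtin_prefetch(&arr[idx]);
  // Equivalent x86 intrinsic; the second argument selects the target cache level.
  _mm_prefetch(reinterpret_cast<const char*>(&arr[idx]), _MM_HINT_T0);
}
~~~~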
There are situations when SW memory prefetching is not possible. For example, when traversing a linked list, the prefetching window is tiny and it is not possible to hide the latency of pointer chasing.
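To see why, consider a hypothetical sketch of such a traversal: the address of the next node becomes known only after the current node has been loaded, so there is no point far enough in advance at which a useful prefetch hint could be issued.

~~~~ {.cpp}
struct Node {
  int   value;
  Node* next;
};

long sumList(const Node* head) {
  long sum = 0;
  for (const Node* n = head; n != nullptr; n = n->next) {
    // n->next is available only after the (potentially missing) load of n,
    // so a prefetch hint here cannot be issued early enough to hide the miss.
    sum += n->value;
  }
  return sum;
}
~~~~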
In [@lst:MemPrefetch2] we saw an example of prefetching for the next iteration, but you may also frequently encounter a need to prefetch for 2, 4, 8, and sometimes even more iterations ahead. The code in [@lst:MemPrefetch3] is one of those cases where it can be beneficial. It presents typical code for populating a graph with edges. If the graph is very sparse and has a lot of vertices, it is very likely that accesses to the `this->out_neighbors` and `this->in_neighbors` vectors will frequently miss in caches. This happens because every edge is likely to connect new vertices that are not currently in caches.
This code is different from the previous example as there are no extensive computations on every iteration, so the penalty of cache misses likely dominates the latency of each iteration. But we can leverage the fact that we know all the elements that will be accessed in the future. The elements of the vector `edges` are accessed sequentially and thus are likely to be brought to the L1 cache in time by the HW prefetcher. Our goal here is to overlap the latency of a cache miss with executing enough iterations to completely hide it.
As a general rule, for a prefetch hint to be effective, it must be inserted well ahead of time so that by the time the loaded value is used in other calculations, it is already in the cache. However, it also shouldn't be inserted too early since it may pollute the cache with data that is not used for a long time. Notice that in [@lst:MemPrefetch3], `lookAhead` is a template parameter, which makes it possible to try different values and see which gives the best performance. More advanced users can try to estimate the prefetching window using the method described in [@sec:timed_lbr]; an example of using this method can be found on the Easyperf blog. [^5]
Listing: Example of SW prefetching for the next 8 iterations.
~~~~ {#lst:MemPrefetch3 .cpp}
template <int lookAhead = 8>
void Graph::update(const std::vector<Edge>& edges) {
  int i = 0;
  for (; i + lookAhead < (int)edges.size(); i++) {
    // prefetch the adjacency vectors needed lookAhead iterations from now
    __builtin_prefetch(&this->out_neighbors[edges[i + lookAhead].from]);
    __builtin_prefetch(&this->in_neighbors[edges[i + lookAhead].to]);
    this->out_neighbors[edges[i].from].push_back(edges[i].to);
    this->in_neighbors[edges[i].to].push_back(edges[i].from);
  }
  // remaining iterations without prefetching
  for (; i < (int)edges.size(); i++) {
    this->out_neighbors[edges[i].from].push_back(edges[i].to);
    this->in_neighbors[edges[i].to].push_back(edges[i].from);
  }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Explicit memory prefetching is most frequently used in loops, but one can also insert those hints in the parent function; again, it all depends on the available prefetching window.
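For instance, here is a hypothetical sketch of hoisting the hint into a caller; the types, function names, and the amount of intervening work are made up for illustration and assume the callee is known to touch `obj->data` soon after being invoked.

~~~~ {.cpp}
struct Object { int data[1024]; };

void doUnrelatedWork();            // assumed to provide enough work to cover the latency
void processObject(Object* obj);   // defined elsewhere; reads obj->data

void caller(Object* obj) {
  __builtin_prefetch(obj->data);   // issue the hint from the parent function
  doUnrelatedWork();               // prefetching window: work that overlaps with the memory access
  processObject(obj);              // by now obj->data is likely already in caches
}
~~~~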
This technique is a powerful weapon; however, it should be used with extreme care as it is not easy to get right. First of all, explicit memory prefetching is not portable, meaning that if it gives performance gains on one platform, it doesn't guarantee similar speedups on another platform. It is very implementation-specific and platforms are not required to honor those hints; in such a case it will likely degrade performance. My recommendation would be to verify that the impact is positive with all available tools. Not only check the performance numbers, but also make sure that the number of cache misses (L3 in particular) went down. Once the change is committed into the code base, monitor performance on all the platforms that you run your application on, as it could be very sensitive to changes in the surrounding code. Consider dropping the idea if the benefits do not outweigh the potential maintenance burden.
Binary file modified img/memory-access-opts/SWmemprefetch1.png
Binary file modified img/memory-access-opts/SWmemprefetch2.png
