[Chapter8] Wrote about mem bw limitations
dendibakh committed Mar 21, 2024
1 parent 4086272 commit 4bd313a
Showing 2 changed files with 14 additions and 3 deletions.
2 changes: 1 addition & 1 deletion chapters/3-CPU-Microarchitecture/3-6 Memory Hierarchy.md
@@ -79,7 +79,7 @@ Hardware prefetchers observe the behavior of a running application and initiate

Software memory prefetching complements prefetching done by HW. Developers can specify which memory locations are needed ahead of time via dedicated HW instructions (see [@sec:memPrefetch]). Compilers can also automatically add prefetch instructions into the code to request data before it is required. Prefetch techniques need to balance demand and prefetch requests to guard against prefetch traffic slowing down demand traffic.

### Main Memory
### Main Memory {#sec:UarchMainmemory}

Main memory is the next level of the hierarchy, downstream from the caches. Requests to load and store data are initiated by the Memory Controller Unit (MCU). In the past, this circuit was located in the north bridge chip on the motherboard. But nowadays, most processors have this component embedded, so the CPU has a dedicated memory bus connecting it to main memory.

@@ -1,4 +1,15 @@
## Working Around Memory Bandwidth Limitations

[TODO]: Discuss what to do when memory bandwidth is a limitation
Use smaller data types, e.g. fp16, or buy a better HW.
As discussed in [@sec:UarchMainmemory], a processor gets data from memory through the memory bus. With DDR5 memory running at 6400 MT/s, the maximum theoretical memory bandwidth is 51.2 GB/s per channel. Modern systems have multiple memory channels; for example, a typical laptop usually has two memory channels, while server systems can have from 4 to 12 channels. It may seem that even a laptop can move a lot of data back and forth each second, but in reality, memory bandwidth becomes a limitation in many applications.
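A quick back-of-the-envelope check of that number: with a 64-bit (8-byte) data bus per channel, a DDR5-6400 module transfers

$$51.2~GB/s = 6400 \times 10^{6}~\textrm{transfers/s} \times 8~\textrm{bytes/transfer}$$

so a two-channel laptop tops out at roughly 102 GB/s, and an 8-channel server at roughly 410 GB/s, assuming all channels are populated and run at that speed.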

We should keep in mind that memory channels are shared between all the cores in a system. Once many cores engage in memory-intensive activity, the traffic flowing through the memory bus becomes congested, which may lead to increased wait times for memory requests. Modern systems are designed to accommodate multiple memory-demanding threads working at the same time, so it is usually not possible to saturate the memory bandwidth with just a single HW thread; however, on some machine configurations, it can be done with just 2-4 threads. Emerging AI workloads are known to be extremely "memory hungry" and highly parallel, so memory bandwidth is frequently their number one bottleneck.
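To make this concrete, the sketch below (my own illustration, not code from the book) shows a STREAM-like triad loop parallelized with OpenMP. Each iteration does one multiply-add but moves 24 bytes of explicit traffic, so a handful of threads running this loop over arrays much larger than the last-level cache is typically enough to drive DRAM bandwidth to its limit; the array size and iteration count are arbitrary.

```cpp
#include <cstddef>
#include <vector>

// STREAM-like "triad": a[i] = b[i] + scalar * c[i].
// Almost no arithmetic per element, so throughput is dictated by how fast
// the memory subsystem can stream the three arrays, not by the ALUs.
void triad(double* a, const double* b, const double* c,
           double scalar, std::size_t n) {
  #pragma omp parallel for
  for (std::size_t i = 0; i < n; ++i)
    a[i] = b[i] + scalar * c[i];
}

int main() {
  // ~134M doubles (~1 GB) per array: far larger than any CPU cache,
  // so every access has to go out to DRAM.
  const std::size_t n = std::size_t(1) << 27;
  std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
  for (int iter = 0; iter < 10; ++iter)
    triad(a.data(), b.data(), c.data(), 3.0, n);
  return 0;
}
```

Compiling with `-O3 -fopenmp` and rerunning with `OMP_NUM_THREADS` set to 1, 2, 4, and so on, while watching the achieved DRAM bandwidth, shows how quickly a few threads reach the plateau described above.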

The first step in addressing memory bandwidth limitations is to determine the maximum theoretical and expected memory bandwidth. The theoretical maximum can be calculated from the memory technology specifications, as we have shown in [@sec:roofline]. The expected memory bandwidth can be measured using tools like Intel Memory Latency Checker or `lmbench`, which we discussed in [@sec:MemLatBw]. Intel VTune can automatically measure the maximum attainable memory bandwidth of a machine before the analyzed application starts.

The second step is to measure the memory bandwidth utilization while your application is running. If the amount of memory traffic is close to the maximum measured bandwidth, then the performance of your application is likely bound by memory bandwidth. It is a good idea to plot memory bandwidth utilization over time to see if there are distinct phases where memory intensity spikes or dips. Intel VTune can provide such a chart if you tick the "Evaluate max DRAM bandwidth" checkbox in the analysis configuration.

If you have determined that your application is memory bandwidth bound, the first suggestion is to see if you can decrease its memory intensity. This is not always possible, but you can consider disabling some memory-hungry features of your application, recomputing data on the fly instead of caching results, or compressing your data. In the AI space, most Large Language Models (LLMs) are supplied in fp32 precision, which means that each parameter takes 4 bytes. The biggest performance gain can be achieved with quantization techniques, which reduce the precision of the parameters to fp16 or int8, cutting memory traffic by 2x or 4x, respectively. Sometimes, 5-bit and even 4-bit quantization is used, all to reduce memory traffic and strike the right balance between inference performance and quality.
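As a rough illustration of where the traffic reduction comes from, here is a minimal sketch of symmetric per-tensor int8 quantization (my own example, not taken from any particular framework; production schemes add per-channel scales, zero points, and outlier handling). Each weight is stored in 1 byte instead of 4, so streaming the weights from DRAM during inference moves 4x fewer bytes, and the fp32 value is reconstructed in registers.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantized weights: 1 byte per parameter plus a single fp32 scale.
struct QuantizedTensor {
  std::vector<int8_t> data;
  float scale;  // multiply an int8 value by this to approximate the original fp32
};

// Symmetric per-tensor quantization of fp32 weights to int8.
QuantizedTensor quantize_int8(const std::vector<float>& w) {
  float max_abs = 0.0f;
  for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
  float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

  QuantizedTensor q{std::vector<int8_t>(w.size()), scale};
  for (std::size_t i = 0; i < w.size(); ++i)
    q.data[i] = static_cast<int8_t>(std::lround(w[i] / scale));
  return q;
}

// Dequantize on the fly: the expensive DRAM transfer moved only 1 byte.
inline float dequantize(const QuantizedTensor& q, std::size_t i) {
  return q.data[i] * q.scale;
}
```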

It is important to mention that for workloads that saturate the available memory bandwidth, code optimizations do not play as big a role as they do for compute-bound workloads. For compute-bound applications, optimizations like vectorization usually translate into large performance gains. For memory-bound workloads, however, vectorization may not have a similar effect, since the processor cannot make forward progress simply because it lacks data to work with. We cannot make the memory bus run faster, which is why memory bandwidth is often a hard limitation to overcome.

Finally, if all the options have been exhausted and memory bandwidth is still a limitation, the only way to improve the situation is to buy better hardware. You can invest in a server with more memory channels or in DRAM modules with faster transfer speeds. This could be an expensive, but still viable, option to speed up your application.
