## Working Around Memory Bandwidth Limitations
As discussed in [@sec:UarchMainmemory], a processor gets data from memory through a memory bus. With the latest DDR5 memory technology, the maximum theoretical memory bandwidth is 51.2 GB/s per channel. Modern systems have multiple memory channels: a typical laptop has two, while server systems can have from 4 to 12. It may seem that even a laptop can move a lot of data back and forth each second, but in reality, memory bandwidth becomes a limitation in many applications.
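To see where the per-channel number comes from, assume DDR5-6400 memory, i.e., 6400 mega-transfers per second over a 64-bit (8-byte) channel:

$$6400 \cdot 10^{6} \; \frac{\text{transfers}}{\text{s}} \times 8 \; \text{bytes} = 51.2 \; \text{GB/s per channel}$$

A dual-channel laptop with such memory thus tops out at 102.4 GB/s in theory.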
We should keep in mind that memory channels are shared among all the cores in a system. Once many cores engage in memory-intensive activity, the traffic flowing through the memory bus can become congested, which may lead to increased wait times for memory requests. Modern systems are designed to accommodate multiple memory-demanding threads working at the same time, so it is usually not possible to saturate the memory bandwidth with a single HW thread; however, on some machine configurations, it is possible with just 2-4 threads. Emerging AI workloads are known to be extremely "memory hungry" and highly parallel, so memory bandwidth is the number one bottleneck for them.
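A quick way to observe this on your own machine is to run a STREAM-like kernel on an increasing number of threads and note when the aggregate bandwidth stops scaling. Below is a minimal, illustrative sketch (not a calibrated benchmark; the array sizes, thread partitioning, and the triad kernel itself are our assumptions):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Each iteration reads b[i] and c[i] (16 bytes) and writes a[i] (8 bytes).
static void triad(double* a, const double* b, const double* c, size_t n) {
  for (size_t i = 0; i < n; ++i)
    a[i] = b[i] + 3.0 * c[i];
}

int main() {
  constexpr size_t N = size_t{1} << 24;  // 16M doubles per array, ~128 MB each
  std::vector<double> a(N), b(N, 1.0), c(N, 2.0);

  for (unsigned t = 1; t <= std::thread::hardware_concurrency(); t *= 2) {
    const size_t chunk = N / t;  // t is a power of two, so it divides N evenly
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < t; ++i)
      pool.emplace_back(triad, a.data() + i * chunk, b.data() + i * chunk,
                        c.data() + i * chunk, chunk);
    for (auto& th : pool) th.join();
    double sec = std::chrono::duration<double>(
                     std::chrono::steady_clock::now() - start).count();
    // Counts 24 bytes of traffic per element; actual DRAM traffic can be
    // higher due to write-allocate on the stores to a[].
    printf("%2u threads: %6.1f GB/s\n", t, 24.0 * N / sec / 1e9);
  }
  return 0;
}
```

Once the printed number plateaus, often after only a few threads, adding more threads to a memory-bound phase will not make it faster.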
The first step in addressing memory bandwidth limitations is to determine the maximum theoretical and expected memory bandwidth. The theoretical maximum can be calculated from the memory technology specifications, as we have shown in [@sec:roofline]. The expected memory bandwidth can be measured with tools like Intel Memory Latency Checker or `lmbench`, which we discussed in [@sec:MemLatBw]. Intel VTune can also measure the maximum memory bandwidth automatically before the analyzed application starts.
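For instance, one might run the following commands (a sketch only; exact option names may vary between tool versions):

```bash
# Intel Memory Latency Checker: measure peak bandwidth for several
# read/write mixes; run as root so it can control HW prefetchers.
sudo ./mlc --max_bandwidth

# lmbench: read bandwidth over a 128 MB buffer.
bw_mem 128m rd
```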
The second step is to measure the memory bandwidth utilization while your application is running. If the amount of memory traffic is close to the maximum measured bandwidth, the performance of your application is likely bound by memory bandwidth. It is a good idea to plot the memory bandwidth utilization over time to see whether there are phases where memory intensity spikes or dips. Intel VTune can provide such a chart if you tick the "Evaluate max DRAM bandwidth" checkbox in the analysis configuration.
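As a hypothetical example, if the tool reports a sustained 35 GB/s of DRAM traffic on a machine whose measured peak is 40 GB/s, the application is running at

$$\frac{35\;\text{GB/s}}{40\;\text{GB/s}} \approx 88\%$$

of the available bandwidth during that phase, a strong indication that it is bandwidth bound.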
If you have determined that your application is memory bandwidth bound, the first suggestion is to see if you can decrease its memory intensity. This is not always possible, but you can consider disabling some memory-hungry features, recomputing data on the fly instead of caching results, or compressing your data. In the AI space, most Large Language Models (LLMs) are supplied in fp32 precision, which means that each parameter takes 4 bytes. The biggest performance gains come from quantization techniques, which reduce the precision of the parameters to fp16 or int8 and thereby cut memory traffic by 2x or 4x, respectively. Sometimes, 4-bit and even 5-bit quantization is used, all to reduce memory traffic and strike the right balance between inference performance and quality.
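For a sense of scale, a hypothetical 7-billion-parameter model occupies 28 GB in fp32 but only 7 GB in int8, so every pass over the weights moves 4x less data. The sketch below shows the core idea of symmetric, per-tensor int8 quantization; production frameworks add per-channel scales, calibration, and other refinements:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
  std::vector<int8_t> data;  // quantized weights, 1 byte per parameter
  float scale;               // dequantized value = data[i] * scale
};

// Map fp32 weights onto [-127, 127] with a single per-tensor scale.
QuantizedTensor quantize_int8(const std::vector<float>& w) {
  float max_abs = 0.0f;
  for (float v : w) max_abs = std::max(max_abs, std::fabs(v));

  QuantizedTensor q;
  q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
  q.data.reserve(w.size());
  for (float v : w)
    q.data.push_back(static_cast<int8_t>(std::lround(v / q.scale)));
  return q;
}
```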
It is important to mention that for workloads that saturate the available memory bandwidth, code optimizations do not play as big a role as they do for compute-bound workloads. For compute-bound applications, optimizations like vectorization usually translate into large performance gains. For memory-bound workloads, however, vectorization may not have the same effect, since the processor cannot make forward progress: it simply lacks the data to work with. We cannot make the memory bus run faster; this is why memory bandwidth is often a hard limitation to overcome.
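The roofline model from [@sec:roofline] makes this concrete. The triad kernel shown earlier performs 2 floating-point operations per 24 bytes of memory traffic, an arithmetic intensity of about 0.08 FLOP/byte. On a machine with, say, 100 GB/s of memory bandwidth, its performance is therefore capped at

$$100\;\text{GB/s} \times \frac{2\;\text{FLOPs}}{24\;\text{bytes}} \approx 8.3\;\text{GFLOP/s},$$

no matter how wide the vector units are.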
Finally, if all the options have been exhausted and memory bandwidth is still a limitation, the only way to improve the situation is to buy better hardware. You can invest in a server with more memory channels or in DRAM modules with a faster transfer rate. This could be an expensive, but still viable, option to speed up your application.