Showing 1 changed file with 5 additions and 7 deletions.
chapters/6-CPU-Features-For-Performance-Analysis/6-9 Chapter Summary.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,15 +1,13 @@ | ||
|
||
|
||
## Chapter Summary {.unlisted .unnumbered} | ||
|
||
* Utilizing hardware features for low-level tuning is recommended only once all high-level performance issues are fixed. Tuning poorly designed algorithms is a bad investment of time. Once all the major performance problems get eliminated, you can use CPU performance monitoring features to analyze and further tune your application. | ||
* Utilizing hardware features for low-level tuning is recommended only once all high-level performance issues are fixed. Tuning poorly designed algorithms is a bad investment of time. Once all the major performance problems are eliminated, you can use CPU performance monitoring features to analyze and further tune your application. | ||
* Top-down Microarchitecture Analysis (TMA) methodology is a very powerful technique for identifying inefficient use of the CPU microarchitecture by a program. It is a robust and formal methodology that is easy to use even for inexperienced developers. TMA is an iterative process that consists of multiple steps, including characterizing the workload and locating the exact place in the source code where the bottleneck occurs. We advise making TMA one of the starting points for every low-level tuning effort. | ||
* Branch Record mechanisms such as Intel's LBR, AMD's LBR, and ARM's BRBE continuously log the most recent branch outcomes in parallel with executing the program, causing a minimal slowdown. One of the primary usages of these facilities is to collect call stacks. Also, they help identify hot branches, misprediction rates and enable precise timing of machine code. | ||
* Modern processors often provide Hardware-Based Sampling features for advanced profiling. Such features lower the sampling overhead by storing multiple samples to a dedicated buffer without software interrupts. They also introduce "Precise Events" that enable pinpointing the exact instruction that caused a particular performance event. In addition, there are several other less important use cases. Example implementations of such Hardware-Based Sampling features include Intel's PEBS, AMD's IBS, and ARM's SPE. | ||
* Intel Processor Traces (PT) is a CPU feature that records the program execution by encoding packets in a highly compressed binary format that can be used to reconstruct execution flow with a timestamp on every instruction. PT has extensive coverage and relatively small overhead. Its main usages are postmortem analysis and finding the root cause(s) of performance glitches. Intel PT feature is covered in Appendix D. Processors based on ARM architecture also have a tracing capability called ARM [CoreSight](https://developer.arm.com/ip-products/system-ip/coresight-debug-and-trace),[^2] but it is mostly used for debugging rather than for performance analysis. | ||
* Branch Record mechanisms such as Intel's LBR, AMD's LBR, and ARM's BRBE continuously log the most recent branch outcomes in parallel with executing the program, causing a minimal slowdown. One of the primary uses of these facilities is to collect call stacks. They also help identify hot branches and misprediction rates, and enable precise timing of machine code. | ||
* Modern processors often provide Hardware-Based Sampling features for advanced profiling. Such features lower the sampling overhead by storing multiple samples in a dedicated buffer without software interrupts. They also introduce "Precise Events" that enable pinpointing the exact instruction that caused a particular performance event. In addition, there are several other less important use cases. Example implementations of such Hardware-Based Sampling features include Intel's PEBS, AMD's IBS, and ARM's SPE. | ||
* Intel Processor Trace (PT) is a CPU feature that records the program execution by encoding packets in a highly compressed binary format that can be used to reconstruct execution flow with a timestamp on every instruction. PT has extensive coverage and a relatively small overhead. Its main uses are postmortem analysis and finding the root cause(s) of performance glitches. The Intel PT feature is covered in Appendix D. Processors based on ARM architecture also have a tracing capability called ARM [CoreSight](https://developer.arm.com/ip-products/system-ip/coresight-debug-and-trace),[^2] but it is mostly used for debugging rather than for performance analysis. | ||
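
The workload characterization step of TMA mentioned in this summary can be illustrated with a minimal sketch. It assumes the commonly published level-1 formulas for older Intel cores with a 4-wide pipeline; the function name, its parameter names, and the counter mapping in the comments are illustrative assumptions, and newer processors may expose these metrics differently.

```python
def tma_level1(cycles, uops_not_delivered, uops_issued,
               uops_retired_slots, recovery_cycles, pipeline_width=4):
    """Split pipeline slots into the four top-level TMA categories.

    Assumed counter mapping (Intel naming, may differ per CPU generation):
      cycles              ~ CPU_CLK_UNHALTED.THREAD
      uops_not_delivered  ~ IDQ_UOPS_NOT_DELIVERED.CORE
      uops_issued         ~ UOPS_ISSUED.ANY
      uops_retired_slots  ~ UOPS_RETIRED.RETIRE_SLOTS
      recovery_cycles     ~ INT_MISC.RECOVERY_CYCLES
    """
    slots = pipeline_width * cycles
    if slots <= 0:
        raise ValueError("cycles and pipeline_width must be positive")

    frontend_bound = uops_not_delivered / slots
    # Slots wasted on uops that were issued but never retired,
    # plus slots lost while recovering from a misprediction.
    bad_speculation = (uops_issued - uops_retired_slots
                       + pipeline_width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever remains is attributed to the back end.
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)

    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    breakdown = tma_level1(cycles=1000, uops_not_delivered=800,
                           uops_issued=2200, uops_retired_slots=2000,
                           recovery_cycles=50)
    for category, fraction in breakdown.items():
        print(f"{category}: {fraction:.1%}")
```

The category with the largest share is the one to drill into at the next TMA level; in practice a profiler collects the counters and performs this arithmetic for you.
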
|
||
Performance profilers leverage hardware features presented in this chapter to enable many different types of analysis. | ||
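
As one example of such an analysis, branch records can be aggregated into per-branch misprediction rates. The sketch below assumes a simplified, hypothetical record layout (source address, target address, mispredicted flag); real LBR or BRBE entries carry more fields and are normally decoded by a profiler rather than by hand.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class BranchRecord(NamedTuple):
    """Hypothetical, simplified view of one branch record entry."""
    source: int        # address of the branch instruction
    target: int        # address the branch jumped to
    mispredicted: bool  # True if the predictor guessed wrong


def misprediction_rates(records: Iterable[BranchRecord]) -> dict:
    """Return {branch source address: fraction of mispredicted samples}."""
    seen = defaultdict(int)
    missed = defaultdict(int)
    for record in records:
        seen[record.source] += 1
        if record.mispredicted:
            missed[record.source] += 1
    return {source: missed[source] / seen[source] for source in seen}


if __name__ == "__main__":
    # Made-up samples, for illustration only.
    samples = [
        BranchRecord(0x400, 0x480, False),
        BranchRecord(0x400, 0x480, True),
        BranchRecord(0x400, 0x480, False),
        BranchRecord(0x400, 0x480, False),
        BranchRecord(0x500, 0x520, False),
        BranchRecord(0x500, 0x520, False),
    ]
    for source, rate in misprediction_rates(samples).items():
        print(f"branch at {source:#x}: {rate:.0%} mispredicted")
```

Counting how often each source address appears in the sampled records also gives a rough picture of which branches are hot.
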
|
||
[^2]: ARM CoreSight - [https://developer.arm.com/ip-products/system-ip/coresight-debug-and-trace](https://developer.arm.com/ip-products/system-ip/coresight-debug-and-trace) | ||
|
||
\sectionbreak | ||
\sectionbreak |