#8 - Add support for multiple metrics in @nsight.analyze.kernel #9
Merged: Alok-Joshi merged 14 commits into NVIDIA:main from ConvolutedDog:enhance-multiple-metrics on Dec 16, 2025
Commits (14):
- 468c835 #8 - Add support for multiple metrics in @nsight.analyze.kernel (ConvolutedDog)
- e177bd5 [Feat] Support multiple metrics and improve data aggregation (ConvolutedDog)
- 53b6f09 fix lint (ConvolutedDog)
- 74340a4 fix doc (ConvolutedDog)
- 257acee fix (ConvolutedDog)
- 7c778cb [Feat] Explode dataframe columns with list values (ConvolutedDog)
- dc4c078 Revert "derive_metrics" to "derive_metric" (ConvolutedDog)
- d2e85c8 Revert "derive_metrics" to "derive_metric" (ConvolutedDog)
- 3652201 [RFC] Move explode_dataframe to extraction.py and fix a bug of Normal… (ConvolutedDog)
- 17121db [Test] Add normalize_against test for multiple metrics (ConvolutedDog)
- 56ad954 add test and doc (ConvolutedDog)
- 481ca2b fix (ConvolutedDog)
- f32cc7b fix doc (ConvolutedDog)
- 64df1e6 fix lint (ConvolutedDog)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| # Copyright 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| | ||
| """ | ||
| Example 8: Collecting Multiple Metrics | ||
| ======================================= | ||
| | ||
| This example shows how to collect multiple metrics in a single profiling run. | ||
| | ||
| New concepts: | ||
| - Using the `metrics` parameter to collect multiple metrics | ||
| - The `@nsight.analyze.plot` decorator does not yet support multiple metrics | ||
| """ | ||
| | ||
| import torch | ||
| | ||
| import nsight | ||
| | ||
| sizes = [(2**i,) for i in range(11, 13)] | ||
| | ||
| | ||
| @nsight.analyze.kernel( | ||
|     configs=sizes, | ||
|     runs=5, | ||
|     # Collect both shared memory load and store SASS instructions | ||
|     metrics=[ | ||
|         "smsp__sass_inst_executed_op_shared_ld.sum", | ||
|         "smsp__sass_inst_executed_op_shared_st.sum", | ||
|     ], | ||
| ) | ||
| def analyze_shared_memory_ops(n: int) -> None: | ||
|     """Analyze both shared memory load and store SASS instructions | ||
|     for different kernels. | ||
| | ||
|     Note: To evaluate multiple metrics, pass them as a sequence | ||
|     (list/tuple). All results are merged into one ProfileResults | ||
|     object, with the 'Metric' column indicating each specific metric. | ||
|     """ | ||
| | ||
|     a = torch.randn(n, n, device="cuda") | ||
|     b = torch.randn(n, n, device="cuda") | ||
|     c = torch.randn(2 * n, 2 * n, device="cuda") | ||
|     d = torch.randn(2 * n, 2 * n, device="cuda") | ||
| | ||
|     with nsight.annotate("@-operator"): | ||
|         _ = a @ b | ||
| | ||
|     with nsight.annotate("torch.matmul"): | ||
|         _ = torch.matmul(c, d) | ||
| | ||
| | ||
| def main() -> None: | ||
|     # Run analysis with multiple metrics | ||
|     results = analyze_shared_memory_ops() | ||
| | ||
|     df = results.to_dataframe() | ||
|     print(df) | ||
| | ||
|     unique_metrics = df["Metric"].unique() | ||
|     print(f"\n✓ Collected {len(unique_metrics)} metrics:") | ||
|     for metric in unique_metrics: | ||
|         print(f" - {metric}") | ||
| | ||
|     print("\n✓ Sample data:") | ||
|     print(df[["Annotation", "n", "Metric", "AvgValue"]].to_string(index=False)) | ||
| | ||
|     print("\n" + "=" * 60) | ||
|     print("IMPORTANT: @plot decorator limitation") | ||
|     print("=" * 60) | ||
|     print("When multiple metrics are collected:") | ||
|     print(" ✓ All metrics are collected in a single ProfileResults object") | ||
|     print(" ✓ DataFrame has 'Metric' column to distinguish them") | ||
|     print(" ✗ @nsight.analyze.plot decorator will RAISE AN ERROR") | ||
|     print(" Why? @plot can only visualize one metric at a time.") | ||
|     print(" Tip: Use separate @kernel functions for each metric or use") | ||
|     print(" 'derive_metric' to compute custom values.") | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     main() | ||
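
The docstring of `analyze_shared_memory_ops` notes that all results are merged into one object, with a 'Metric' column telling the metrics apart. As a minimal sketch of how such long-format rows could be split per metric (for instance, to handle each metric on its own), the snippet below uses plain Python dictionaries in place of the DataFrame. The column names follow the example's `print` call; the values and the `split_by_metric` helper are hypothetical and not part of nsight.

```python
from collections import defaultdict

LD = "smsp__sass_inst_executed_op_shared_ld.sum"
ST = "smsp__sass_inst_executed_op_shared_st.sum"

# Hypothetical rows in the long format described above; the numbers are made up.
rows = [
    {"Annotation": "@-operator", "n": 2048, "Metric": LD, "AvgValue": 100.0},
    {"Annotation": "@-operator", "n": 2048, "Metric": ST, "AvgValue": 40.0},
    {"Annotation": "torch.matmul", "n": 2048, "Metric": LD, "AvgValue": 400.0},
    {"Annotation": "torch.matmul", "n": 2048, "Metric": ST, "AvgValue": 160.0},
]


def split_by_metric(rows):
    """Group long-format rows by the value of their 'Metric' column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Metric"]].append(row)
    return dict(grouped)


per_metric = split_by_metric(rows)
for metric, metric_rows in per_metric.items():
    # Each group now holds exactly one metric.
    print(metric, [(r["Annotation"], r["n"], r["AvgValue"]) for r in metric_rows])
```

With a real DataFrame the same split would usually be a filter on `df["Metric"]`; the dictionary version only illustrates the shape of the data.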
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| # Copyright 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| | ||
| """ | ||
| Example 9: Advanced Custom Metrics from Multiple Metrics | ||
| ========================================================= | ||
| | ||
| This example shows how to compute custom metrics from multiple metrics. | ||
| | ||
| New concepts: | ||
| - Using `derive_metric` to compute custom values from multiple metrics | ||
| """ | ||
| | ||
| import torch | ||
| | ||
| import nsight | ||
| | ||
| sizes = [(2**i,) for i in range(10, 13)] | ||
| | ||
| | ||
| def compute_avg_insts( | ||
|     ld_insts: int, st_insts: int, launch_sm_count: int, n: int | ||
| ) -> float: | ||
|     """ | ||
|     Compute average shared memory load/store instructions per SM. | ||
| | ||
|     Custom metric function signature: | ||
|     - First several arguments: the measured metrics, must match the order | ||
|       of metrics in @kernel decorator | ||
|     - Remaining arguments: must match the decorated function's signature | ||
| | ||
|     In this example: | ||
|     - ld_insts: Total shared memory load instructions | ||
|       (from smsp__sass_inst_executed_op_shared_ld.sum metric) | ||
|     - st_insts: Total shared memory store instructions | ||
|       (from smsp__sass_inst_executed_op_shared_st.sum metric) | ||
|     - launch_sm_count: Number of SMs that launched blocks | ||
|       (from launch__sm_count metric) | ||
|     - n: Matches the 'n' parameter from benchmark_avg_insts(n) | ||
| | ||
|     Args: | ||
|         ld_insts: Total shared memory load instructions | ||
|         st_insts: Total shared memory store instructions | ||
|         launch_sm_count: Number of SMs that launched blocks | ||
|         n: Matrix size (n x n) - parameter from the decorated benchmark function | ||
| | ||
|     Returns: | ||
|         Average shared memory load/store instructions per SM | ||
|     """ | ||
|     insts_per_sm = (ld_insts + st_insts) / launch_sm_count | ||
|     return insts_per_sm | ||
| | ||
| | ||
| @nsight.analyze.plot( | ||
|     filename="09_advanced_metric_custom.png", | ||
|     ylabel="Average Shared Memory Load/Store Instructions per SM",  # Custom y-axis label | ||
|     annotate_points=True,  # Show values on the plot | ||
| ) | ||
| @nsight.analyze.kernel( | ||
|     configs=sizes, | ||
|     runs=10, | ||
|     derive_metric=compute_avg_insts,  # Use custom metric | ||
|     metrics=[ | ||
|         "smsp__sass_inst_executed_op_shared_ld.sum", | ||
|         "smsp__sass_inst_executed_op_shared_st.sum", | ||
|         "launch__sm_count", | ||
|     ], | ||
| ) | ||
| def benchmark_avg_insts(n: int) -> None: | ||
|     """ | ||
|     Benchmark matmul and display results. | ||
|     """ | ||
|     a = torch.randn(n, n, device="cuda") | ||
|     b = torch.randn(n, n, device="cuda") | ||
| | ||
|     with nsight.annotate("matmul"): | ||
|         _ = a @ b | ||
| | ||
| | ||
| def main() -> None: | ||
|     result = benchmark_avg_insts() | ||
|     print(result.to_dataframe()) | ||
|     print("✓ Avg Insts benchmark complete! Check '09_advanced_metric_custom.png'") | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     main() | ||
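
The docstring of `compute_avg_insts` describes a positional contract: the measured metrics come first, in the order given to the `metrics` list, followed by the arguments of the decorated function. The following is a minimal sketch of that calling convention only. The `call_derive_metric` helper and the measured values are hypothetical; how nsight actually invokes the function internally is not shown in this diff.

```python
def compute_avg_insts(ld_insts, st_insts, launch_sm_count, n):
    """Same arithmetic as in Example 9: load + store instructions per SM."""
    return (ld_insts + st_insts) / launch_sm_count


# Order matters: it defines the order of the leading positional arguments.
METRICS = [
    "smsp__sass_inst_executed_op_shared_ld.sum",
    "smsp__sass_inst_executed_op_shared_st.sum",
    "launch__sm_count",
]


def call_derive_metric(fn, metric_names, measured, config):
    """Hypothetical helper: pass metric values in list order, then the config."""
    metric_args = [measured[name] for name in metric_names]
    return fn(*metric_args, *config)


# Made-up measurements, deliberately stored in a different order than METRICS.
measured = {
    "launch__sm_count": 8,
    "smsp__sass_inst_executed_op_shared_st.sum": 1000,
    "smsp__sass_inst_executed_op_shared_ld.sum": 3000,
}

result = call_derive_metric(compute_avg_insts, METRICS, measured, (1024,))
print(result)  # (3000 + 1000) / 8
```

Reordering the entries of `METRICS` without reordering the function's parameters would silently feed the wrong value into each argument, which is why the docstring insists that the two orders match.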
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| # Copyright 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| | ||
| """ | ||
| Example 10: Multiple Kernels per Run with Combined Metrics | ||
| =========================================================== | ||
| | ||
| This example shows how to profile multiple kernels in a single run and combine their metrics. | ||
| | ||
| New concepts: | ||
| - Using `combine_kernel_metrics` to aggregate metrics from multiple kernels | ||
| - Summing metrics from consecutive kernel executions | ||
| """ | ||
| | ||
| import torch | ||
| | ||
| import nsight | ||
| | ||
| # Define configuration sizes | ||
| sizes = [(2**i,) for i in range(10, 13)] | ||
| | ||
| | ||
| @nsight.analyze.plot( | ||
|     filename="10_combine_kernel_metrics.png", | ||
|     ylabel="Total Cycles (Sum of 3 Kernels)", | ||
|     annotate_points=True, | ||
| ) | ||
| @nsight.analyze.kernel( | ||
|     configs=sizes, | ||
|     runs=7, | ||
|     combine_kernel_metrics=lambda x, y: x + y,  # Sum metrics from multiple kernels | ||
|     metrics=[ | ||
|         "sm__cycles_elapsed.avg", | ||
|     ], | ||
| ) | ||
| def benchmark_multiple_kernels(n: int) -> None: | ||
|     """ | ||
|     Benchmark three matrix multiplications in a single run. | ||
| | ||
|     Executes three matmul operations within one profiled context, | ||
|     demonstrating metric combination across kernels. | ||
| | ||
|     Args: | ||
|         n: Matrix size (n x n) | ||
|     """ | ||
|     a = torch.randn(n, n, device="cuda") | ||
|     b = torch.randn(n, n, device="cuda") | ||
| | ||
|     with nsight.annotate("test"): | ||
|         # Three consecutive kernel executions | ||
|         _ = a @ b  # Kernel 1 | ||
|         _ = a @ b  # Kernel 2 | ||
|         _ = a @ b  # Kernel 3 | ||
| | ||
| | ||
| def main() -> None: | ||
|     result = benchmark_multiple_kernels() | ||
|     print(result.to_dataframe()) | ||
|     print("\n✓ Total Cycles benchmark complete! Check '10_combine_kernel_metrics.png'") | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     main() | ||
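
Example 10 passes a two-argument callable as `combine_kernel_metrics` and labels the plot as the sum of three kernels. One plausible reading, which this diff does not confirm, is that the callable is folded pairwise over the per-kernel values inside one annotation. The sketch below illustrates that reading with `functools.reduce`; the cycle counts are made up and the fold is an assumption, not nsight's implementation.

```python
from functools import reduce

# The same combiner as in Example 10.
combine = lambda x, y: x + y

# Made-up sm__cycles_elapsed.avg values for the three matmul kernels.
per_kernel_cycles = [1200.0, 1180.0, 1210.0]

# Assumed behavior: fold the two-argument combiner over all kernels, left to right.
total_cycles = reduce(combine, per_kernel_cycles)
print(total_cycles)

# A different combiner yields a different aggregate, e.g. the slowest kernel.
slowest_kernel = reduce(max, per_kernel_cycles)
print(slowest_kernel)
```

Under this reading, any associative two-argument function (sum, `max`, `min`) gives a result that does not depend on how the kernels are grouped.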