@jamessiddeley-amd jamessiddeley-amd commented Jan 9, 2026

Motivation

Multi-pass profiling collects counters across separate runs, so derived metrics (such as A - B) can occasionally come out negative because of run-to-run variance. Negative event counts are physically impossible, yet they were appearing in L2 cache metrics, confusing users and invalidating analysis results.
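
As a small illustration of the failure mode, the sketch below subtracts two counters that were collected in different passes. All numbers are made up for illustration and do not come from real hardware.

```python
# Hypothetical per-dispatch counts. Counter A was collected in pass 1 and
# counter B in pass 2, so the two columns never observed the same execution.
a_pass1 = [1000, 2000, 500]
b_pass2 = [990, 2004, 500]  # dispatch 1 counted slightly higher in pass 2

# Derived metric A - B, computed per dispatch.
derived = [a - b for a, b in zip(a_pass1, b_pass2)]

# Indices of dispatches whose derived value is physically impossible.
negatives = [i for i, d in enumerate(derived) if d < 0]

print(derived)    # [10, -4, 0]
print(negatives)  # [1]
```

The -4 is only 0.2% of the 2000-count reference, which is the kind of small excursion that run-to-run variance can plausibly produce.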

Technical Details

Implemented a NOISE_CLAMP operation in src/utils/parser.py:

  • Always clamps negative values to 0 (eliminates impossible negative counts)
  • Warns only when the relative error is ≥ 1%, which separates genuine anomalies from normal hardware noise
  • Emits a single aggregated warning message instead of per-dispatch 'spam'
  • Applied to L2 cache metrics in 1700_l2_cache.yaml across all gfx9 architectures
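
The clamp-and-warn behavior described in the bullets above can be sketched roughly as follows. This is not the code from src/utils/parser.py: the function name `noise_clamp`, its `(values, reference)` signature, the definition of relative error as |value| / |reference|, and the choice to treat a zero reference as "no warning" are all assumptions made for illustration.

```python
import warnings

import numpy as np

NOISE_THRESHOLD = 0.01  # 1% relative error, as stated in the PR description


def noise_clamp(values, reference):
    """Hypothetical stand-in for NOISE_CLAMP: clamp negatives to 0 and
    emit one aggregated warning if any excursion reaches the threshold."""
    vals = np.atleast_1d(np.asarray(values, dtype=float))
    ref = np.broadcast_to(np.asarray(reference, dtype=float), vals.shape)

    negative = vals < 0

    # Assumed definition: relative error = |negative value| / |reference|.
    # Entries with a zero reference keep a relative error of 0, so no
    # division by zero occurs (also an assumption).
    rel_err = np.zeros_like(vals)
    safe = negative & (ref != 0)
    rel_err[safe] = np.abs(vals[safe]) / np.abs(ref[safe])

    # One warning for the whole input rather than one per element.
    flagged = int(np.count_nonzero(rel_err >= NOISE_THRESHOLD))
    if flagged:
        warnings.warn(
            f"{flagged} value(s) clamped with relative error >= "
            f"{NOISE_THRESHOLD:.0%} (max {rel_err.max():.2%})"
        )

    # Negative values are always clamped, whether or not a warning fired.
    return np.where(negative, 0.0, vals)


# Usage: -0.5 is 0.5% of the reference (silent clamp), -5.0 is 5% (warns).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = noise_clamp([10.0, -0.5, -5.0], 100.0)

print(out.tolist())  # [10.0, 0.0, 0.0]
print(len(caught))   # 1
```

Both negative inputs are clamped, but only the 5% excursion contributes to the single warning, which mirrors the split between silent noise removal and anomaly reporting.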

Affected metrics that use the new NOISE_CLAMP wrapper in the SoC YAMLs:

Metric ID   Name
17.2.2      Remote Read Traffic
17.2.6      Remote Write Traffic
17.6.1      Read (64B)
17.6.5      Remote Read
17.6.6      Write and Atomic (32B)
17.6.10     Remote Write and Atomic

JIRA ID

[SWDEV-566417]

Test Plan

Added 6 pytest cases (@pytest.mark.noise_clamp):

  1. Core clamping behavior (scalar, Series, ndarray)
  2. Zero reference edge case (no division by zero)
  3. Warning triggers above 1% threshold
  4. No warning below 1% threshold
  5. Empty input handling
  6. Boundary condition (exactly 1%)

Run with:

pytest tests/test_utils.py -m noise_clamp -v

Test Result

test_noise_clamp_clamping_behavior            PASSED
test_noise_clamp_zero_reference               PASSED
test_noise_clamp_warning_above_threshold      PASSED
test_noise_clamp_no_warning_below_threshold   PASSED
test_noise_clamp_empty_input                  PASSED
test_noise_clamp_threshold_boundary           PASSED

Validated on MI350 and MI300X hardware with the laplacian kernel workload (the workload for which the original negative-value tickets were filed). With this implementation, negative L2 cache values no longer appear, and a warning is emitted only for deviations of 1% or more. At the time of this PR, no metric on MI350 or MI300X has been observed to exceed the 1% relative-error threshold for the laplacian kernel workload.

The ctest suite is currently running.

Submission Checklist

@jamessiddeley-amd jamessiddeley-amd requested review from a team and prbasyal-amd as code owners January 9, 2026 17:06
@jamessiddeley-amd jamessiddeley-amd changed the title [rocprof-compute] Threshold Based Clamping in Analyze Stage [rocprofiler-compute] Threshold Based Clamping in Analyze Stage Jan 9, 2026
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 12, 2026