Background
The perf-regression.yaml workflow (powered by github-action-benchmark) compares each run's benchmark result against a single cached previous sample. This approach was disabled in PR #7307 after it fired a false regression alert for BenchmarkRunEnumeration/Multiproto (432,036 allocs/op vs 201,337 allocs/op, ratio 2.15×), exceeding the alert threshold.
Root Cause
Single-sample cached comparisons are highly sensitive to noise:
- RunEnumeration/Multiproto measured locally at 1,174,753 allocs/op (-benchtime=1x) vs 1,047,282 allocs/op (-benchtime=10x).
- RunEnumeration/Default measured at ~54.4 M allocs/op (-benchtime=1x) vs ~25.6 M allocs/op (-benchtime=10x).
The metric is dominated by how many iterations the harness happens to run, setup/teardown overhead, and cached outliers — not real regressions in the code.
Proposed Solution
Replace the current single-sample approach with a statistically sound methodology:
- Collect multiple samples — run each benchmark suite with -count=N (e.g., -count=10) on both the base commit and the head commit, storing the raw testing.B output.
- Analyse with benchstat — use golang.org/x/perf/cmd/benchstat to compare the two sample sets. benchstat applies a statistical test (Welch's t-test by default) and reports a p-value alongside the delta, suppressing noise-driven alerts.
- Threshold on significance — only flag a regression when benchstat reports a statistically significant change (e.g., p < 0.05) and the effect size exceeds a meaningful threshold (e.g., Δ > 10 %).
- CI integration — store per-commit sample files as GitHub Actions artifacts (or in a dedicated branch/cache), diff them on PRs, and post the benchstat table as a PR comment.
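The gating rule in the steps above (significance and effect size must both trip) can be sketched in a few lines. This is illustrative only — the helper names are hypothetical, not project code, and the p-value uses a normal approximation to the t distribution, which is adequate for -count=10-sized samples; real CI would simply parse benchstat's output instead.

```python
# Sketch of the proposed alert rule: flag a regression only when Welch's
# t-test reports a significant difference AND the relative slowdown
# exceeds a minimum threshold. Hypothetical helper names; the p-value is
# a normal approximation to the t distribution (fine for illustration).
import math
from statistics import mean, variance

def welch_p_value(a, b):
    """Approximate two-sided p-value of Welch's t-test for samples a, b."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    if se2 == 0:
        return 1.0  # identical constant samples: no evidence of change
    t = (mean(b) - mean(a)) / math.sqrt(se2)
    return math.erfc(abs(t) / math.sqrt(2))  # normal approximation

def should_alert(base, head, alpha=0.05, min_delta=0.10):
    """Alert only if the change is significant (p < alpha) and > min_delta."""
    delta = (mean(head) - mean(base)) / mean(base)
    return welch_p_value(base, head) < alpha and delta > min_delta

# Illustrative allocs/op samples from ten -count runs (made-up numbers).
base = [100, 102, 98, 101, 99, 100, 103, 97, 101, 99]
print(should_alert(base, [x + 30 for x in base]))  # clear 30% regression -> True
print(should_alert(base, [x + 5 for x in base]))   # significant but only 5% -> False
```

Note the second call: with low variance even a 5 % shift is statistically significant, which is exactly why the effect-size threshold is needed in addition to the p-value.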
Possible tooling options:
- benchstat (official Go toolchain)
- continuous-benchmark with a custom script that pre-aggregates samples
- A bespoke GitHub Actions workflow that builds the benchmarks, runs them with -count=10, saves the output as an artifact, then diffs against the previous run's artifact using benchstat
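As a rough sketch, the bespoke-workflow option could look like the fragment below. All names, paths, and the artifact layout are assumptions for illustration, not the repository's actual configuration; a real workflow would also need the PR-comment step.

```yaml
# Illustrative sketch only -- job names, file names, and versions are
# assumptions, not existing configuration in this repository.
name: benchstat-regression
on: pull_request
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # need both base and head commits
      - uses: actions/setup-go@v5
      - name: Benchmark base commit
        run: |
          git checkout ${{ github.event.pull_request.base.sha }}
          go test -run '^$' -bench . -count=10 ./... > base.txt
      - name: Benchmark head commit
        run: |
          git checkout ${{ github.event.pull_request.head.sha }}
          go test -run '^$' -bench . -count=10 ./... > head.txt
      - name: Compare with benchstat
        run: |
          go install golang.org/x/perf/cmd/benchstat@latest
          benchstat base.txt head.txt | tee benchstat.txt
      - uses: actions/upload-artifact@v4
        with:
          name: benchstat
          path: benchstat.txt
```

The `-run '^$'` filter skips unit tests so only benchmarks execute; running both commits in one job keeps the hardware identical for the two sample sets.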
References
- .github/workflows/perf-regression.yaml
- benchstat documentation: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat