Summary
The perf-regression workflow was disabled in PR #7307 because the current benchmark CI setup produces false regression alerts.
The root cause is that the workflow relies on single-sample, cached comparisons via github-action-benchmark. The alloc metric is highly sensitive to how many iterations the harness happens to run:
- `BenchmarkRunEnumeration/Multiproto` showed 1,174,753 allocs/op at `-benchtime=1x` but only 1,047,282 allocs/op at `-benchtime=10x`
- `BenchmarkRunEnumeration/Default` dropped from ~54.4M allocs/op (`-benchtime=1x`) to ~25.6M (`-benchtime=10x`)

A single cached baseline sample can easily be contaminated by setup/teardown work, leading to misleading threshold breaches (e.g. the `2.15x` ratio alert triggered for `Multiproto - allocs/op`).
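To see why allocs/op shifts with iteration count: the harness reports total allocations divided by `b.N`, so one-time setup allocations inside the timed region are amortized as iterations grow. A toy calculation (the setup/per-iteration figures are hypothetical, chosen to roughly mirror the Multiproto numbers above):

```go
package main

import "fmt"

func main() {
	// Hypothetical allocation counts: a fixed setup cost inside the timed
	// region plus a steady-state cost per iteration.
	const setupAllocs int64 = 120_000 // one-time setup allocations (hypothetical)
	const perIter int64 = 1_050_000   // allocations per iteration (hypothetical)

	for _, n := range []int64{1, 10} {
		total := setupAllocs + perIter*n
		// allocs/op = total allocations / iteration count
		fmt.Printf("benchtime=%dx: %d allocs/op\n", n, total/n)
	}
}
```

At one iteration the setup cost inflates allocs/op by roughly 11%; at ten iterations the inflation is about 1%. Two runs that merely chose different iteration counts can therefore look like a regression.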
Proposed Approach
Replace the single-sample approach with a statistically sound methodology:
- Multiple samples per run — use `-count=N` (e.g., `-count=6` or `-count=10`) so each CI run produces multiple data points per benchmark.
- benchstat for comparison — use `golang.org/x/perf/cmd/benchstat` to compare the distributions from the base branch vs. the PR branch. `benchstat` applies a statistical test (a Mann-Whitney U test by default) to determine whether a difference is significant, dramatically reducing false positives.
- Stabilize the benchmark environment — optionally pin CPU affinity / disable frequency scaling on the CI runner, or use a dedicated self-hosted runner for benchmarks to reduce noise.
- Configurable threshold — expose a p-value threshold (e.g., `alpha=0.05`) and a minimum effect-size threshold (e.g., `delta >= 10%`) so only statistically significant and practically meaningful regressions produce alerts.
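The two-gate rule in the last bullet can be sketched as a small helper. `shouldAlert` is a hypothetical function name, and the sample numbers are made up for illustration:

```go
package main

import "fmt"

// shouldAlert applies the proposed gate: flag a regression only when the
// change is statistically significant (p below alpha) AND the effect size
// is practically meaningful (relative increase at or above minDelta).
func shouldAlert(oldMean, newMean, p, alpha, minDelta float64) bool {
	delta := (newMean - oldMean) / oldMean
	return p < alpha && delta >= minDelta
}

func main() {
	// A 2% increase with p=0.01 is statistically significant but too small
	// to be practically meaningful under a 10% effect-size floor.
	fmt.Println(shouldAlert(1_000_000, 1_020_000, 0.01, 0.05, 0.10)) // false
	// A 15% increase with p=0.01 trips both gates.
	fmt.Println(shouldAlert(1_000_000, 1_150_000, 0.01, 0.05, 0.10)) // true
}
```

Requiring both conditions is what keeps noisy-but-tiny shifts from paging anyone while still catching real regressions.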
Example Workflow Sketch
```yaml
- name: Run benchmarks (PR branch)
  run: go test -run='^$' -bench=BenchmarkRunEnumeration -benchmem -count=6 ./... | tee bench-new.txt

- name: Run benchmarks (base branch)
  run: |
    git fetch origin ${{ github.base_ref }}
    git checkout ${{ github.base_ref }}
    go test -run='^$' -bench=BenchmarkRunEnumeration -benchmem -count=6 ./... | tee bench-old.txt

- name: Compare with benchstat
  run: |
    go install golang.org/x/perf/cmd/benchstat@latest
    benchstat bench-old.txt bench-new.txt
```
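The value of feeding `benchstat` multiple samples is that a single outlier run no longer decides the comparison. A minimal sketch with hypothetical allocs/op samples (imagine one base-branch run contaminated by setup work, as in the Multiproto case above):

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	s := 0.0
	for _, x := range xs {
		s += x
	}
	return s / float64(len(xs))
}

func main() {
	// Hypothetical allocs/op readings from -count=6 runs; the third base
	// sample is an outlier inflated by one-time setup allocations.
	base := []float64{1_050_000, 1_047_000, 1_175_000, 1_049_000, 1_052_000, 1_046_000}
	pr := []float64{1_048_000, 1_051_000, 1_046_000, 1_053_000, 1_049_000, 1_047_000}

	// A single-sample comparison could pair the 1,175,000 outlier against a
	// typical PR sample and report a spurious ~12% regression, while the
	// six-sample means differ by only about 2%.
	fmt.Printf("base mean: %.0f\npr mean:   %.0f\n", mean(base), mean(pr))
}
```

`benchstat` goes further than comparing means, testing whether the two distributions plausibly differ at all, but even this simple view shows why one cached sample is an unreliable baseline.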
References
- benchstat docs: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat