[MAINTENANCE] replace single-sample cached benchmark comparisons with repeated-sample + benchstat analysis #7318

@coderabbitai

Background

The perf-regression.yaml workflow (powered by github-action-benchmark) compares each run's benchmark result against a single cached previous sample. This approach was disabled in PR #7307 after it fired a false regression alert for BenchmarkRunEnumeration/Multiproto (432,036 allocs/op vs 201,337 allocs/op, ratio 2.15×), exceeding the alert threshold.
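For illustration, the kind of check that fired can be sketched as a single ratio against one cached sample. The threshold of 2.0 below is an assumption (this issue does not state the configured value), and `exceedsThreshold` is a hypothetical helper, not part of github-action-benchmark.

```go
package main

import "fmt"

// exceedsThreshold reports whether current is more than threshold times
// previous. Hypothetical helper for illustration only.
func exceedsThreshold(current, previous, threshold float64) bool {
	if previous <= 0 {
		return false // no usable baseline, nothing to compare against
	}
	return current/previous > threshold
}

func main() {
	const (
		current   = 432036.0 // allocs/op from the run that alerted
		previous  = 201337.0 // allocs/op from the single cached sample
		threshold = 2.0      // ASSUMPTION: illustrative value, not taken from the workflow
	)
	fmt.Printf("ratio=%.2f alert=%v\n",
		current/previous, exceedsThreshold(current, previous, threshold))
}
```

Because both inputs are single measurements, one unusual sample on either side is enough to flip the result.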

Root Cause

Single-sample cached comparisons are highly sensitive to noise:

  • RunEnumeration/Multiproto measured locally at 1,174,753 allocs/op (-benchtime=1x) vs 1,047,282 allocs/op (-benchtime=10x).
  • RunEnumeration/Default measured at ~54.4 M allocs/op (-benchtime=1x) vs ~25.6 M allocs/op (-benchtime=10x).

The reported figure is dominated by how many iterations the harness happens to run, by setup/teardown overhead that is not amortized at low iteration counts, and by outliers in the cached sample, rather than by real regressions in the code.
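One way to see why allocs/op shifts with the iteration count is a simple amortization model: assume the reported figure is a steady per-iteration cost plus a one-time cost divided by the iteration count N. This model is an assumption made for illustration, not a measured breakdown of the benchmark, and `fitAmortized` is a hypothetical helper. Fitting it to the two RunEnumeration/Default figures above:

```go
package main

import "fmt"

// fitAmortized solves observed(N) = perIter + oneTime/N for two observations
// taken at different iteration counts (n1 must differ from n2).
// ASSUMPTION: the two-term model itself is illustrative, not measured.
func fitAmortized(n1, obs1, n2, obs2 float64) (perIter, oneTime float64) {
	// Subtracting the two equations eliminates perIter:
	//   obs1 - obs2 = oneTime * (1/n1 - 1/n2)
	oneTime = (obs1 - obs2) / (1/n1 - 1/n2)
	perIter = obs1 - oneTime/n1
	return perIter, oneTime
}

func main() {
	// ~54.4 M allocs/op at -benchtime=1x, ~25.6 M allocs/op at -benchtime=10x.
	perIter, oneTime := fitAmortized(1, 54.4e6, 10, 25.6e6)
	fmt.Printf("per-iteration ≈ %.1f M allocs, one-time ≈ %.1f M allocs\n",
		perIter/1e6, oneTime/1e6)
}
```

Under this assumed model, more than half of the -benchtime=1x figure would be one-time cost, so a change in iteration count alone could roughly double the reported value without any code change.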

Proposed Solution

Replace the current single-sample approach with a statistically sound methodology:

  1. Collect multiple samples: run each benchmark suite with -count=N (e.g., -count=10) on both the base commit and the head commit, storing the raw go test -bench output for each.
  2. Analyse with benchstat: use golang.org/x/perf/cmd/benchstat to compare the two sample sets. benchstat applies a significance test (the Mann-Whitney U-test) and reports a p-value alongside the delta, so differences that are indistinguishable from noise are not reported as changes.
  3. Threshold on significance: only flag a regression when benchstat reports a statistically significant change (e.g., p < 0.05) and the effect size exceeds a meaningful threshold (e.g., Δ > 10%).
  4. CI integration: store per-commit sample files as GitHub Actions artifacts (or in a dedicated branch/cache), diff them on PRs, and post the benchstat table as a PR comment.
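The sample-collection and gating steps above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not a replacement for benchstat: `parseAllocs`, `median`, and `isRegression` are hypothetical helpers, the p-value is taken as an input (it would come from benchstat), using the median as the summary statistic is a choice made for this sketch, and the benchmark lines in `main` are synthetic.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseAllocs collects allocs/op samples per benchmark name from raw
// `go test -bench` output (one sample per result line, so -count=N yields N).
func parseAllocs(raw string) map[string][]float64 {
	samples := make(map[string][]float64)
	sc := bufio.NewScanner(strings.NewReader(raw))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 4 || !strings.HasPrefix(f[0], "Benchmark") {
			continue
		}
		// After the name and iteration count, fields come as "<value> <unit>" pairs.
		for i := 2; i+1 < len(f); i += 2 {
			if f[i+1] != "allocs/op" {
				continue
			}
			if v, err := strconv.ParseFloat(f[i], 64); err == nil {
				samples[f[0]] = append(samples[f[0]], v)
			}
		}
	}
	return samples
}

// median returns the middle value of xs (0 for an empty slice).
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...) // copy so the caller's slice is not reordered
	sort.Float64s(s)
	n := len(s)
	switch {
	case n == 0:
		return 0
	case n%2 == 1:
		return s[n/2]
	default:
		return (s[n/2-1] + s[n/2]) / 2
	}
}

// isRegression flags a regression only when the change is both statistically
// significant (pValue < alpha) and larger than minDelta (a fraction, 0.10 = 10%).
func isRegression(base, head []float64, pValue, alpha, minDelta float64) bool {
	b := median(base)
	if b == 0 {
		return false
	}
	delta := (median(head) - b) / b
	return pValue < alpha && delta > minDelta
}

func main() {
	// Synthetic input for illustration only; not real measurements.
	base := parseAllocs("BenchmarkExample-8 10 1200 ns/op 64 B/op 100 allocs/op\n" +
		"BenchmarkExample-8 10 1210 ns/op 64 B/op 101 allocs/op\n")
	head := parseAllocs("BenchmarkExample-8 10 1500 ns/op 80 B/op 125 allocs/op\n" +
		"BenchmarkExample-8 10 1490 ns/op 80 B/op 124 allocs/op\n")
	const pValue = 0.01 // placeholder: in practice this is read from benchstat
	name := "BenchmarkExample-8"
	fmt.Println(isRegression(base[name], head[name], pValue, 0.05, 0.10))
}
```

Requiring both conditions means a large but noisy delta and a significant but tiny delta are each ignored; only changes that pass both gates would alert.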

Possible tooling options:

  • benchstat (maintained by the Go team in golang.org/x/perf; installed separately from the Go toolchain)
  • continuous-benchmark with a custom script that pre-aggregates samples
  • A bespoke GitHub Actions workflow that builds, runs -count=10, saves output to an artifact, then diffs against the previous run's artifact using benchstat
