Add changepoint detection algorithms and eval infrastructure#47683
ellataira wants to merge 24 commits into q-branch-observer from
Conversation
Introduce a standalone scoring system that evaluates observer anomaly detection against ground truth disruption timestamps using Gaussian overlap. The scorer is fully decoupled from the testbench.
- comp/observer/impl/score.go: Gaussian F1 scoring with half-Gaussian overlap, warmup filtering (before baseline.start), and cascading filtering (beyond 2σ after onset). Infers ground truth from scenario metadata.json or accepts explicit timestamps.
- cmd/observer-scorer/main.go: standalone scorer binary
- comp/observer/impl/eval_test.go: integration test that runs all benchmark scenarios headless and prints a score summary table. Three scenarios with hardcoded ground truths. Run with: go test -run TestEval -v ./comp/observer/impl/
- comp/observer/impl/score_test.go: 15 unit tests for scoring logic
- comp/anomalydetection/recorder/impl/recorder.go: add NewReadOnlyRecorder() for test use without fx
- cmd/observer-testbench/main.go: remove scoring flags and logic
- scenarios/*/metadata.json: ground truth metadata for three scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
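The half-Gaussian overlap idea behind the scorer can be illustrated with a minimal sketch. This is not the score.go implementation; the function name, the zero-credit-before-onset behavior, and the parameter values are assumptions chosen only to show the shape of the scoring curve.

```go
package main

import (
	"fmt"
	"math"
)

// gaussianCredit sketches half-Gaussian overlap scoring: a detection at time
// `detection` earns credit exp(-dt^2 / (2*sigma^2)) relative to the ground
// truth onset, using only the half of the Gaussian after onset. In this
// sketch a detection before onset earns no credit at all.
func gaussianCredit(detection, onset, sigma float64) float64 {
	dt := detection - onset
	if dt < 0 {
		return 0 // before onset: no credit in this sketch
	}
	return math.Exp(-(dt * dt) / (2 * sigma * sigma))
}

func main() {
	// A detection exactly one sigma (15s) late earns exp(-0.5) ≈ 0.61 credit.
	fmt.Printf("%.2f\n", gaussianCredit(115, 100, 15)) // 0.61
}
```

Per-detection credits like this can then be summed into soft TP counts from which precision, recall, and F1 follow in the usual way.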
- tasks/q.py: add q.build-scorer and q.eval invoke tasks
- eval_test.go: add Baseline FPs column, use text/tabwriter for table

Usage: dda inv q.eval / dda inv q.eval --scenario 213_pagerduty
Replace eval_test.go with an invoke-based eval that runs the actual testbench and scorer binaries. This tests the real code path (CLI, fx, binary entry points) and leaves inspectable output JSONs in /tmp.
- tasks/q.py: rewrite q.eval to build and run the binaries, collect JSON scores, and print a summary table with baseline FP counts
- cmd/observer-scorer/main.go: add --json flag for machine-readable output
- Delete eval_test.go and NewReadOnlyRecorder() (no longer needed)

Usage:
dda inv q.eval
dda inv q.eval --scenario 213_pagerduty
dda inv q.eval --sigma 15
Align all defaults to ./comp/observer/scenarios, matching the convention from the headless mode PR. Updates observer-testbench, observer-scorer, and invoke tasks.
…gepoint-detection-algorithms
… truth, detector eval task
- DetectorPassthroughCorrelator: emits one ActiveCorrelation per anomaly
(no clustering), enabling per-detector scoring via the existing scorer
- Extended metadata.json for all 3 scenarios with true_positives and
false_positives (service + metric_name pairs) from scenario YAMLs
- New `dda inv q.eval-detectors` task: runs each detector x scenario x
correlator (passthrough L1 + time_cluster L2), prints comparison matrix
- Registered passthrough correlator in testbench registry (default disabled)
New score_metrics.go (detached from score.go for easy cherry-pick):
- LoadMetricGroundTruth: reads true_positives/false_positives from metadata.json
- ScoreMetrics: classifies each anomaly period's metric as TP, FP, or unknown
by matching anomaly Source against service:metric ground truth pairs
- Handles aggregate suffixes (e.g., "redis.cpu.sys:avg" matches "redis.cpu.sys")
- Reports metric-level precision, recall, F1, plus found/missed TP lists
Wiring:
- observer-scorer: new --score-metrics flag, outputs metrics in JSON
- q.eval-detectors: L1 runs use --verbose + --score-metrics, table shows
mTP/mFP/mUnk/mPrec/mRec columns
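The aggregate-suffix handling described above ("redis.cpu.sys:avg" matching "redis.cpu.sys") can be sketched as a tiny normalizer. This is an illustration, not the score_metrics.go code; the function name and the drop-everything-after-the-last-colon rule are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// stripAggregateSuffix sketches the suffix handling the commit describes:
// a source such as "redis.cpu.sys:avg" should compare equal to the ground
// truth metric "redis.cpu.sys". This sketch simply drops everything after
// the last ':' when one is present.
func stripAggregateSuffix(metric string) string {
	if i := strings.LastIndex(metric, ":"); i >= 0 {
		return metric[:i]
	}
	return metric
}

func main() {
	fmt.Println(stripAggregateSuffix("redis.cpu.sys:avg")) // redis.cpu.sys
	fmt.Println(stripAggregateSuffix("redis.cpu.sys"))     // redis.cpu.sys
}
```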
Implement Mann-Whitney, TopK, Correlation, PELT, E-Divisive, Hardened CUSUM, Ensemble, and Cusum_adaptive detectors. Add an adapterMetricID() source-field fix for metric scoring, a service-level matching fallback in the scorer, and eval harness updates. All detectors are registered (new ones disabled by default).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dened, cusum_adaptive

Remove 5 detectors that plateaued or failed during eval iteration, keeping the 3 promising algorithms (mannwhitney, topk, correlation) alongside the existing baseline detectors (cusum, bocpd, rrcf).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…taDog/datadog-agent into ella/changepoint-detection-algorithms

# Conflicts:
#	cmd/observer-scorer/main.go
#	comp/observer/impl/score.go
#	comp/observer/impl/score_metrics.go
#	tasks/q.py
Enrich MetricScoreResult with per-metric detection detail: first-seen timestamps, detection counts, and delta from disruption start. Surface in scorer JSON/text output and eval-detectors task. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix gofmt issues in testbench_registry.go and score_metrics.go
- Extract duplicated median/MAD/seriesLabel/metricID/service helpers into metrics_detector_util.go, replacing prefixed copies in each detector
- detectorMAD takes a scaleToSigma bool to document the intentional difference: MW scales (σ comparison), TopK does not (raw denominator)
- Remove unused _count_baseline_fps from tasks/q.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts with upstream interface renames (PR #47485):
- observer.go, testbench.go: take upstream versions with new type names
- Update detector files to use DetectionResult (was MetricsDetectionResult/MultiSeriesDetectionResult)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Split q.py eval_detectors output into two clean tables: L2 timestamp detection (TimeCluster, Gaussian F1) and L1 per-metric scoring (Passthrough, mPrec/mRec/mF1). L1 no longer redundantly computes timestamp F1.
- Add detectorSampleStddev to shared utils; use shared helpers in the Mann-Whitney detector
- Bound the corrshift firedSeries map to prevent unbounded memory growth
- gofmt alignment fixes in output.go, rrcf.go, time_cluster.go, score_metrics_test.go

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3297787c9
comp/observer/impl/score_metrics.go
(Outdated)
    return sourceName == metric || strings.Contains(sourceName, metric)
Match service when classifying metric detections
metricMatches ignores the service portion of each service:metric ground-truth key and then accepts substring metric matches, so a source like dispatch-service:trace.http.request.hits can be counted against TP keys for a different service, and trace.http.request.errors can match trace.http.request. Since TP matching runs before FP matching, this can systematically inflate TP counts and distort L1 precision/recall (including non-deterministic outcomes when multiple keys match).
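A service-aware matcher along the lines Codex suggests could look like the sketch below. This is a hypothetical fix, not the PR's code: the function name, the SplitN-based key parsing, and the aggregate-suffix stripping rule are all assumptions; the point is that both the service and the exact metric name must match.

```go
package main

import (
	"fmt"
	"strings"
)

// matchKey sketches the suggested fix: a "service:metric" ground-truth key
// matches only when the anomaly's service equals the key's service AND its
// metric name (after dropping an optional aggregate suffix such as ":avg")
// equals the key's metric exactly, with no substring matching.
func matchKey(key, service, source string) bool {
	parts := strings.SplitN(key, ":", 2)
	if len(parts) != 2 {
		return false
	}
	keyService, keyMetric := parts[0], parts[1]
	// Drop a trailing aggregate suffix like ":avg" from the source, if any.
	if i := strings.LastIndex(source, ":"); i >= 0 {
		source = source[:i]
	}
	return service == keyService && source == keyMetric
}

func main() {
	fmt.Println(matchKey("redis:trace.http.request.hits", "redis", "trace.http.request.hits:avg"))       // true
	fmt.Println(matchKey("redis:trace.http.request.hits", "dispatch-service", "trace.http.request.hits")) // false: wrong service
	fmt.Println(matchKey("redis:trace.http.request", "redis", "trace.http.request.errors"))               // false: no substring match
}
```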
Go Package Import Differences (baseline: 681899c)
- Revert tasks/q.py to upstream (remove eval_detectors task)
- Remove scenario metadata.json ground truth files
- Add metadata.json format documentation with JSON example to
LoadMetricGroundTruth for future integration
- Document passthrough correlator output format (ActiveCorrelations):
pattern naming, title format, one anomaly per correlation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c8aa3baf43
    if d.fired[fireKey] {
        continue
Reset TopK fired cache before rerunning detectors
This detector keeps a process-wide fired map and drops any metric that has fired before, but it does not implement Reset(), while the test bench only clears state for components that expose Reset (comp/observer/impl/testbench.go, resetAllState). After one scenario load/rerun, a second rerun in the same process will skip previously fired metric IDs at this check and can emit no TopK anomalies, which corrupts repeated interactive evaluations and component-toggle experiments.
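The missing Reset() the comment describes could be added along these lines. This is a hypothetical sketch, not the detector's actual code: the type and method names are invented, and it only models the fired-map dedup, not the rest of TopK state.

```go
package main

import "fmt"

// topKDedup sketches the fix: keep the fired map behind a small struct that
// exposes Reset(), so the testbench's resetAllState (which only clears
// components implementing Reset) can wipe it between scenario reruns.
type topKDedup struct {
	fired map[string]bool
}

func newTopKDedup() *topKDedup { return &topKDedup{fired: map[string]bool{}} }

// ShouldFire reports whether a metric may fire now, and marks it as fired.
func (d *topKDedup) ShouldFire(key string) bool {
	if d.fired[key] {
		return false
	}
	d.fired[key] = true
	return true
}

// Reset clears dedup state so a scenario rerun starts fresh.
func (d *topKDedup) Reset() { d.fired = map[string]bool{} }

func main() {
	d := newTopKDedup()
	fmt.Println(d.ShouldFire("redis.cpu.sys")) // true: first fire
	fmt.Println(d.ShouldFire("redis.cpu.sys")) // false: deduplicated
	d.Reset()
	fmt.Println(d.ShouldFire("redis.cpu.sys")) // true: fresh after reset
}
```

The same shape would apply to the CorrShift state flagged in the next comment, with Reset also clearing its rolling-norm history.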
    if d.firedSeries[fireKey] {
        continue
    }
    d.firedSeries[fireKey] = true
Clear CorrShift dedup state across scenario reruns
CorrShift deduplicates emissions via firedSeries and accumulates rolling norms in recentNorms, but the type provides no Reset() even though reruns rely on reset-capable components to clear prior state (comp/observer/impl/testbench.go, resetAllState). Replaying another scenario (or rerunning the same one) in the same process reuses old dedup keys and history, suppressing expected anomalies and shifting thresholds with stale data.
Useful? React with 👍 / 👎.
comp/observer/impl/score_metrics.go
(Outdated)
    if source == "" {
        result.UnknownCount++
        result.UnknownDetectionCount++
        continue
Reject metric scoring when anomaly sources are unavailable
Metric scoring silently treats missing per-period metric sources as unknown instead of failing, so default non-verbose headless output (which omits nested anomalies/title fields) can produce valid-looking TP/FP zeros and misleading mPrec/mRec. In this branch, an empty source is accepted and counted as unknown here, which can make detector comparisons incorrect unless callers manually remember to generate verbose output.
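A fail-fast variant of this path might look like the sketch below. It is an illustration of the suggested behavior, not the scorer's code: the function name, signature, and error message are invented.

```go
package main

import (
	"errors"
	"fmt"
)

// validateSources sketches the suggested guard: refuse to score when any
// anomaly period lacks a metric source (e.g. the scorer was fed default
// non-verbose output that omits nested anomaly fields), instead of silently
// counting such periods as unknown and reporting misleading zeros.
func validateSources(sources []string) error {
	for i, s := range sources {
		if s == "" {
			return fmt.Errorf(
				"anomaly period %d has no metric source; rerun the testbench with --verbose: %w",
				i, errors.New("missing source"))
		}
	}
	return nil
}

func main() {
	err := validateSources([]string{"redis:redis.cpu.sys", ""})
	fmt.Println(err != nil) // true: the gap is rejected, not scored as unknown
}
```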
Useful? React with 👍 / 👎.
Static quality checks
❌ Please find below the results from static quality gates: Error (gate failure full details in CI). Static quality gates prevent the PR from merging!
Successful checks: Info. On-wire sizes (compressed) reported in CI.
…rithms-v2

# Conflicts:
#	comp/observer/impl/testbench_registry.go
…etector

Batch-mode SeriesDetector re-runs from scratch every tick, which is too expensive under unified scheduling. This converts MannWhitneyDetector to the streaming Detector interface (same pattern as BOCPD in #47739):
- Per-series state with cursor-based incremental reads
- Fixed baseline from warmup, sliding circular buffer for recent window
- Alert lifecycle with hysteresis and recovery
- All 4 original filters preserved (significance, effect size, deviation, relative change)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
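The cursor-plus-circular-buffer pattern this commit describes can be sketched in isolation. This is a minimal illustration of the streaming idea, not MannWhitneyDetector itself: the type name, field names, and window size are assumptions, and the baseline/alert machinery is omitted.

```go
package main

import "fmt"

// seriesState sketches per-series streaming state: a cursor marking the next
// unread point, and a fixed-size circular buffer holding the recent window,
// so each tick processes only newly appended points instead of re-running
// over the whole series.
type seriesState struct {
	cursor int       // index of the next unread point in the full series
	recent []float64 // circular buffer of the most recent window
	next   int       // next write position in the circular buffer
	filled bool      // whether the buffer has wrapped at least once
}

func newSeriesState(window int) *seriesState {
	return &seriesState{recent: make([]float64, window)}
}

// Ingest consumes only the points appended since the previous tick.
func (s *seriesState) Ingest(series []float64) {
	for ; s.cursor < len(series); s.cursor++ {
		s.recent[s.next] = series[s.cursor]
		s.next = (s.next + 1) % len(s.recent)
		if s.next == 0 {
			s.filled = true
		}
	}
}

// Window returns the current recent-window contents. Order does not matter
// for rank-based statistics such as the Mann-Whitney U test.
func (s *seriesState) Window() []float64 {
	if s.filled {
		return s.recent
	}
	return s.recent[:s.next]
}

func main() {
	st := newSeriesState(3)
	st.Ingest([]float64{1, 2})       // first tick: reads 2 points
	st.Ingest([]float64{1, 2, 3, 4}) // second tick: reads only 3 and 4
	fmt.Println(len(st.Window()), st.cursor) // 3 4
}
```

Each tick, the detector would then compare Window() against the fixed warmup baseline rather than rescanning the series for a split point.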
Summary
Adds 3 new changepoint detection algorithms (Mann-Whitney, TopK, CorrShift), a passthrough correlator, and shared detector utilities.
New Detector Files
- comp/observer/impl/metrics_detector_mannwhitney.go (368 lines, SeriesDetector): slides a split point across a time series and picks the candidate where a rank-based Mann-Whitney U test shows the strongest before/after shift. Uses 5 layered filters (p-value < 1e-12, effect size > 0.95, deviation > 3 MADs, relative change > 20%, minimum window of 60 points).
- comp/observer/impl/metrics_detector_topk.go (367 lines, Detector): ranks all metrics by |post_median - pre_median| / MAD and reports only the top-K (min of 20 and top 2%). Includes a service diversity bonus so infrastructure metrics don't crowd out all slots.
- comp/observer/impl/metrics_detector_corrshift.go (711 lines, Detector): computes rolling correlation matrices over the top-40-by-variance series and flags when the Frobenius norm of the correlation delta exceeds mean + 2*stddev. Detects cascading failures where previously independent metrics suddenly correlate. Renamed from correlation in commit bdd09f9b69.

New Correlator
- comp/observer/impl/anomaly_correlator_passthrough.go (104 lines): bypasses TimeCluster grouping; emits each anomaly as its own 1-member correlation so downstream consumers see exactly which metrics each detector fires on.
- comp/observer/impl/anomaly_correlator_passthrough_test.go (93 lines): unit tests covering single-anomaly, multi-anomaly, and empty-input cases.

Shared Utilities
- comp/observer/impl/metrics_detector_util.go (126 lines): extracted from duplicate code across detectors: detectorMedian, detectorMAD, detectorMeanValues, detectorSampleStddev, detectorSeriesLabel, detectorMetricID, detectorService, detectorHasServiceTag.

Registry Changes
- comp/observer/impl/testbench_registry.go (+36 lines): registers the mannwhitney, corrshift, and topk detectors and the passthrough correlator, all with DefaultEnabled: false.

Formatting / Cleanup
- anomaly_correlator_time_cluster.go, metrics_detector_rrcf.go, output.go: gofmt alignment fixes only.

Test Plan
- anomaly_correlator_passthrough_test.go covers passthrough correlator behavior
- New components registered with DefaultEnabled: false
- cmd/observer-scorer
- gofmt clean
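The TopK ranking score from the summary, |post_median - pre_median| / MAD, can be sketched end to end. This is an illustration of the formula only, not the detector's code: the function names are invented, and per the commit notes the MAD here is deliberately left unscaled (no sigma conversion).

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// median returns the middle value of xs without mutating it.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// mad returns the raw median absolute deviation of xs (unscaled).
func mad(xs []float64) float64 {
	m := median(xs)
	dev := make([]float64, len(xs))
	for i, x := range xs {
		dev[i] = math.Abs(x - m)
	}
	return median(dev)
}

// robustShift sketches the TopK score: |post_median - pre_median| / MAD(pre).
func robustShift(pre, post []float64) float64 {
	d := mad(pre)
	if d == 0 {
		return 0 // degenerate flat baseline; a real detector would need a floor
	}
	return math.Abs(median(post)-median(pre)) / d
}

func main() {
	pre := []float64{10, 11, 9, 10, 12, 10, 9}  // median 10, MAD 1
	post := []float64{30, 31, 29, 30}           // median 30
	fmt.Printf("%.1f\n", robustShift(pre, post)) // 20.0
}
```

Metrics would be sorted by this score and only the top-K (min of 20 and the top 2%) reported, with the service diversity bonus adjusting ranks before the cut.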