feat(telemetry): consolidate quality metrics behind one flag-gated tap by TheTom · Pull Request #19 · thewafflehaus/FFAI

TheTom · 2026-06-03T21:37:00Z

Implements the design in #18 (planning/telemetry-quality-metrics-design.md).

Draft: still blocked on #17 for green CI. FFAI dev doesn't build against current metaltile: metaltile's SDPA kernel gained has_sink/sink_logit params, and dev's Ops.swift call-sites haven't been synced (9 sink call-sites needed vs dev's 3). #17 carries the Ops.swift sync, so Build and test stays red on those pre-existing Ops.swift errors (none are in the telemetry diff) until #17 merges; this branch then rebases clean and goes green. The telemetry change type-checks cleanly in isolation: a full-module type-check surfaces 0 errors in Telemetry/, Benchmark/, Generation/Generate.swift, or FFAICLI/BenchCommand.swift, and every error is in the pre-existing Ops.swift drift.

Rebased onto current dev (picks up the #18 design-doc merge).

What this does

QualityScorable (Telemetry/) — the single model-facing contract: makeScoringCaches + scoringForward (full next-token logits). Model conforms by delegating to its engine, so every family is scorable for free. Logits-not-logprobs keeps the math centralized and KLD apples-to-apples.
Perplexity.swift moved Stats/ → Telemetry/; compute / klDivergence re-typed to some QualityScorable. Argument labels kept → BenchRunner callers unchanged.
InspectTap gains QualityMetrics (.perplexity/.kld/.niah), parsed from FFAI_TELEMETRY=ppl,kld,…. Default [] ⇒ a disabled metric does zero work.
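The flag gating described above can be illustrated with a minimal sketch. It is written in Python for compactness and is not the PR's Swift code: the `niah` token spelling, the handling of unknown tokens (ignored here), and the function name `parse_quality_metrics` are assumptions; the source only shows `ppl` and `kld`.

```python
import enum


class QualityMetrics(enum.Flag):
    """Illustrative stand-in for the QualityMetrics option set."""
    PERPLEXITY = 1
    KLD = 2
    NIAH = 4


# Token spellings: "ppl" and "kld" appear in the PR text; "niah" is assumed.
_TOKENS = {
    "ppl": QualityMetrics.PERPLEXITY,
    "kld": QualityMetrics.KLD,
    "niah": QualityMetrics.NIAH,
}


def parse_quality_metrics(raw):
    """Parse a comma-separated flag string into a metric set.

    An unset or empty value yields the empty set, so callers can skip
    all metric work with a single truthiness check.
    Unknown tokens are ignored (an assumption, not confirmed by the PR).
    """
    metrics = QualityMetrics(0)
    for token in (raw or "").split(","):
        metrics |= _TOKENS.get(token.strip().lower(), QualityMetrics(0))
    return metrics


if __name__ == "__main__":
    import os
    enabled = parse_quality_metrics(os.environ.get("FFAI_TELEMETRY"))
    if not enabled:
        print("telemetry disabled: hot path untouched")
    elif QualityMetrics.PERPLEXITY in enabled:
        print("perplexity capture enabled")
```

Because the empty set is falsy, a caller can guard any logit readback behind one check, which matches the "disabled metric does zero work" intent.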

Stats/ keeps the non-quality runtime stats (GenerationStats, MemoryStats, ThinkingSplit).

Follow-ups — now implemented in this PR

All four deferred items from the design now land here:

Live-generation capture wiring (§5) — driveGeneration reads the single InspectTap; FFAI_TELEMETRY=ppl accumulates the model's self-perplexity over its own stream onto GenerationStats.genPerplexity. Default off: the hot path is byte-for-byte the existing fused-kernel sampler when the flag is unset (no logit readback, no softmax). .kld/.niah are bench-path metrics (no paired reference / retrieval harness live in decode).
PPL/KLD methodology (§6) — context windowing + stride (WindowPlan: n_ctx window strided by n_ctx/2, scores each window's second half, every token counted once, ≥ n_ctx/2 left-context); BOS handling; reference-logit disk cache (ReferenceLogitCache, two-phase KLD: --save-ref-logits dumps the f16 full-vocab reference once, --ref-logits scores candidates against it with no second model resident); distribution reporting (KLDistribution: mean/median/p90/p99/max + top-1 agreement %). contextWindow=0 preserves the legacy single-pass numbers.
NIAH (§5/§8) — Telemetry/NIAH.swift: needle buried at a known depth in a long filler haystack, recalled via greedy argmax over scoringForward (rides the same QualityScorable contract); swept across a (context-length × depth) grid; bench --method niah.
Release-pinned benches (§7) — BenchMethod.isQualityMetric; the CLI hard-refuses a quality bench on a DEBUG build (--allow-debug-bench to override for smoke tests); release-pinned make bench.
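To make the distribution reporting in the methodology item concrete, here is a small Python sketch of per-position KLD computed from raw logits and then summarized. It is not the PR's `KLDistribution` type: the KL direction (reference relative to candidate), the nearest-rank percentile rule, and all function names are assumptions made for illustration.

```python
import math
import statistics


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, cand_logits):
    """KL(ref || cand) for one position, computed from raw logits."""
    ref_lp = log_softmax(ref_logits)
    cand_lp = log_softmax(cand_logits)
    return sum(math.exp(r) * (r - c) for r, c in zip(ref_lp, cand_lp))


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def percentile(sorted_values, q):
    """Nearest-rank percentile (an assumed convention)."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(ref_rows, cand_rows):
    """Mean/median/p90/p99/max of per-position KLD plus top-1 agreement %."""
    klds = sorted(kl_divergence(r, c) for r, c in zip(ref_rows, cand_rows))
    agree = sum(argmax(r) == argmax(c) for r, c in zip(ref_rows, cand_rows))
    return {
        "mean": sum(klds) / len(klds),
        "median": statistics.median(klds),
        "p90": percentile(klds, 90),
        "p99": percentile(klds, 99),
        "max": klds[-1],
        "top1_agreement_pct": 100.0 * agree / len(klds),
    }
```

Working from logits on both sides keeps the softmax in one place, which is the point of the logits-not-logprobs choice in the contract.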

New CLI surface: --wikitext2-context, --save-ref-logits, --ref-logits, --allow-debug-bench.

Verification

Full-module type-check: 0 errors in the telemetry/bench diff (all build errors are the pre-existing Ops.swift sink drift, whose fix is carried by #17, feat(gguf,dsv4): GGUF v3 reader + DeepSeek-V4-Flash GGUF loader & forward path).
New pure-logic WindowPlanTests (no GPU): window plan covers every corpus token exactly once across a spread of (n, ctx), ≥ n_ctx/2 left-context invariant, and KLDistribution percentile/top-1 math.
Perplexity per-position internals unchanged → PerplexityTests semantics preserved by construction.
End-to-end model runs (PPL/KLD numbers, NIAH recall) need a green build, so they land once #17 (feat(gguf,dsv4): GGUF v3 reader + DeepSeek-V4-Flash GGUF loader & forward path) merges and CI is unblocked.

ekryski · 2026-06-03T22:35:09Z

Live-generation capture wiring (consume the tap flags in the decode loop).

PPL/KLD methodology upgrades: context windowing + stride, reference-logit caching, distribution + top-1-agreement reporting.

Let's do these 2 things in this PR I think. Then we can call it done.

…move Perplexity to Telemetry/

Implements the quality-metrics consolidation design (planning/telemetry-quality-metrics-design.md).

- QualityScorable (new, Telemetry/): the one model-facing contract, makeScoringCaches + scoringForward (returns full next-token logits). Model conforms by delegating to its engine, so every current family is scorable for free.
- Perplexity.swift moved Stats/ -> Telemetry/; compute/klDivergence re-typed from the concrete Model to `some QualityScorable` (argument labels kept, so callers are unchanged). The metric math is byte-for-byte identical.
- InspectTap gains a QualityMetrics OptionSet (.perplexity/.kld/.niah) parsed from FFAI_TELEMETRY=ppl,kld,...; default [] so a disabled metric does zero work. isCapturingMetrics / captures(_:) gate the (follow-up) live-capture path.

Stats/ keeps the non-quality runtime stats (GenerationStats, MemoryStats, ThinkingSplit).

NIAH, live-generation capture wiring, the PPL/KLD methodology upgrades (windowing/stride, reference-logit caching, distribution reporting), and the release-bench pin are follow-ups per the doc.

Per planning/telemetry-quality-metrics-design.md §7: forced-decode over a corpus (wikitext2 PPL/KLD, niah) is far too slow in debug to publish. Add BenchMethod.isQualityMetric, hard-refuse a quality bench on a DEBUG build (--allow-debug-bench to override for smoke tests only), and a release-pinned `make bench` target.

…stribution

Per planning/telemetry-quality-metrics-design.md §6:

- Context windowing + stride (§6.1): WindowPlan strides an n_ctx window by n_ctx/2 and scores only each window's second half, so every scored token carries >= n_ctx/2 of real left-context and is counted exactly once. contextWindow=0 keeps the legacy single-pass behaviour (back-compat).
- BOS / first-token handling (§6.2): prepend <bos> once for BOS-critical families before scoring the corpus.
- KLD reference-logit caching (§6.3): two-phase KLD via ReferenceLogitCache: dump the full-precision reference's per-position log-probs to disk (f16) once, then score each candidate against the file without co-loading a reference model. CLI: --save-ref-logits (phase A) / --ref-logits (phase B).
- Distribution reporting (§6.4): KLDistribution carries mean/median/p90/p99/max plus top-1 agreement %, printed by the bench; the report row keeps the mean.

Wired through BenchOptions / BenchRunner.runWikiText2 / BenchCommand (--wikitext2-context). Adds pure-logic WindowPlanTests (coverage + percentiles).
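The windowing rule in this commit message can be sketched as follows. This Python version is one reading of the description, not the PR's `WindowPlan`: the names `Window` and `plan_windows` are hypothetical, and the treatment of the first half-window (used only as context, never scored) and of a short final window (truncated at the corpus end) are assumptions the source does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: int       # index of the first token fed to the model
    end: int         # one past the last token fed to the model
    score_from: int  # first position in [start, end) that is scored


def plan_windows(n_tokens, n_ctx):
    """Plan n_ctx-wide windows strided by n_ctx/2, scoring each second half.

    n_ctx == 0 means a single legacy pass over the whole corpus.
    """
    if n_ctx == 0:
        return [Window(0, n_tokens, 0)]
    if n_ctx < 2 or n_ctx % 2:
        raise ValueError("n_ctx must be a positive even number")
    half = n_ctx // 2
    windows = []
    start = 0
    # Stop once a window would have no positions left to score.
    while start + half < n_tokens:
        end = min(start + n_ctx, n_tokens)
        windows.append(Window(start, end, start + half))
        start += half
    return windows


def scored_positions(windows):
    """Flatten the plan into the list of scored corpus positions."""
    return [p for w in windows for p in range(w.score_from, w.end)]
```

Since each window's scored half begins exactly where the previous one ended, no scored position repeats, and every scored position sits at least n_ctx/2 tokens past its window start.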

Per planning/telemetry-quality-metrics-design.md §5/§8. Telemetry/NIAH.swift buries a low-frequency needle fact at a known depth inside a long filler haystack and asks the model to recall it, swept across a (context-length × depth) grid; reports recall accuracy. Rides the same QualityScorable contract as PPL/KLD — the answer is greedy argmax over scoringForward logits, no sampler, no per-family code (small Tokenizing/EOSProviding capability protocols reach the tokenizer + EOS set; Model satisfies all). Wired as bench --method niah (now isImplemented); BenchRunner.runNIAH prints the grid + accuracy and stores the summary in the report row's preview.
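A rough sketch of the needle placement and grid sweep described here, in Python. Nothing below is taken from Telemetry/NIAH.swift: the helper names, the token-level representation, and the stub `scoring_forward` callable (which stands in for the model and returns next-token logits for a token list) are all hypothetical.

```python
from itertools import cycle, islice, product


def build_haystack(filler, needle, context_len, depth):
    """Insert `needle` at fractional `depth` of a filler haystack.

    Returns (tokens, needle_start). `filler` is repeated to fill the budget.
    """
    if not filler:
        raise ValueError("filler must be non-empty")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    n_filler = context_len - len(needle)
    if n_filler < 0:
        raise ValueError("needle longer than context")
    body = list(islice(cycle(filler), n_filler))
    at = round(depth * n_filler)
    return body[:at] + list(needle) + body[at:], at


def greedy_recall(scoring_forward, prompt, answer, eos):
    """Greedy argmax decode; True if the output equals `answer`."""
    tokens, out = list(prompt), []
    for _ in range(len(answer)):
        logits = scoring_forward(tokens)
        nxt = max(range(len(logits)), key=logits.__getitem__)
        if nxt in eos:
            break
        out.append(nxt)
        tokens.append(nxt)
    return out == list(answer)


def sweep(scoring_forward, filler, needle, question, answer, eos,
          lengths, depths):
    """Run the (context-length x depth) grid; return (grid, accuracy)."""
    grid = {}
    for n, d in product(lengths, depths):
        haystack, _ = build_haystack(filler, needle, n, d)
        prompt = haystack + list(question)
        grid[(n, d)] = greedy_recall(scoring_forward, prompt, answer, eos)
    return grid, sum(grid.values()) / len(grid)
```

Because recall is decided by argmax over logits alone, the harness needs no sampler and no per-family code, only a way to get logits for a token sequence.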

Per planning/telemetry-quality-metrics-design.md §5. driveGeneration reads the single InspectTap and, when FFAI_TELEMETRY=ppl is set, accumulates the model's self-perplexity over its own stream (NLL of each chosen token) and surfaces it on GenerationStats.genPerplexity. Default off: when the flag is unset the hot path is byte-for-byte the existing fused-kernel sampler — no logit readback, no softmax. .kld/.niah are documented as bench-path metrics (no paired reference / retrieval harness in the live decode loop).
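The self-perplexity described here is the exponential of the mean negative log-likelihood of the tokens the model actually chose. A minimal Python sketch of such an accumulator follows; the class and method names are hypothetical and do not reflect how driveGeneration or GenerationStats are implemented.

```python
import math


class SelfPerplexity:
    """Accumulates NLL of each chosen token over a generation stream."""

    def __init__(self):
        self.nll_sum = 0.0
        self.count = 0

    def observe(self, logits, chosen):
        """Add -log softmax(logits)[chosen], via a stable log-sum-exp."""
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        self.nll_sum += lse - logits[chosen]
        self.count += 1

    @property
    def perplexity(self):
        """exp(mean NLL), or None if no tokens were observed."""
        if self.count == 0:
            return None
        return math.exp(self.nll_sum / self.count)
```

In a decode loop, `observe` would only be called when the perplexity flag is set, so the unflagged path never reads logits back or computes a softmax.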

…epo refs

- Move Quality/{KLDivergence,LogitsEmitter}.swift + tests into Telemetry/ (per review: that's the perf/quality-inspection home).
- Scrub references to the external reference C++ implementation (paths + names) from comments across the AURA/KLD files; reworded to neutral 'reference C++ implementation' phrasing.

Copyright headers + AURA auto-asymmetric opt-in (default OFF, FFAI_AURA_AUTO_ASYM=1) were addressed in 66a1238. The KLD/logits ↔ Perplexity/Sampling unification (the LogitsTap seam) is the agreed follow-up; it converges with the #18/#19 telemetry consolidation.

github-actions Bot added the feature New feature or capability label Jun 3, 2026

TheTom force-pushed the tom/feat/telemetry-quality-metrics branch from c9d48fe to 8d84432 Compare June 3, 2026 21:42

ekryski mentioned this pull request Jun 3, 2026 in feat(gguf,dsv4): GGUF v3 reader + DeepSeek-V4-Flash GGUF loader & forward path #17 (Open)

TheTom added 5 commits June 3, 2026 19:05

TheTom force-pushed the tom/feat/telemetry-quality-metrics branch from 8d84432 to 71a8abc Compare June 4, 2026 00:23

TheTom wants to merge 5 commits into dev from tom/feat/telemetry-quality-metrics

2 participants
