
feat(telemetry): consolidate quality metrics behind one flag-gated tap #19

Draft
TheTom wants to merge 5 commits into dev from tom/feat/telemetry-quality-metrics

Contributor

@TheTom TheTom commented Jun 3, 2026

Implements the design in #18 (planning/telemetry-quality-metrics-design.md).

Draft: still blocked on #17 for green CI. FFAI dev does not build against current metaltile. metaltile's SDPA kernel gained has_sink/sink_logit params, and dev's Ops.swift call-sites haven't been synced (9 sink call-sites needed vs. dev's 3); #17 carries that Ops.swift sync. Build and test therefore stays red on those pre-existing Ops.swift errors, none of which are in the telemetry diff, until #17 merges; this branch then rebases clean and goes green. The telemetry change type-checks cleanly in isolation: a full module type-check surfaces 0 errors in Telemetry/, Benchmark/, Generation/Generate.swift, or FFAICLI/BenchCommand.swift, and every error is in the pre-existing Ops.swift drift.

Rebased onto current dev (picks up the #18 design-doc merge).

What this does

  • QualityScorable (Telemetry/) — the single model-facing contract: makeScoringCaches + scoringForward (full next-token logits). Model conforms by delegating to its engine, so every family is scorable for free. Logits-not-logprobs keeps the math centralized and KLD apples-to-apples.
  • Perplexity.swift moved Stats/ → Telemetry/; compute / klDivergence re-typed from the concrete Model to some QualityScorable. Argument labels kept → BenchRunner callers unchanged.
  • InspectTap gains QualityMetrics (.perplexity/.kld/.niah), parsed from FFAI_TELEMETRY=ppl,kld,…. Default [] ⇒ a disabled metric does zero work.
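The flag parsing described in the last bullet could look roughly like the sketch below. This is an illustration, not the PR's code: the raw values, the `parse` helper, the `niah` token spelling, and the choice to ignore unknown tokens are all assumptions. Only the three case names, the `FFAI_TELEMETRY` variable, and the `ppl`/`kld` tokens come from the description.

```swift
import Foundation

/// Hypothetical stand-in for the OptionSet that InspectTap carries.
struct QualityMetrics: OptionSet {
    let rawValue: UInt8
    static let perplexity = QualityMetrics(rawValue: 1 << 0)
    static let kld        = QualityMetrics(rawValue: 1 << 1)
    static let niah       = QualityMetrics(rawValue: 1 << 2)

    /// Parses a comma-separated value such as "ppl,kld".
    /// Assumption: unknown tokens are ignored and an unset variable yields [].
    static func parse(_ raw: String?) -> QualityMetrics {
        guard let raw else { return [] }
        var result: QualityMetrics = []
        for token in raw.split(separator: ",") {
            switch token.trimmingCharacters(in: .whitespaces).lowercased() {
            case "ppl":  result.insert(.perplexity)
            case "kld":  result.insert(.kld)
            case "niah": result.insert(.niah)   // token spelling is a guess
            default:     break
            }
        }
        return result
    }
}

// With the variable unset this is [], so a caller can skip all metric work.
let enabled = QualityMetrics.parse(ProcessInfo.processInfo.environment["FFAI_TELEMETRY"])
```

Because the default is the empty set, a single `enabled.isEmpty` check is enough for the caller to stay on the untouched path.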

Stats/ keeps the non-quality runtime stats (GenerationStats, MemoryStats, ThinkingSplit).

Follow-ups — now implemented in this PR

All four deferred items from the design now land here:

  • Live-generation capture wiring (§5) — driveGeneration reads the single InspectTap; FFAI_TELEMETRY=ppl accumulates the model's self-perplexity over its own stream onto GenerationStats.genPerplexity. Default off: the hot path is byte-for-byte the existing fused-kernel sampler when the flag is unset (no logit readback, no softmax). .kld/.niah are bench-path metrics (no paired reference / retrieval harness live in decode).
  • PPL/KLD methodology (§6) — context windowing + stride (WindowPlan: n_ctx window strided by n_ctx/2, scores each window's second half, every token counted once, ≥ n_ctx/2 left-context); BOS handling; reference-logit disk cache (ReferenceLogitCache, two-phase KLD: --save-ref-logits dumps the f16 full-vocab reference once, --ref-logits scores candidates against it with no second model resident); distribution reporting (KLDistribution: mean/median/p90/p99/max + top-1 agreement %). contextWindow=0 preserves the legacy single-pass numbers.
  • NIAH (§5/§8) — Telemetry/NIAH.swift: needle buried at a known depth in a long filler haystack, recalled via greedy argmax over scoringForward (rides the same QualityScorable contract); swept across a (context-length × depth) grid; bench --method niah.
  • Release-pinned benches (§7) — BenchMethod.isQualityMetric; the CLI hard-refuses a quality bench on a DEBUG build (--allow-debug-bench to override for smoke tests); release-pinned make bench.
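To make the windowing rule in the PPL/KLD bullet concrete, here is a minimal sketch of a stride-by-half plan. It is not the PR's WindowPlan: the `Window` type and `planWindows` function are hypothetical, and it assumes that the first n_ctx/2 tokens serve only as context and that a short final window is still scored.

```swift
/// Hypothetical window descriptor; not the PR's WindowPlan type.
struct Window: Equatable {
    let start: Int          // first token position fed to the model
    let end: Int            // one past the last token position fed
    let scored: Range<Int>  // absolute positions whose predictions are scored
}

/// Strides an n-token window by n/2 and scores only each window's second half.
func planWindows(tokenCount: Int, contextWindow n: Int) -> [Window] {
    precondition(n >= 2 && n % 2 == 0, "sketch assumes an even window size")
    let half = n / 2
    var windows: [Window] = []
    var start = 0
    // Stop once a window would have no tokens left in its second half.
    while start + half < tokenCount {
        let end = min(start + n, tokenCount)
        windows.append(Window(start: start, end: end, scored: (start + half)..<end))
        start += half
    }
    return windows
}

// 10 tokens with n_ctx = 4: the scored ranges tile positions 2..<10 without overlap.
for window in planWindows(tokenCount: 10, contextWindow: 4) {
    print(window.start, window.end, window.scored)
}
```

Each scored position sits at least n/2 tokens into its window, which is where the left-context guarantee comes from, and consecutive scored ranges abut, so no token is counted twice.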

New CLI surface: --wikitext2-context, --save-ref-logits, --ref-logits, --allow-debug-bench.
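The `--allow-debug-bench` gate can be pictured with the sketch below. Only `BenchMethod.isQualityMetric` and the flag name come from this PR; the enum cases, the error type, and the `ensureReleaseBuild` helper are hypothetical, and the build flavor is passed as a parameter so the logic can be exercised in any configuration.

```swift
/// Hypothetical subset of bench methods; the real enum's cases may differ.
enum BenchMethod {
    case decodeSpeed   // made-up example of a non-quality method
    case wikitext2
    case niah

    var isQualityMetric: Bool {
        switch self {
        case .wikitext2, .niah: return true
        case .decodeSpeed:      return false
        }
    }
}

struct DebugBenchRefused: Error {}

/// Refuses a quality bench on a debug build unless the override is set.
func ensureReleaseBuild(for method: BenchMethod,
                        allowDebugBench: Bool,
                        isDebugBuild: Bool) throws {
    if method.isQualityMetric && isDebugBuild && !allowDebugBench {
        throw DebugBenchRefused()
    }
}

// In a real CLI the build flavor would come from conditional compilation.
#if DEBUG
let isDebugBuild = true
#else
let isDebugBuild = false
#endif
```

Keeping the check in one function means the override stays an explicit, greppable escape hatch for smoke tests rather than a silent default.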

Verification

@github-actions github-actions Bot added the feature New feature or capability label Jun 3, 2026
@TheTom TheTom force-pushed the tom/feat/telemetry-quality-metrics branch from c9d48fe to 8d84432 Compare June 3, 2026 21:42
@ekryski

Collaborator

ekryski commented Jun 3, 2026

  • Live-generation capture wiring (consume the tap flags in the decode loop).
  • PPL/KLD methodology upgrades: context windowing + stride, reference-logit caching, distribution + top-1-agreement reporting.

Let's do these 2 things in this PR I think. Then we can call it done.

TheTom added 5 commits June 3, 2026 19:05
…move Perplexity to Telemetry/

Implements the quality-metrics consolidation design
(planning/telemetry-quality-metrics-design.md).

- QualityScorable (new, Telemetry/): the one model-facing contract —
  makeScoringCaches + scoringForward (returns full next-token logits).
  Model conforms by delegating to its engine, so every current family is
  scorable for free.
- Perplexity.swift moved Stats/ -> Telemetry/; compute/klDivergence re-typed
  from the concrete Model to `some QualityScorable` (argument labels kept, so
  callers are unchanged). The metric math is byte-for-byte identical.
- InspectTap gains a QualityMetrics OptionSet (.perplexity/.kld/.niah) parsed
  from FFAI_TELEMETRY=ppl,kld,...; default [] so a disabled metric does zero
  work. isCapturingMetrics / captures(_:) gate the (follow-up) live-capture path.

Stats/ keeps the non-quality runtime stats (GenerationStats, MemoryStats,
ThinkingSplit). NIAH, live-generation capture wiring, the PPL/KLD methodology
upgrades (windowing/stride, reference-logit caching, distribution reporting),
and the release-bench pin are follow-ups per the doc.

Per planning/telemetry-quality-metrics-design.md §7: forced-decode over a
corpus (wikitext2 PPL/KLD, niah) is far too slow in debug to publish. Add
BenchMethod.isQualityMetric, hard-refuse a quality bench on a DEBUG build
(--allow-debug-bench to override for smoke tests only), and a release-pinned
`make bench` target.

…stribution

Per planning/telemetry-quality-metrics-design.md §6:

- Context windowing + stride (§6.1): WindowPlan strides an n_ctx window by
  n_ctx/2 and scores only each window's second half, so every scored token
  carries >= n_ctx/2 of real left-context and is counted exactly once.
  contextWindow=0 keeps the legacy single-pass behaviour (back-compat).
- BOS / first-token handling (§6.2): prepend <bos> once for BOS-critical
  families before scoring the corpus.
- KLD reference-logit caching (§6.3): two-phase KLD via ReferenceLogitCache —
  dump the full-precision reference's per-position log-probs to disk (f16)
  once, then score each candidate against the file without co-loading a
  reference model. CLI: --save-ref-logits (phase A) / --ref-logits (phase B).
- Distribution reporting (§6.4): KLDistribution carries mean/median/p90/p99/max
  plus top-1 agreement %, printed by the bench; the report row keeps the mean.

Wired through BenchOptions / BenchRunner.runWikiText2 / BenchCommand
(--wikitext2-context). Adds pure-logic WindowPlanTests (coverage + percentiles).
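
The distribution report from §6.4 could be computed along these lines. This is a sketch under assumptions, not the PR's KLDistribution: all function and type names are hypothetical, inputs are taken as raw logits per position, KL is measured in nats as KL(reference ‖ candidate), and percentiles use the nearest-rank convention (the PR may use another).

```swift
import Foundation

/// Numerically stable log-softmax over raw logits.
func logSoftmax(_ logits: [Double]) -> [Double] {
    let maxLogit = logits.max() ?? 0
    let logSum = log(logits.reduce(0.0) { $0 + exp($1 - maxLogit) })
    return logits.map { $0 - maxLogit - logSum }
}

/// KL(reference ‖ candidate) for one position, in nats.
func klDivergence(reference: [Double], candidate: [Double]) -> Double {
    let p = logSoftmax(reference)
    let q = logSoftmax(candidate)
    return zip(p, q).reduce(0.0) { $0 + exp($1.0) * ($1.0 - $1.1) }
}

/// Nearest-rank percentile over an ascending-sorted, non-empty array.
func percentile(_ sorted: [Double], _ p: Double) -> Double {
    precondition(!sorted.isEmpty)
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[min(max(rank, 1), sorted.count) - 1]
}

func argmax(_ values: [Double]) -> Int? {
    values.indices.max { values[$0] < values[$1] }
}

/// Hypothetical summary mirroring the fields named in the commit message.
struct KLSummary {
    let mean: Double, median: Double, p90: Double, p99: Double, maximum: Double
    let top1Agreement: Double   // fraction in 0...1
}

func summarize(reference: [[Double]], candidate: [[Double]]) -> KLSummary {
    precondition(!reference.isEmpty && reference.count == candidate.count)
    let klds = zip(reference, candidate)
        .map { klDivergence(reference: $0, candidate: $1) }
        .sorted()
    let agreeing = zip(reference, candidate).filter { argmax($0) == argmax($1) }.count
    return KLSummary(mean: klds.reduce(0.0, +) / Double(klds.count),
                     median: percentile(klds, 50),
                     p90: percentile(klds, 90),
                     p99: percentile(klds, 99),
                     maximum: klds[klds.count - 1],
                     top1Agreement: Double(agreeing) / Double(klds.count))
}
```

Reporting the tail (p99, max) alongside the mean matters because a candidate can match the reference on most positions while diverging badly on a few.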

Per planning/telemetry-quality-metrics-design.md §5/§8. Telemetry/NIAH.swift
buries a low-frequency needle fact at a known depth inside a long filler
haystack and asks the model to recall it, swept across a (context-length ×
depth) grid; reports recall accuracy. Rides the same QualityScorable
contract as PPL/KLD — the answer is greedy argmax over scoringForward
logits, no sampler, no per-family code (small Tokenizing/EOSProviding
capability protocols reach the tokenizer + EOS set; Model satisfies all).

Wired as bench --method niah (now isImplemented); BenchRunner.runNIAH prints
the grid + accuracy and stores the summary in the report row's preview.
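
The grid sweep described above can be sketched without a model by abstracting recall behind a closure. Everything here is illustrative: `buildHaystack`, `sweepAccuracy`, and the word-level haystack are assumptions, whereas the PR works on tokens and answers via greedy argmax over scoringForward.

```swift
/// Builds a filler haystack of `length` words with the needle inserted at a
/// fractional depth (0 = start, 1 = end). Word-level for illustration only.
func buildHaystack(filler: [String], needle: String, length: Int, depth: Double) -> [String] {
    precondition(!filler.isEmpty && length > 0 && (0.0...1.0).contains(depth))
    var words = (0..<length).map { filler[$0 % filler.count] }
    let position = min(Int(depth * Double(length)), length)
    words.insert(needle, at: position)
    return words
}

/// Sweeps a (context-length × depth) grid and returns recall accuracy in 0...1.
/// `recalls` stands in for running the model and checking its answer.
func sweepAccuracy(lengths: [Int], depths: [Double],
                   recalls: (_ length: Int, _ depth: Double) -> Bool) -> Double {
    precondition(!lengths.isEmpty && !depths.isEmpty)
    var hits = 0
    for length in lengths {
        for depth in depths where recalls(length, depth) {
            hits += 1
        }
    }
    return Double(hits) / Double(lengths.count * depths.count)
}

// Example: a fake model that only recalls needles in short contexts.
let accuracy = sweepAccuracy(lengths: [1024, 4096], depths: [0.0, 0.5, 1.0]) { length, _ in
    length <= 1024
}
print(accuracy)
```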

Per planning/telemetry-quality-metrics-design.md §5. driveGeneration reads
the single InspectTap and, when FFAI_TELEMETRY=ppl is set, accumulates the
model's self-perplexity over its own stream (NLL of each chosen token) and
surfaces it on GenerationStats.genPerplexity. Default off: when the flag is
unset the hot path is byte-for-byte the existing fused-kernel sampler — no
logit readback, no softmax. .kld/.niah are documented as bench-path metrics
(no paired reference / retrieval harness in the live decode loop).
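
The self-perplexity accumulation can be summarized as: take the negative log-probability of each token the sampler chose, average it, and exponentiate. A minimal sketch follows, assuming the per-step logits are available as a `[Float]`; the `SelfPerplexity` type and its members are hypothetical, not the PR's GenerationStats plumbing.

```swift
import Foundation

/// Hypothetical accumulator for perplexity over the model's own stream.
struct SelfPerplexity {
    private(set) var totalNLL = 0.0
    private(set) var count = 0

    /// Adds the negative log-likelihood (nats) of the chosen token.
    mutating func record(logits: [Float], chosen: Int) {
        precondition(logits.indices.contains(chosen))
        // Stable log-softmax: subtract the max before exponentiating.
        let maxLogit = logits.max() ?? 0
        let logSum = log(logits.reduce(0.0) { $0 + exp(Double($1 - maxLogit)) })
        let logProb = Double(logits[chosen] - maxLogit) - logSum
        totalNLL -= logProb
        count += 1
    }

    /// exp(mean NLL); nil until at least one token has been recorded.
    var perplexity: Double? {
        count == 0 ? nil : exp(totalNLL / Double(count))
    }
}

// A uniform distribution over 4 tokens gives a perplexity of about 4.
var tracker = SelfPerplexity()
tracker.record(logits: [0, 0, 0, 0], chosen: 1)
```

This is exactly the work the unset-flag path avoids: it needs the full logits on the host and a softmax per step, which the fused-kernel sampler otherwise never materializes.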
@TheTom TheTom force-pushed the tom/feat/telemetry-quality-metrics branch from 8d84432 to 71a8abc Compare June 4, 2026 00:23
TheTom added a commit that referenced this pull request Jun 4, 2026
…epo refs

- Move Quality/{KLDivergence,LogitsEmitter}.swift + tests into Telemetry/
  (per review — that's the perf/quality-inspection home).
- Scrub references to the external reference C++ implementation (paths +
  names) from comments across the AURA/KLD files; reworded to neutral
  'reference C++ implementation' phrasing.

Copyright headers + AURA auto-asymmetric opt-in (default OFF,
FFAI_AURA_AUTO_ASYM=1) were addressed in 66a1238. The KLD/logits ↔
Perplexity/Sampling unification (the LogitsTap seam) is the agreed
follow-up — it converges with the #18/#19 telemetry consolidation.

Labels

feature New feature or capability


2 participants