feat(telemetry): consolidate quality metrics behind one flag-gated tap #19
Draft
TheTom wants to merge 5 commits into
c9d48fe to 8d84432
Collaborator
Let's do these 2 things in this PR I think. Then we can call it done.
…move Perplexity to Telemetry/

Implements the quality-metrics consolidation design (planning/telemetry-quality-metrics-design.md).

- QualityScorable (new, Telemetry/): the one model-facing contract, makeScoringCaches + scoringForward (returns full next-token logits). Model conforms by delegating to its engine, so every current family is scorable for free.
- Perplexity.swift moved Stats/ -> Telemetry/; compute/klDivergence re-typed from the concrete Model to `some QualityScorable` (argument labels kept, so callers are unchanged). The metric math is byte-for-byte identical.
- InspectTap gains a QualityMetrics OptionSet (.perplexity/.kld/.niah) parsed from FFAI_TELEMETRY=ppl,kld,...; default [] so a disabled metric does zero work. isCapturingMetrics / captures(_:) gate the (follow-up) live-capture path.

Stats/ keeps the non-quality runtime stats (GenerationStats, MemoryStats, ThinkingSplit). NIAH, live-generation capture wiring, the PPL/KLD methodology upgrades (windowing/stride, reference-logit caching, distribution reporting), and the release-bench pin are follow-ups per the doc.
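The flag parsing described in this commit can be sketched roughly as below. This is an illustration only, not the PR's code: the `parse` helper is a hypothetical name, the `niah` token is an assumption (the commit message only spells out `ppl` and `kld`), and the choice to ignore unknown tokens is likewise assumed.

```swift
import Foundation

/// Hypothetical stand-in for the PR's QualityMetrics option set.
struct QualityMetrics: OptionSet {
    let rawValue: Int
    static let perplexity = QualityMetrics(rawValue: 1 << 0)
    static let kld        = QualityMetrics(rawValue: 1 << 1)
    static let niah       = QualityMetrics(rawValue: 1 << 2)

    /// Parse a comma-separated flag value such as "ppl,kld".
    /// Assumption: unknown tokens are ignored; nil or empty input yields [].
    static func parse(_ raw: String?) -> QualityMetrics {
        guard let raw else { return [] }
        var result: QualityMetrics = []
        for token in raw.split(separator: ",") {
            switch token.trimmingCharacters(in: .whitespaces).lowercased() {
            case "ppl":  result.insert(.perplexity)
            case "kld":  result.insert(.kld)
            case "niah": result.insert(.niah)   // token name assumed
            default:     break
            }
        }
        return result
    }
}

// Read the flag once; an unset variable leaves every metric disabled.
let enabledMetrics = QualityMetrics.parse(
    ProcessInfo.processInfo.environment["FFAI_TELEMETRY"])
```

Because the default is the empty set, a caller can guard each metric with `enabledMetrics.contains(.perplexity)` and skip the work entirely when the flag is unset.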
Per planning/telemetry-quality-metrics-design.md §7: forced-decode over a corpus (wikitext2 PPL/KLD, niah) is far too slow in debug to publish. Add BenchMethod.isQualityMetric, hard-refuse a quality bench on a DEBUG build (--allow-debug-bench to override for smoke tests only), and a release-pinned `make bench` target.
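A minimal sketch of the debug-build refusal, assuming a simplified `BenchMethod` enum. The case names and the `refusesBench` helper are hypothetical; only `isQualityMetric` and the `--allow-debug-bench` override come from the commit message.

```swift
/// Hypothetical subset of bench methods; the real enum likely has more cases.
enum BenchMethod {
    case throughput   // assumed non-quality method
    case wikitext2
    case niah

    /// True for corpus-driven quality metrics that are too slow to publish from debug.
    var isQualityMetric: Bool {
        switch self {
        case .wikitext2, .niah: return true
        case .throughput:       return false
        }
    }
}

/// Returns true when the run should be refused: a quality bench on a debug
/// build without the explicit override.
func refusesBench(_ method: BenchMethod, isDebugBuild: Bool, allowDebugBench: Bool) -> Bool {
    method.isQualityMetric && isDebugBuild && !allowDebugBench
}

// Compile-time detection of the build configuration.
#if DEBUG
let isDebugBuild = true
#else
let isDebugBuild = false
#endif
```

Keeping the decision in a pure function makes the policy testable without building twice; the `#if DEBUG` value is passed in at the call site.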
…stribution

Per planning/telemetry-quality-metrics-design.md §6:

- Context windowing + stride (§6.1): WindowPlan strides an n_ctx window by n_ctx/2 and scores only each window's second half, so every scored token carries >= n_ctx/2 of real left-context and is counted exactly once. contextWindow=0 keeps the legacy single-pass behaviour (back-compat).
- BOS / first-token handling (§6.2): prepend <bos> once for BOS-critical families before scoring the corpus.
- KLD reference-logit caching (§6.3): two-phase KLD via ReferenceLogitCache: dump the full-precision reference's per-position log-probs to disk (f16) once, then score each candidate against the file without co-loading a reference model. CLI: --save-ref-logits (phase A) / --ref-logits (phase B).
- Distribution reporting (§6.4): KLDistribution carries mean/median/p90/p99/max plus top-1 agreement %, printed by the bench; the report row keeps the mean.

Wired through BenchOptions / BenchRunner.runWikiText2 / BenchCommand (--wikitext2-context). Adds pure-logic WindowPlanTests (coverage + percentiles).
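The windowing rule (§6.1) can be sketched as follows. The `Window` type and `planWindows` function are hypothetical names, not the PR's `WindowPlan` API. The sketch assumes an even window size and leaves the first n_ctx/2 tokens unscored, since they cannot have that much left-context; the real implementation may treat the first window and the tail differently.

```swift
/// One forward pass over tokens[start..<end]; only `scored` contributes to the metric.
struct Window: Equatable {
    let start: Int          // first token index fed to the model
    let end: Int            // one past the last token fed
    let scored: Range<Int>  // token indices whose loss is accumulated
}

/// Stride a `ctx`-token window by ctx/2 and score only each window's second half.
func planWindows(tokenCount n: Int, contextWindow ctx: Int) -> [Window] {
    precondition(ctx >= 2 && ctx % 2 == 0, "sketch assumes an even window size")
    let half = ctx / 2
    var windows: [Window] = []
    var start = 0
    // Continue while the window still has at least one token past its first half.
    while start + half < n {
        let end = min(start + ctx, n)
        windows.append(Window(start: start, end: end, scored: (start + half)..<end))
        start += half
    }
    return windows
}

// Example: 10 tokens with a 4-token window gives four windows scoring 2..<10.
let plan = planWindows(tokenCount: 10, contextWindow: 4)
```

Each scored index sits at least ctx/2 positions after its window's start, and consecutive scored ranges tile the corpus without overlap, which is the invariant the commit describes.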
Per planning/telemetry-quality-metrics-design.md §5/§8. Telemetry/NIAH.swift buries a low-frequency needle fact at a known depth inside a long filler haystack and asks the model to recall it, swept across a (context-length × depth) grid; reports recall accuracy. Rides the same QualityScorable contract as PPL/KLD — the answer is greedy argmax over scoringForward logits, no sampler, no per-family code (small Tokenizing/EOSProviding capability protocols reach the tokenizer + EOS set; Model satisfies all). Wired as bench --method niah (now isImplemented); BenchRunner.runNIAH prints the grid + accuracy and stores the summary in the report row's preview.
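Two pieces of the NIAH harness lend themselves to a small sketch: placing the needle at a fractional depth and picking the answer by greedy argmax. Both helpers below are hypothetical and work on plain strings and float arrays; how the PR tokenizes the haystack, measures depth, and decides recall is not shown in the commit message.

```swift
/// Insert `needle` into `filler` at a fractional depth (0 = start, 1 = end).
/// Assumption: depth is measured in filler sentences rather than tokens.
func buildHaystack(filler: [String], needle: String, depth: Double) -> [String] {
    let clamped = min(max(depth, 0), 1)
    let index = Int((Double(filler.count) * clamped).rounded())
    var haystack = filler
    haystack.insert(needle, at: index)
    return haystack
}

/// Greedy decode step: index of the largest logit, or nil for an empty vector.
/// Ties resolve to the lowest index.
func argmax(_ logits: [Float]) -> Int? {
    guard var best = logits.indices.first else { return nil }
    for i in logits.indices where logits[i] > logits[best] {
        best = i
    }
    return best
}

// Example: needle placed halfway through a four-sentence filler.
let haystack = buildHaystack(filler: ["a", "b", "c", "d"], needle: "NEEDLE", depth: 0.5)
```

Sweeping `depth` over a few values for each context length yields the (context-length × depth) grid the commit refers to.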
Per planning/telemetry-quality-metrics-design.md §5. driveGeneration reads the single InspectTap and, when FFAI_TELEMETRY=ppl is set, accumulates the model's self-perplexity over its own stream (NLL of each chosen token) and surfaces it on GenerationStats.genPerplexity. Default off: when the flag is unset the hot path is byte-for-byte the existing fused-kernel sampler — no logit readback, no softmax. .kld/.niah are documented as bench-path metrics (no paired reference / retrieval harness in the live decode loop).
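Self-perplexity over the model's own stream is the exponential of the mean negative log-likelihood of the chosen tokens. A minimal sketch of such an accumulator follows; the `SelfPerplexity` type is a hypothetical name, and the PR's actual capture path inside `driveGeneration` is not shown here.

```swift
import Foundation

/// Accumulates the NLL of each chosen token and reports exp(mean NLL).
struct SelfPerplexity {
    private(set) var nllSum = 0.0
    private(set) var count = 0

    /// `logits` are the raw next-token logits; `chosen` is the sampled token id.
    mutating func observe(logits: [Float], chosen: Int) {
        precondition(logits.indices.contains(chosen), "chosen token out of range")
        // Numerically stable log-sum-exp: subtract the max before exponentiating.
        let maxLogit = Double(logits.max()!)
        let sumExp = logits.reduce(0.0) { $0 + exp(Double($1) - maxLogit) }
        let logSumExp = maxLogit + log(sumExp)
        nllSum += logSumExp - Double(logits[chosen])
        count += 1
    }

    /// nil until at least one token has been observed.
    var perplexity: Double? {
        count == 0 ? nil : exp(nllSum / Double(count))
    }
}

// Uniform logits over four tokens give a perplexity of about 4.
var acc = SelfPerplexity()
acc.observe(logits: [0, 0, 0, 0], chosen: 2)
```

This is the work the default-off path avoids: when the flag is unset there is no reason to read logits back or run the softmax.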
8d84432 to 71a8abc
TheTom added a commit that referenced this pull request Jun 4, 2026
…epo refs
- Move Quality/{KLDivergence,LogitsEmitter}.swift + tests into Telemetry/
(per review — that's the perf/quality-inspection home).
- Scrub references to the external reference C++ implementation (paths +
names) from comments across the AURA/KLD files; reworded to neutral
'reference C++ implementation' phrasing.
Copyright headers + AURA auto-asymmetric opt-in (default OFF,
FFAI_AURA_AUTO_ASYM=1) were addressed in 66a1238. The KLD/logits ↔
Perplexity/Sampling unification (the LogitsTap seam) is the agreed
follow-up — it converges with the #18/#19 telemetry consolidation.
Implements the design in #18 (`planning/telemetry-quality-metrics-design.md`). Rebased onto current `dev` (picks up the #18 design-doc merge).

What this does
- `QualityScorable` (`Telemetry/`): the single model-facing contract, `makeScoringCaches` + `scoringForward` (full next-token logits). `Model` conforms by delegating to its `engine`, so every family is scorable for free. Logits-not-logprobs keeps the math centralized and KLD apples-to-apples.
- `Perplexity.swift` moved `Stats/` → `Telemetry/`; `compute` / `klDivergence` re-typed to `some QualityScorable`. Argument labels kept → `BenchRunner` callers unchanged.
- `InspectTap` gains `QualityMetrics` (`.perplexity` / `.kld` / `.niah`), parsed from `FFAI_TELEMETRY=ppl,kld,…`. Default `[]` ⇒ a disabled metric does zero work.
- `Stats/` keeps the non-quality runtime stats (`GenerationStats`, `MemoryStats`, `ThinkingSplit`).

Follow-ups: now implemented in this PR
All four deferred items from the design now land here:
- Live-generation capture: `driveGeneration` reads the single `InspectTap`; `FFAI_TELEMETRY=ppl` accumulates the model's self-perplexity over its own stream onto `GenerationStats.genPerplexity`. Default off: the hot path is byte-for-byte the existing fused-kernel sampler when the flag is unset (no logit readback, no softmax). `.kld` / `.niah` are bench-path metrics (no paired reference / retrieval harness live in decode).
- PPL/KLD methodology upgrades: context windowing + stride (`WindowPlan`: `n_ctx` window strided by `n_ctx/2`, scores each window's second half, every token counted once, ≥ `n_ctx/2` left-context); BOS handling; reference-logit disk cache (`ReferenceLogitCache`, two-phase KLD: `--save-ref-logits` dumps the f16 full-vocab reference once, `--ref-logits` scores candidates against it with no second model resident); distribution reporting (`KLDistribution`: mean/median/p90/p99/max + top-1 agreement %). `contextWindow=0` preserves the legacy single-pass numbers.
- NIAH: `Telemetry/NIAH.swift` buries a needle at a known depth in a long filler haystack and recalls it via greedy argmax over `scoringForward` (rides the same `QualityScorable` contract), swept across a (context-length × depth) grid; `bench --method niah`.
- Release-bench pin: `BenchMethod.isQualityMetric`; the CLI hard-refuses a quality bench on a `DEBUG` build (`--allow-debug-bench` to override for smoke tests); release-pinned `make bench`.

New CLI surface:
`--wikitext2-context`, `--save-ref-logits`, `--ref-logits`, `--allow-debug-bench`.

Verification
- … `Ops.swift` sink drift carried by feat(gguf,dsv4): GGUF v3 reader + DeepSeek-V4-Flash GGUF loader & forward path #17).
- `WindowPlanTests` (no GPU): window plan covers every corpus token exactly once across a spread of `(n, ctx)`, ≥ `n_ctx/2` left-context invariant, and `KLDistribution` percentile/top-1 math.
- `Perplexity` per-position internals unchanged → `PerplexityTests` semantics preserved by construction.
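For reference, the kind of percentile/top-1 math these tests exercise can be sketched as below. `KLSummary`, `nearestRank`, and `summarize` are hypothetical names, and the nearest-rank percentile definition is an assumption; the PR's `KLDistribution` may interpolate or round differently.

```swift
/// Hypothetical summary mirroring the fields the PR lists for KLDistribution.
struct KLSummary {
    let mean, median, p90, p99, max: Double
    let top1AgreementPercent: Double
}

/// Nearest-rank percentile over an ascending, non-empty array (assumed definition).
func nearestRank(_ sorted: [Double], percentile p: Int) -> Double {
    precondition(!sorted.isEmpty && (0...100).contains(p))
    let rank = (p * sorted.count + 99) / 100   // integer ceil(p * n / 100)
    return sorted[max(rank, 1) - 1]
}

/// Summarize per-token KL values plus per-token top-1 agreement flags.
/// Returns nil for empty or mismatched inputs.
func summarize(perTokenKL: [Double], top1Matches: [Bool]) -> KLSummary? {
    guard !perTokenKL.isEmpty, perTokenKL.count == top1Matches.count else { return nil }
    let sorted = perTokenKL.sorted()
    let n = Double(sorted.count)
    let agreeing = top1Matches.filter { $0 }.count
    return KLSummary(
        mean: sorted.reduce(0, +) / n,
        median: nearestRank(sorted, percentile: 50),
        p90: nearestRank(sorted, percentile: 90),
        p99: nearestRank(sorted, percentile: 99),
        max: sorted[sorted.count - 1],
        top1AgreementPercent: 100 * Double(agreeing) / n)
}
```

Reporting the tail (p90/p99/max) alongside the mean matters because a candidate can match the reference on average while diverging badly on a small fraction of positions.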