
feat(streaming): live transcription preview in HUD (Approach E) #59

Open
rdemeritt wants to merge 15 commits into main from feat/streaming-transcription

@rdemeritt

Summary

Adds live streaming transcription preview using on-device SFSpeechRecognizer while the user speaks. Final inject via the existing Whisper batch pipeline is unchanged. QA-approved across all three phases before merge.

Supersedes: PR #58 (closed when base branch was deleted after PR #57 merged).

What ships

P1 — Live preview MVP

  • AudioCaptureService: audioBufferPublisher emits pre-resample native-format buffers from the existing tap (both primary + device-change-restart sites)
  • SpeechRecognitionService (new): SFSpeechRecognizer with requiresOnDeviceRecognition = true; publishes partials; silently degrades if SR unavailable/unauthorized
  • TranscriptionPipeline: starts SR on startRecording(), cancels at top of stopAndTranscribe(), fires onPreviewClear after inject success and Whisper error
  • FloatingHUDView: caption strip below waveform (200×80 recording pill); "Listening…" placeholder
  • Info.plist: NSSpeechRecognitionUsageDescription

P2 — Configurable persistence

  • PreviewLingerMode enum with manual Codable + synthesized Hashable
  • SettingsManager.previewLingerMode, default .linger(seconds: 2)
  • Settings → Display: "After transcription completes" picker, 4 options, conditional on HUD enabled
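
For reviewers, the manual Codable shape might look roughly like the sketch below. Only `.linger(seconds:)` and `.untilInjected` are named in this PR; the `kind`/`seconds` JSON keys, the `Int` payload type, and the omission of the other two picker options are assumptions made for illustration, not the shipped encoding.

```swift
import Foundation

/// Sketch of a linger-mode enum with hand-written Codable and synthesized Hashable.
/// Only two of the four picker options are modelled here.
enum PreviewLingerMode: Hashable, Codable {
    case untilInjected
    case linger(seconds: Int)  // payload type is an assumption

    // Hypothetical key names; a fixed, hand-written layout keeps stored JSON stable.
    private enum CodingKeys: String, CodingKey { case kind, seconds }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        switch try container.decode(String.self, forKey: .kind) {
        case "untilInjected":
            self = .untilInjected
        case "linger":
            self = .linger(seconds: try container.decode(Int.self, forKey: .seconds))
        default:
            throw DecodingError.dataCorruptedError(
                forKey: .kind, in: container, debugDescription: "Unknown linger mode")
        }
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .untilInjected:
            try container.encode("untilInjected", forKey: .kind)
        case .linger(let seconds):
            try container.encode("linger", forKey: .kind)
            try container.encode(seconds, forKey: .seconds)
        }
    }
}
```

Because Hashable is synthesized, each value can be used directly as a SwiftUI Picker tag, and the explicit string discriminator means renaming a case later does not silently change what is stored.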

P3 — Resilience + polish

  • SR auth check; .notDetermined triggers async system prompt, bails for session; all exits logged
  • Recognition task errors non-fatal — recording/inject unaffected
  • PipelineError.previewUnavailable(String) with .info severity
  • HalideTokens.fontCaption, "Listening…" placeholder, inject flash animation, accessibilityReduceMotion

QA gates

  • P1: 8/8 ✅ (1 fix: onPreviewClear on Whisper error path)
  • P2: 8/8 ✅
  • P3 final regression: 27/27 ✅
  • Build: debug + release clean; 361 tests, 0 failures; SwiftLint 0 errors

Test plan

  • Hold hotkey and speak — "Listening…" appears; partial text streams within ~300ms
  • Release — preview persists during Whisper processing, clears per linger setting
  • Settings → "After transcription completes" — all 4 options work; persists across relaunch
  • Deny Speech Recognition — recording and inject still work; no caption, no crash
  • HUD disabled — no caption visible
  • Long recording (>60s) — SR times out gracefully; Whisper inject unaffected

rdemeritt added 15 commits May 22, 2026 14:06
Emit the pre-resample native-format AVAudioPCMBuffer from both
installTap sites (primary + device-change-restart path). SpeechRecognitionService
subscribes to this publisher to feed SFSpeechAudioBufferRecognitionRequest,
which requires the hardware's native format rather than the 16 kHz resampled buffer.

New @MainActor service wrapping SFSpeechRecognizer. Streams partial
transcription strings while the user holds the hotkey. Silently degrades
when on-device recognition is unavailable so recording and Whisper injection
are unaffected. Includes requestAuthorization static helper.

stop() cancels the buffer subscription before endAudio() to prevent
append-after-endAudio crashes.
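
A minimal sketch of the start/stop ordering described above, assuming a Combine publisher of native-format buffers. `PreviewRecognizer`, `partials`, `isRunning`, and the `session` counter are hypothetical names for illustration, and the sketch uses a plain class called from the main thread rather than the `@MainActor` isolation used in this PR; the Speech framework calls themselves are the public API.

```swift
import AVFoundation
import Combine
import Speech

/// Hypothetical stand-in for SpeechRecognitionService. Call from the main thread.
final class PreviewRecognizer {
    /// Emits the best partial transcription so far.
    let partials = PassthroughSubject<String, Never>()
    var isRunning: Bool { task != nil }

    private let recognizer = SFSpeechRecognizer()  // nil when the locale is unsupported
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?
    private var bufferSubscription: AnyCancellable?
    private var session = 0  // lets late callbacks from an old session be ignored

    /// Returns false, leaving recording untouched, whenever the preview cannot run.
    @discardableResult
    func start(buffers: AnyPublisher<AVAudioPCMBuffer, Never>) -> Bool {
        stop()
        guard SFSpeechRecognizer.authorizationStatus() == .authorized,
              let recognizer, recognizer.isAvailable,
              recognizer.supportsOnDeviceRecognition else { return false }

        let request = SFSpeechAudioBufferRecognitionRequest()
        request.requiresOnDeviceRecognition = true
        request.shouldReportPartialResults = true
        self.request = request

        session += 1
        let current = session
        // Feed the native-format buffers straight into the request.
        bufferSubscription = buffers.sink { request.append($0) }

        task = recognizer.recognitionTask(with: request) { [weak self] result, error in
            let text = result?.bestTranscription.formattedString
            // The handler may run off the main thread; hop back before touching state.
            DispatchQueue.main.async {
                guard let self, self.session == current else { return }
                if let text { self.partials.send(text) }
                if error != nil { self.stop() }  // non-fatal: only the preview ends
            }
        }
        return true
    }

    func stop() {
        session += 1
        bufferSubscription?.cancel()  // first, so no buffer is appended after endAudio()
        bufferSubscription = nil
        request?.endAudio()
        task?.cancel()
        request = nil
        task = nil
    }
}
```

Calling `stop()` when nothing is running is a no-op, so the pipeline can call it unconditionally at the top of its stop path.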
…blisher

- partialTranscriptPublisher (nonisolated PassthroughSubject) forwards SR partials to callers
- onPreviewClear callback fires after dispatchOutput completes (untilInjected semantics)
- SpeechRecognitionService starts after audioCapture.startCapture() and stops at top of stopAndTranscribe()
- srCancellable forwards SR partials to partialTranscriptPublisher; cancelled on hotkey release
- Add PreviewLingerMode enum and SettingsKey.previewLingerMode (P2 accessor deferred)

- FloatingHUDViewModel gains partialTranscript, showCaptionStrip,
  subscribeToPartials(), and clearPreview(); notifyRecordingStopped()
  no longer clears the transcript (pipeline fires onPreviewClear instead)
- WaveformHUDView grows from 56 pt to 80 pt when the caption strip is visible;
  Text row fades in with 0.15 s delay after pill expansion
- FloatingHUDWindow.recordingPanelHeight bumped to 80 pt
- FloatingHUDWindowController gains subscribeToPartials() and clearPreview() forwarders
…ge description

- NSSpeechRecognitionUsageDescription added to Info.plist
- main.swift wires pipeline.partialTranscriptPublisher → hud.subscribeToPartials()
  and pipeline.onPreviewClear → hud.clearPreview() after HUD controller setup
…ngs accessor

Manual Codable keeps JSON stable; Hashable enables SwiftUI Picker tags.
SettingsManager.previewLingerMode accessor defaults to .linger(seconds: 2).

- FloatingHUDViewModel: handlePreviewClear(mode:) with Task-based linger
  timer; clearPreview() cancels any in-flight timer and resets state;
  notifyRecordingStarted() calls clearPreview() to reset on new recording
- FloatingHUDWindowController: handlePreviewClear(mode:) forwarding method
- main.swift: onPreviewClear reads settingsManager.previewLingerMode live
  at inject time; no polling observer needed
- SettingsView: "After transcription completes" Picker inside Display
  section, visible only when HUD is enabled

- Request auth when .notDetermined; bail silently on first recording
- Log [SR] warnings on all guard exits (not authorized, unavailable, on-device unsupported)
- Handle recognition task error: log and cleanupTask() instead of propagating
- Add os.log import to satisfy FileLogger dependency chain

New case signals SR preview was unavailable this session without surfacing
to the user. Wired into errorDescription, severity (.info), and
isUserActionable (false). main.swift switch updated for exhaustiveness.
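
As a rough illustration of the wiring described above: the sketch below assumes a three-level `ErrorSeverity` enum and adds a hypothetical `transcriptionFailed` case purely for contrast. Only `previewUnavailable(String)`, its `.info` severity, and `isUserActionable == false` come from this PR; the message strings are made up.

```swift
import Foundation

/// Hypothetical severity scale; only `.info` is named in this PR.
enum ErrorSeverity { case info, warning, error }

enum PipelineError: LocalizedError {
    case previewUnavailable(String)
    case transcriptionFailed(String)  // hypothetical sibling case, for contrast only

    var errorDescription: String? {
        switch self {
        case .previewUnavailable(let reason):
            return "Live preview unavailable: \(reason)"
        case .transcriptionFailed(let reason):
            return "Transcription failed: \(reason)"
        }
    }

    /// Preview loss is informational: recording and inject still work.
    var severity: ErrorSeverity {
        switch self {
        case .previewUnavailable: return .info
        case .transcriptionFailed: return .error
        }
    }

    /// The user cannot fix a missing preview mid-session, so it is not actionable.
    var isUserActionable: Bool {
        switch self {
        case .previewUnavailable: return false
        case .transcriptionFailed: return true
        }
    }
}
```

Keeping every property an exhaustive `switch` with no `default` is what forces call sites such as main.swift to be updated whenever a case is added.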
…pography tokens

DesignTokens:
- Add HalideTokens.fontCaption (11pt rounded regular)

FloatingHUDView:
- Caption strip always present during recording (height fixed at 80pt)
- "Listening..." placeholder shown when isRecording && !showCaptionStrip
- Placeholder uses HalideTokens.textTertiary; partial text uses textSecondary
- Caption text uses HalideTokens.fontCaption (replaces hardcoded .system call)
- captionHighlighted: Bool drives 400ms textPrimary flash on preview clear
- handlePreviewClear: .untilInjected and .linger both trigger flash before countdown
- Caption .animation respects @Environment(\.accessibilityReduceMotion)
- Remove redundant showCaptionStrip size animation binding

Replaces tail-truncation with a sliding window that shows the last 7 words
of the live partial transcript. A left-edge LinearGradient fade mask signals
that earlier words have scrolled off. Fade suppressed for fewer than 8 words
and when accessibilityReduceMotion is on.
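
The windowing rule can be sketched as a pure function. `captionWindow(for:maxWords:reduceMotion:)` is a hypothetical helper name, and splitting on whitespace is an assumption about how words are counted; the 7-word window and the fade conditions follow the description above.

```swift
/// Returns the text to display and whether the left-edge fade should be drawn.
/// Hypothetical helper; the view would apply the gradient mask when `showsFade` is true.
func captionWindow(
    for transcript: String,
    maxWords: Int = 7,
    reduceMotion: Bool = false
) -> (text: String, showsFade: Bool) {
    // Split on any whitespace; empty subsequences are dropped by default.
    let words = transcript.split(whereSeparator: { $0.isWhitespace })
    // Keep only the trailing window of words.
    let visible = words.suffix(maxWords).joined(separator: " ")
    // Fade only when earlier words were actually dropped and motion is allowed.
    let showsFade = words.count > maxWords && !reduceMotion
    return (visible, showsFade)
}
```

For example, a nine-word partial keeps words three through nine and turns the fade on, while a seven-word partial is shown whole with no fade.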

`brew` and `swiftlint` aren't in the self-hosted runner's default PATH.
Use /opt/homebrew/bin/ prefix, consistent with the SQLCipher install step.

Frame-then-padding order was clipping the inset to zero: the 200pt text
frame filled the pill and `.padding(.horizontal, 10)` overflowed and was
clipped by the parent. Flip to padding-first so the text content area is
172pt (200 - 28pt) with equal 14pt margins inside the pill.

- sorted_imports: fix import order in SpeechRecognitionService.swift
- unneeded_break_in_switch: remove redundant break in errorHold case
- file_length: move VoiceCommand.displayLabel to VoiceCommand+UI.swift,
  bringing FloatingHUDWindow.swift from 404 to 389 lines