feat: add voice mode for chat input #1159
Adds a voice input button to the chat composer with three transcription engines: in-browser Web Speech API, in-browser Whisper via @huggingface/transformers, and a server-side OpenAI Whisper fallback gated on KANDEV_VOICE_OPENAI_API_KEY.

User-configurable from a new /settings/voice-mode page:

- Master on/off toggle (defaults to on; opt-out, not opt-in)
- Engine preference (auto / Web Speech / Whisper Web / Whisper Server)
- Language hint (auto-detect or BCP-47 pinned)
- Activation mode (click-to-toggle or hold-to-talk)
- Auto-send after transcription
- Whisper Web model size (tiny / base / small)
- Editable Cmd+Shift+M keyboard shortcut

Voice settings live under user_settings.voice_mode as a nested JSON object, so no schema migration is needed. Setting changes broadcast via the existing user.settings.updated WS event so any open chat tab updates live.

The mic button is rendered on the right of the composer alongside the send button (round primary fill; destructive fill with pulse ring while recording). On browsers without a usable engine (e.g. iOS Chrome over HTTP) the button stays visible as a muted greyed icon; tapping it surfaces a toast with the reason (typically a missing secure context).

Whisper Web runs in a Web Worker with fp16 decoder weights (avoids the MatMulNBits scale-missing crash that q4/q8 hit on WASM). Models are pulled from the onnx-community/* mirror and cached by transformers.js in the Cache Storage on first use.
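The settings listed in the description above could be modelled roughly as follows. This is a minimal sketch, not the PR's code: the field names, the `parseVoiceModeSketch` helper, and every default except `enabled: true` (which the description states) are assumptions made for illustration.

```typescript
// Sketch only: field names and all defaults except `enabled` are assumptions.
type VoiceEngine = "auto" | "webSpeech" | "whisperWeb" | "whisperServer";
type Activation = "toggle" | "hold";
type WhisperModel = "tiny" | "base" | "small";

interface VoiceModeSettings {
  enabled: boolean;
  engine: VoiceEngine;
  language: string; // "auto" or a pinned BCP-47 tag
  activation: Activation;
  autoSend: boolean;
  whisperWebModel: WhisperModel;
}

const ENGINES: readonly VoiceEngine[] = ["auto", "webSpeech", "whisperWeb", "whisperServer"];
const ACTIVATIONS: readonly Activation[] = ["toggle", "hold"];
const MODELS: readonly WhisperModel[] = ["tiny", "base", "small"];

const DEFAULTS: VoiceModeSettings = {
  enabled: true, // opt-out, per the PR description
  engine: "auto", // assumed default
  language: "auto", // assumed default
  activation: "toggle", // assumed default
  autoSend: false, // assumed default
  whisperWebModel: "base", // assumed default
};

// Return `value` if it is one of the allowed literals, otherwise the fallback.
function pick<T extends string>(value: unknown, allowed: readonly T[], fallback: T): T {
  const match = allowed.find((candidate) => candidate === value);
  return match ?? fallback;
}

// Tolerant parser: missing or malformed fields fall back to defaults,
// so rows saved before the feature existed still yield a usable object.
function parseVoiceModeSketch(raw: unknown): VoiceModeSettings {
  const obj: Record<string, unknown> =
    typeof raw === "object" && raw !== null ? (raw as Record<string, unknown>) : {};
  const enabled = obj["enabled"];
  const language = obj["language"];
  const autoSend = obj["autoSend"];
  return {
    enabled: typeof enabled === "boolean" ? enabled : DEFAULTS.enabled,
    engine: pick(obj["engine"], ENGINES, DEFAULTS.engine),
    language: typeof language === "string" && language !== "" ? language : DEFAULTS.language,
    activation: pick(obj["activation"], ACTIVATIONS, DEFAULTS.activation),
    autoSend: typeof autoSend === "boolean" ? autoSend : DEFAULTS.autoSend,
    whisperWebModel: pick(obj["whisperWebModel"], MODELS, DEFAULTS.whisperWebModel),
  };
}
```

Parsing field by field, rather than casting the stored JSON, is one way to keep older rows and partially written objects from producing an invalid settings value.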
📝 Walkthrough

Adds full-stack voice input: server-side transcription fallback (OpenAI Whisper), client engines (Web Speech, Whisper Web, Whisper Server), user settings UI and persistence, chat composer integration with transcript insertion and optional auto-send, keyboard shortcut, worker for in-browser Whisper, and tests across backend and frontend.

Changes

Voice Input & Transcription Feature

Sequence Diagrams

sequenceDiagram
participant User
participant Chat as Chat Composer
participant Button as Voice Button
participant Hook as useVoiceInput
participant WebSpeech as Web Speech API
participant Server as Backend /api/v1/transcribe
participant OpenAI as OpenAI Whisper
User->>Chat: press voice shortcut or click voice button
Chat->>Button: trigger onVoiceTranscript/onAutoSend
Button->>Hook: start()
Hook->>WebSpeech: SpeechRecognition.start() or beginCapture()
WebSpeech-->>Hook: onresult with transcript
Hook->>Chat: onTranscript("hello world")
Chat->>Chat: insert at cursor
Hook->>Server: optional transcribe via server (capture path)
Server->>OpenAI: multipart upload
OpenAI-->>Server: {text: ...}
Server-->>Hook: {text: ...}
Hook-->>Chat: onTranscript("...from server")
sequenceDiagram
participant Browser
participant Worker as Whisper Web Worker
participant HF as Hugging Face CDN
Browser->>Worker: {type: 'init', model: 'base'}
Worker->>HF: fetch Transformers model files
HF-->>Worker: model bytes, emit progress
Worker-->>Browser: {type: 'ready'}
Browser->>Worker: {type: 'transcribe', audio: Float32Array}
Worker->>Worker: pipeline.transcribe(audio)
Worker-->>Browser: {type: 'result', text: '...'}
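
The worker exchange in the diagram above can be typed as a pair of discriminated unions. A minimal sketch: the `init`, `ready`, `transcribe`, and `result` messages come from the diagram, while the `progress` and `error` message shapes and the `ClientState` reducer are assumptions added for illustration.

```typescript
// Messages sent from the page to the worker (shapes taken from the diagram).
type WorkerRequest =
  | { type: "init"; model: "tiny" | "base" | "small" }
  | { type: "transcribe"; audio: Float32Array };

// Messages sent back. `progress` and `error` are assumed shapes.
type WorkerResponse =
  | { type: "progress"; progress: number }
  | { type: "ready" }
  | { type: "result"; text: string }
  | { type: "error"; message: string };

interface ClientState {
  ready: boolean;
  progress: number; // 0 to 1
  lastText: string | null;
  error: string | null;
}

const INITIAL_STATE: ClientState = { ready: false, progress: 0, lastText: null, error: null };

// Pure reducer: folds one worker response into the client-side state.
function applyResponse(state: ClientState, msg: WorkerResponse): ClientState {
  switch (msg.type) {
    case "progress":
      return { ...state, progress: msg.progress };
    case "ready":
      return { ...state, ready: true, progress: 1 };
    case "result":
      return { ...state, lastText: msg.text };
    case "error":
      return { ...state, error: msg.message };
  }
}

// Example request a page might post to start loading the base model.
const exampleInit: WorkerRequest = { type: "init", model: "base" };
```

Because the switch covers every member of `WorkerResponse`, the compiler flags any message type added later that the reducer does not handle.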
🎯 4 (Complex) | ⏱️ ~60 minutes
Claude finished @jcfs's task in 7m 46s

Code Review: feat: add voice mode for chat input

Findings

Blocker (must fix before merge)

Suggestion (recommended, doesn't block)

Summary

Verdict: Blocked. Fix blockers first. The backend work (transcription service, config, user settings persistence, validation) is well-structured: the …

| Filename | Overview |
|---|---|
| apps/backend/internal/voice/handlers/transcribe_handlers.go | New HTTP handler for POST /api/v1/transcribe; size limiting via MaxBytesReader with correct 413 handling, 503 short-circuit when unconfigured. |
| apps/backend/internal/voice/transcribe/service.go | OpenAI Whisper proxy; well-typed error handling (ErrNotConfigured, UpstreamError), MIME→extension mapping, and functional-options pattern for testability. |
| apps/backend/internal/user/store/sqlite.go | Adds voice_mode persistence using storedVoiceMode (*bool for enabled) to correctly handle pre-existing rows that predate the column. |
| apps/web/hooks/use-voice-input.ts | Core voice hook; driver abstraction (webSpeech / capture) is clean. whisperErrorMessage always returns code "model-load" for any Whisper error, including runtime transcription failures. |
| apps/web/lib/voice/whisper-web-client.ts | Robust Worker client with single-inflight guard, stale-worker error protection, and clean audio resample path. |
| apps/web/workers/whisper-web.worker.ts | transformers.js worker using fp16 decoder dtype to avoid MatMulNBits crash on WASM; model reuse guard prevents redundant downloads. |
| apps/web/lib/voice/whisper-web-models.ts | Model ID registry for onnx-community mirrors; contains a stale comment saying "q8 variant" while the worker actually uses fp16. |
| apps/web/components/task/chat/voice-input-button.tsx | Toggle/hold activation, keyboard shortcut binding, auto-send via rAF, and accessibility labels are all well-implemented. |
| apps/web/components/settings/voice-mode-settings.tsx | Settings page with optimistic updates, key-scoped rollback on failure, and live capability detection; correctly uses storeApi to avoid stale closures. |
| apps/pnpm-lock.yaml | Adds @huggingface/transformers@4.2.0 which pulls dev-build snapshots of onnxruntime-web and onnxruntime-common; also surfaces a CWE-502 deprecation on @ungap/structured-clone@1.3.0. |
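
The table's note on `whisperErrorMessage` (every Whisper failure is reported as `model-load`) suggests separating load failures from runtime failures. A minimal sketch of one way to do that: the `phase` parameter, the `transcription-failed` code, and the message wording are hypothetical and not taken from the PR.

```typescript
// Which stage the Whisper engine was in when it failed (hypothetical parameter).
type WhisperPhase = "load" | "transcribe";

interface VoiceError {
  // "model-load" appears in the review; "transcription-failed" is an assumed name.
  code: "model-load" | "transcription-failed";
  message: string;
}

// Map a raw failure to an error whose code reflects the stage that failed.
function whisperErrorFor(phase: WhisperPhase, err: unknown): VoiceError {
  const detail = err instanceof Error ? err.message : String(err);
  if (phase === "load") {
    return { code: "model-load", message: `Could not load the Whisper model: ${detail}` };
  }
  return { code: "transcription-failed", message: `Transcription failed: ${detail}` };
}
```

With distinct codes, the UI could offer a retry for a failed download and a different hint for a failed transcription.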
Sequence Diagram
sequenceDiagram
participant User
participant VoiceInputButton
participant useVoiceInput
participant WebSpeechAPI
participant WhisperWebClient
participant Worker as whisper-web.worker
participant BackendAPI as /api/v1/transcribe
participant OpenAI as OpenAI Whisper
User->>VoiceInputButton: click / hold / shortcut
VoiceInputButton->>useVoiceInput: start()
useVoiceInput->>useVoiceInput: resolveActiveEngine()
alt "engine = webSpeech"
useVoiceInput->>WebSpeechAPI: recognition.start()
WebSpeechAPI-->>useVoiceInput: onresult (transcript)
useVoiceInput-->>VoiceInputButton: onTranscript(text)
else "engine = whisperWeb"
useVoiceInput->>useVoiceInput: getUserMedia() → MediaRecorder.start()
User->>VoiceInputButton: stop
useVoiceInput->>WhisperWebClient: transcribe(blob)
WhisperWebClient->>WhisperWebClient: blobToWhisperFloat32()
WhisperWebClient->>Worker: "postMessage({type:transcribe, audio})"
Worker->>Worker: asrPipeline(audio)
Worker-->>WhisperWebClient: "postMessage({type:result, text})"
WhisperWebClient-->>useVoiceInput: text
useVoiceInput-->>VoiceInputButton: onTranscript(text)
else "engine = whisperServer"
useVoiceInput->>useVoiceInput: getUserMedia() → MediaRecorder.start()
User->>VoiceInputButton: stop
useVoiceInput->>BackendAPI: POST /api/v1/transcribe
BackendAPI->>OpenAI: POST /v1/audio/transcriptions
OpenAI-->>BackendAPI: "{text}"
BackendAPI-->>useVoiceInput: "{text}"
useVoiceInput-->>VoiceInputButton: onTranscript(text)
end
VoiceInputButton->>VoiceInputButton: insertText into editor
opt auto-send enabled
VoiceInputButton->>VoiceInputButton: rAF → handleVoiceAutoSend()
end
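
The `resolveActiveEngine()` step in the diagram above picks one engine from the user's preference and what the browser and server can actually do. A minimal sketch under stated assumptions: the capability flags, the `auto` fallback order, and the `resolveEngineSketch` name are illustrative and may differ from the PR's implementation.

```typescript
type Engine = "webSpeech" | "whisperWeb" | "whisperServer";
type EnginePreference = Engine | "auto";

// Which engines are usable right now (hypothetical capability flags).
interface Capabilities {
  webSpeech: boolean;
  whisperWeb: boolean;
  whisperServer: boolean;
}

// Assumed fallback order for "auto"; the real order may differ.
const AUTO_ORDER: readonly Engine[] = ["webSpeech", "whisperWeb", "whisperServer"];

// Returns the engine to use, or null when nothing usable is available
// (the case where the mic button stays visible but muted).
function resolveEngineSketch(pref: EnginePreference, caps: Capabilities): Engine | null {
  if (pref !== "auto") {
    return caps[pref] ? pref : null;
  }
  for (const engine of AUTO_ORDER) {
    if (caps[engine]) return engine;
  }
  return null;
}
```

Returning `null` instead of throwing lets the caller keep rendering the button and explain the reason on tap.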
Reviews (4): Last reviewed commit: "fix: voice mode progress scale, hold-mod..."
Actionable comments posted: 2
🧹 Nitpick comments (2)
apps/web/lib/voice/whisper-web-models.ts (1)
5-5: ⚡ Quick win

Stale `modelId` comment contradicts the onnx-community rationale below.

This doc says the `Xenova/*` mirrors are used, but every entry uses `onnx-community/*` and lines 13-17 explain Xenova was deliberately avoided (MatMulNBits/WASM failures). Leaving this could prompt a regression back to Xenova.

📝 Align the comment with the actual mirror

    - /** Hugging Face model id (the Xenova/* mirrors are pre-quantized for transformers.js). */
    + /** Hugging Face model id (onnx-community/* mirrors; see note below on why not Xenova/*). */
      modelId: string;

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/web/lib/voice/whisper-web-models.ts` at line 5, Update the stale top-of-file comment that references "Xenova/*" so it correctly describes the actual mirror used (onnx-community/*) and the rationale for avoiding Xenova; locate the comment above the model list (the `modelId` doc comment near the `WHISPER_MODELS` entries) and replace it with a brief note that entries use `onnx-community/*` and mention the MatMulNBits/WASM failures as the reason Xenova mirrors were not used to prevent accidental regression.

apps/backend/internal/user/store/sqlite.go (1)

196-207: 💤 Low value

Cross-codebase sync required: voice mode defaults mirrored on frontend.
The comment on line 197 states "Mirrored on the frontend; keep in sync." This creates a maintenance burden where backend default changes must be manually replicated in the frontend code, and vice versa. If they drift, users see inconsistent defaults depending on whether settings were server-hydrated or client-initialized.
Consider one of:
- Expose defaults via API (e.g., `GET /api/v1/user-settings/voice-mode/defaults`)
- Code-gen frontend defaults from backend source
- Shared constants file (if monorepo structure allows)
For now, document the exact frontend file/location in the comment so future maintainers know where to update both sides.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/internal/user/store/sqlite.go` around lines 196 - 207, Update the comment above defaultVoiceModeSettings to document the exact frontend file and symbol that mirror these defaults so maintainers can sync changes; mention models.VoiceModeSettings and defaultVoiceModeSettings by name and add the frontend path and exported constant/function name (e.g., frontend file and constant that must be updated) and a short note to keep them in sync or expose via API in future.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/internal/user/service/service_test.go`:
- Around line 515-541: The test for partial updates using applyVoiceMode on a
models.UserSettings with only Engine set is missing assertions for the boolean
fields Enabled and AutoSend; because the service implementation unconditionally
assigns these fields they will be zeroed (false) during this partial update. Add
assertions after the applyVoiceMode call checking settings.VoiceMode.Enabled is
false and settings.VoiceMode.AutoSend is false (using t.Fatalf with clear
messages) so the test documents and locks in the current behavior for Enabled
and AutoSend when only Engine is provided.
In `@apps/web/lib/voice/whisper-web-client.ts`:
- Around line 108-115: The worker error handler currently only rejects the
in-flight promise and leaves a dead worker in place; update the "error" listener
on this.worker to terminate and clear the dead worker and reset state so
ensureWorker()/init/transcribe can recreate it: call this.worker.terminate() (if
available), set this.worker = null, set this.ready = false and
this.loadingModelId = null (and any other loading flags used), then reject and
clear this.pending as you already do; keep the rejection but ensure the
worker/state cleanup happens before resolving the pending logic so future calls
to ensureWorker() will create a fresh worker.
---
Nitpick comments:
In `@apps/backend/internal/user/store/sqlite.go`:
- Around line 196-207: Update the comment above defaultVoiceModeSettings to
document the exact frontend file and symbol that mirror these defaults so
maintainers can sync changes; mention models.VoiceModeSettings and
defaultVoiceModeSettings by name and add the frontend path and exported
constant/function name (e.g., frontend file and constant that must be updated)
and a short note to keep them in sync or expose via API in future.
In `@apps/web/lib/voice/whisper-web-models.ts`:
- Line 5: Update the stale top-of-file comment that references "Xenova/*" so it
correctly describes the actual mirror used (onnx-community/*) and the rationale
for avoiding Xenova; locate the comment above the model list (the `modelId` doc
comment near the `WHISPER_MODELS` entries) and replace it with a brief note that
entries use `onnx-community/*` and mention the MatMulNBits/WASM failures as the
reason Xenova mirrors were not used to prevent accidental regression.
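
The worker-recovery comment in the prompt above (terminate the dead worker, clear state, let `ensureWorker()` create a fresh one) can be illustrated as follows. A minimal sketch, not the PR's client: `WorkerLike` is a stand-in interface so the example runs without a browser, and rejection of the in-flight promise is omitted for brevity.

```typescript
// Stand-in for the browser Worker; only what the sketch needs (hypothetical).
interface WorkerLike {
  onerror: ((reason: string) => void) | null;
  terminate(): void;
}

class RecoveringWorkerClient {
  private worker: WorkerLike | null = null;
  private ready = false;
  private readonly createWorker: () => WorkerLike;

  constructor(createWorker: () => WorkerLike) {
    this.createWorker = createWorker;
  }

  // Returns the live worker, creating one if none exists or the last one crashed.
  ensureWorker(): WorkerLike {
    if (this.worker !== null) return this.worker;
    const worker = this.createWorker();
    worker.onerror = () => {
      // Ignore errors from a worker that has already been replaced.
      if (this.worker !== worker) return;
      worker.terminate();
      this.worker = null;
      this.ready = false;
    };
    this.worker = worker;
    return worker;
  }

  markReady(): void {
    this.ready = true;
  }

  isReady(): boolean {
    return this.ready;
  }
}
```

Capturing `worker` in the handler's closure, instead of reading `this.worker` when the error fires, is what keeps a late error from an old worker from tearing down its replacement.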
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f540a2bb-7d64-4c93-89c1-a5b26a929471
⛔ Files ignored due to path filters (1)
apps/pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (45)
- apps/backend/cmd/kandev/helpers.go
- apps/backend/cmd/kandev/main.go
- apps/backend/internal/common/config/config.go
- apps/backend/internal/user/controller/controller.go
- apps/backend/internal/user/dto/dto.go
- apps/backend/internal/user/models/models.go
- apps/backend/internal/user/service/service.go
- apps/backend/internal/user/service/service_test.go
- apps/backend/internal/user/store/sqlite.go
- apps/backend/internal/voice/handlers/transcribe_handlers.go
- apps/backend/internal/voice/handlers/transcribe_handlers_test.go
- apps/backend/internal/voice/transcribe/service.go
- apps/backend/internal/voice/transcribe/service_test.go
- apps/web/app/settings/voice-mode/page.tsx
- apps/web/components/settings/editors-settings-state.tsx
- apps/web/components/settings/keyboard-shortcuts-card.tsx
- apps/web/components/settings/settings-app-sidebar.tsx
- apps/web/components/settings/voice-mode-settings.tsx
- apps/web/components/task/chat/chat-input-body.tsx
- apps/web/components/task/chat/chat-input-container.tsx
- apps/web/components/task/chat/chat-input-toolbar.tsx
- apps/web/components/task/chat/voice-input-button.tsx
- apps/web/hooks/use-user-display-settings.ts
- apps/web/hooks/use-voice-input.test.ts
- apps/web/hooks/use-voice-input.ts
- apps/web/lib/api/domains/settings-api.ts
- apps/web/lib/api/domains/voice-api.test.ts
- apps/web/lib/api/domains/voice-api.ts
- apps/web/lib/keyboard/constants.ts
- apps/web/lib/keyboard/shortcut-overrides.test.ts
- apps/web/lib/keyboard/shortcut-overrides.ts
- apps/web/lib/ssr/user-settings.test.ts
- apps/web/lib/ssr/user-settings.ts
- apps/web/lib/state/slices/settings/settings-slice.ts
- apps/web/lib/state/slices/settings/types.ts
- apps/web/lib/types/backend.ts
- apps/web/lib/types/http-voice.ts
- apps/web/lib/types/http.ts
- apps/web/lib/voice/capabilities.test.ts
- apps/web/lib/voice/capabilities.ts
- apps/web/lib/voice/whisper-web-client.ts
- apps/web/lib/voice/whisper-web-models.ts
- apps/web/lib/ws/handlers/users.ts
- apps/web/package.json
- apps/web/workers/whisper-web.worker.ts
9 issues found
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/web/hooks/use-voice-input.ts">
<violation number="1" location="apps/web/hooks/use-voice-input.ts:212">
P2: `cancel()` aborts Web Speech without detaching callbacks, so abort events can still surface as errors/transcripts after the user cancels.</violation>
</file>
<file name="apps/backend/internal/voice/transcribe/service_test.go">
<violation number="1" location="apps/backend/internal/voice/transcribe/service_test.go:58">
P2: Avoid `t.Fatalf` inside the server handler goroutine; use `t.Errorf` and return instead so the test fails safely without invoking `FailNow` from a non-test goroutine.</violation>
</file>
<file name="apps/backend/internal/voice/handlers/transcribe_handlers.go">
<violation number="1" location="apps/backend/internal/voice/handlers/transcribe_handlers.go:42">
P1: `/api/v1/transcribe` is exposed without authentication or throttling, which allows unauthenticated abuse of the server-side OpenAI key.</violation>
<violation number="2" location="apps/backend/internal/voice/handlers/transcribe_handlers.go:59">
P2: Oversized multipart bodies are misclassified as missing-file 400 errors; return 413 when `FormFile` fails due to body size limits.</violation>
</file>
<file name="apps/web/components/settings/voice-mode-settings.tsx">
<violation number="1" location="apps/web/components/settings/voice-mode-settings.tsx:404">
P2: Shortcut persistence also captures `userSettings` in a closure and re-applies it in the async error path, which can roll back unrelated newer changes. Fetch current store state inside the handler/catch instead of writing the captured snapshot.
(Based on your team's feedback about reading live Zustand state in async handlers to avoid stale-closure races.) [FEEDBACK_USED]</violation>
</file>
<file name="apps/backend/internal/voice/transcribe/service.go">
<violation number="1" location="apps/backend/internal/voice/transcribe/service.go:61">
P2: Guard against nil in `WithHTTPClient` to prevent a nil-pointer panic in `Transcribe`.</violation>
</file>
<file name="apps/web/lib/api/domains/voice-api.ts">
<violation number="1" location="apps/web/lib/api/domains/voice-api.ts:28">
P2: `options?.init` is spread after `method`/`body`, so caller-provided init can override the required multipart POST request and break transcription uploads.</violation>
</file>
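
The `voice-api.ts` violation above comes down to object-spread order: later properties win. A minimal sketch of the safe ordering: `InitLike` is a simplified stand-in for the fetch `RequestInit` type and `buildTranscribeInit` is a hypothetical helper name.

```typescript
// Simplified stand-in for fetch's RequestInit (hypothetical, to stay DOM-free).
interface InitLike {
  method?: string;
  body?: unknown;
  headers?: Record<string, string>;
}

// Spread caller options first so the required method and body always win.
function buildTranscribeInit(body: unknown, callerInit?: InitLike): InitLike {
  return { ...callerInit, method: "POST", body };
}
```

A caller can still add headers or other options, but cannot turn the upload into a `GET` or replace the multipart body.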
Bot-review fixes from #1159:

- Map http.MaxBytesError from c.FormFile to 413 instead of bare 400.
- Strip BCP-47 region suffix before passing language hint to Whisper Web.
- Recover from worker crash: terminate + null worker + reset ready flag so next init/transcribe creates a fresh worker instead of hanging.
- Read live store state via storeApi.getState() in voice-mode settings savers to avoid stale-closure clobbering concurrent settings updates.
- Detach Web Speech callbacks before abort() so trailing onend/onerror events don't mutate hook state after cancel().
- Spread caller init before method/body in voice-api fetch so a caller-supplied init can't override the multipart POST.
- Replace t.Fatalf with t.Errorf+return inside test HTTP handler goroutine (FailNow from non-test goroutine is unsafe).
- Add Enabled/AutoSend zeroing assertions in partial-update test so the bool-overwrite-on-PATCH behavior is locked in explicitly.

E2E fix:

- toolbar-overflow spec used button.rounded-full which now matches both the send and voice buttons (strict-mode violation). Switched to the submit-message-button testid for an unambiguous locator.
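
The BCP-47 fix in the list above reduces a tag such as `en-US` to its primary language subtag before it reaches Whisper Web. A minimal sketch: the `toWhisperLanguage` name and the convention that `"auto"` maps to `undefined` (auto-detect) are assumptions.

```typescript
// Reduce a BCP-47 tag to its primary language subtag, e.g. "pt-BR" to "pt".
// Returns undefined for an empty hint or "auto" (assumed to mean auto-detect).
function toWhisperLanguage(hint: string): string | undefined {
  const trimmed = hint.trim();
  if (trimmed === "" || trimmed.toLowerCase() === "auto") return undefined;
  const primary = trimmed.split(/[-_]/)[0] ?? trimmed;
  return primary.toLowerCase();
}
```

For example, `toWhisperLanguage("en-US")` yields `"en"`, while a bare `"de"` passes through unchanged.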
Claude finished @jcfs's task in 3m 59s

Code Review: feat: add voice mode for chat input

Findings

Blocker (must fix before merge)

Suggestion (recommended, doesn't block)

Previous review issues: status

All issues from the earlier …

Summary

Verdict: Blocked. Fix the two test-coverage blockers first. The backend plumbing (transcription service, config, user settings persistence, validation, error hierarchy) is well-structured throughout. The frontend hook, worker, and settings page are clean. The two blockers are purely on test coverage for the …

4 issues found across 8 files (changes from recent commits).
…keys

Round 2 of PR review feedback on #1159:

- voice-mode settings rollback now restores only the keys patched by the failing request (not the whole snapshot), preserving concurrent edits to unrelated voiceMode fields.
- Shortcut rollback diffs previous vs next and reverts only the changed keys (restore prior value or delete if absent), preserving other concurrent shortcut edits.
- Whisper worker error handler captures the worker reference at attach time and only clears refs when this.worker still matches it, so a stale error from a previously replaced worker no longer nukes the active one.
- Update stale whisper-web-models JSDoc that still pointed at Xenova mirrors.
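
The key-scoped rollback described in the first item above can be sketched as a pure function. This is an illustration under assumptions, not the PR's code: `rollbackPatchedKeys` is a hypothetical name, and `latest` stands for whatever the store holds when the failed request is handled.

```typescript
// Revert only the keys the failed request tried to change.
// Keys outside `patch` keep whatever value `latest` has, so concurrent
// edits to unrelated fields survive the rollback.
function rollbackPatchedKeys<T extends object>(latest: T, previous: T, patch: Partial<T>): T {
  const restored = { ...latest };
  for (const key of Object.keys(patch) as (keyof T)[]) {
    restored[key] = previous[key];
  }
  return restored;
}
```

Restoring the whole pre-request snapshot would be simpler, but it would also discard any unrelated change that landed while the request was in flight.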
Claude finished @jcfs's task in 3m 51s

Code Review: feat: add voice mode for chat input

Findings

Blocker (must fix before merge)

Suggestion (recommended, doesn't block)

Previous review findings: status

All earlier blockers confirmed addressed as of …

Summary

Verdict: Blocked. Fix the progress scale bug before merge. One-liner fix at …

2 issues found across 3 files (changes from recent commits).
Round 3 of PR review feedback on #1159:

- Normalise Whisper progress to 0–1 in ensureWhisperClient (transformers.js emits 0–100, the button's display + the `ready: 1` convention expect 0–1; fixes the 5000% mid-download display).
- Claim driverRef synchronously at the top of finishCapture so concurrent pointerup + pointerleave invocations early-return instead of racing and clobbering a freshly-started recording's ref.
- Race-aware rollback: voice-mode and shortcut catch branches now skip reverting a key when latest[key] !== next[key]. A failed earlier request no longer overwrites a later successful save to the same key.
- Update the Worker wire-protocol comment to show onnx-community/whisper-base instead of the stale Xenova/* example.
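
The progress fix in the first item above is a unit conversion: a value on a 0 to 100 scale rendered by code that expects 0 to 1 shows 50 as 5000%. A minimal sketch: `normalizeProgress` is a hypothetical helper, and the clamping and non-finite handling are added assumptions rather than behaviour confirmed by the commit.

```typescript
// Convert a 0-100 progress value to the 0-1 range the UI expects.
// Non-finite input is treated as 0, and results are clamped to [0, 1].
function normalizeProgress(percent: number): number {
  if (!Number.isFinite(percent)) return 0;
  return Math.min(1, Math.max(0, percent / 100));
}
```

Clamping keeps an out-of-range report from pushing a progress ring past full or below empty.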
Claude finished @jcfs's task in 3m 54s

Code Review: feat: add voice mode for chat input

Findings

Previous blockers: all resolved in …

| Severity | Count |
|---|---|
| Deferred blocker | 2 |
Verdict: All runtime bugs are fixed. The only open items are the two test-coverage gaps the author has explicitly deferred to a follow-up issue. The backend package (transcription service, config, user settings persistence, error hierarchy) is solid throughout. The frontend hook, settings page, worker, and rollback logic are now in good shape after the four rounds of fixes.
Adds voice input to the chat composer with three transcription engines (in-browser Web Speech, in-browser Whisper via transformers.js, server-side OpenAI Whisper fallback) and a `/settings/voice-mode` page so users can pick engine, language, activation mode, auto-send, model size, and shortcut, or turn the whole feature off.

Important Changes

- `internal/voice` backend package with a `POST /api/v1/transcribe` endpoint behind `KANDEV_VOICE_OPENAI_API_KEY` (returns 503 when unset; frontend gracefully degrades).
- `voice_mode` JSON object: no schema migration; old rows default `enabled` to true via a `*bool` storage adapter.
- `@huggingface/transformers` with `fp16` decoder weights pulled from `onnx-community/whisper-*` (sidesteps the `MatMulNBits` scale-missing crash that q4/q8 hit on WASM).
- `user.settings.updated` WS event so toggling on one tab updates every open chat tab live.

Validation

- `make -C apps/backend test lint`: all packages pass, 0 lint issues.
- `cd apps && pnpm --filter @kandev/web typecheck lint test`: green, 2517 tests pass.
- `voice/transcribe`, `voice/handlers`, `useVoiceInput`, `parseVoiceMode`, `detectVoiceCapabilities`, `resolveActiveEngine`, `transcribeAudio`, `applyVoiceMode`.

Possible Improvements

Low risk for non-voice flows (additive). The Whisper Web first-load is ~75 MB on Base; it would be smaller if we ever wire WebGPU + q4 detection. Server-side Whisper currently only reads a global `KANDEV_VOICE_OPENAI_API_KEY`; a per-user/agent credential lookup is a natural follow-up but explicitly out of scope here.

Checklist

- `apps/web/`), I have added or updated Playwright e2e tests in `apps/web/e2e/` and verified them with `make test-e2e`.

Preview Environment

ffe88a6