
32GB VRAM Docker setup, E2E tests, reference preencode/upload, streaming support#1193

Draft
konovalov-nk wants to merge 4 commits into fishaudio:main from konovalov-nk:feature/32gb-docker-e2e

Conversation


@konovalov-nk konovalov-nk commented Mar 15, 2026

Streaming works on this branch: low TTFA (~400 ms with the torch.compile flag), and chunks are sent as soon as they're ready. Includes a Docker setup and scripts for ~32 GB GPUs (e.g. RTX 5090), so anyone can try it. Draft, don't merge 🤣

Branch: konovalov-nk/fish-speech@feature/32gb-docker-e2e
Docs: docs/docker-32gb-rtx5090.md

Minimal run (from repo root): make run-server starts the API in Docker, then make e2e runs the smoke test. make help lists all targets.

  • scripts/run_server_32gb.sh, WORKSPACE_DIR for nested repo
  • KV cache / memory: clear_caches(), FISH_CACHE_MAX_SEQ_LEN, /v1/debug/memory
  • E2E: scripts/e2e_smoke.sh, e2e_memory.sh; uses reference_id from server when available
  • References: preencode, upload_references.sh, POST /v1/references/add_encoded (skip if hash matches)
  • Makefile: run-server, e2e, preencode, upload-references, test

What's left to do:

  1. Stream tokens into vocoder with a schedule (per lengyue), not one big chunk.
  2. Cut memory use more and improve TTFA (profile, smaller first chunk, CUDA graphs).
  3. Support longer prompts (~30–50 words) for agent TTS without OOM.
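TODO #1 (streaming tokens into the vocoder on a schedule rather than one big chunk) could look roughly like the sketch below: a small first chunk to keep TTFA low, then geometrically growing chunks for throughput. The parameter names and growth policy are illustrative assumptions, not the schedule the PR will land.

```python
def chunk_schedule(total_tokens: int, first: int = 16,
                   growth: float = 2.0, cap: int = 256):
    """Yield (start, end) token ranges for feeding the vocoder.

    Sketch only: a tiny first chunk minimizes time-to-first-audio,
    and later chunks grow (up to `cap`) to amortize per-call overhead.
    """
    start = 0
    size = float(first)
    while start < total_tokens:
        end = min(start + int(size), total_tokens)
        yield start, end
        start = end
        size = min(size * growth, float(cap))
```

For 100 tokens with the defaults this emits ranges of 16, 32, and then the remaining 52 tokens, so audio starts playing after only the first 16 tokens are decoded.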

konovalov-nk and others added 2 commits March 15, 2026 05:31
- Docker: run_server_32gb.sh, WORKSPACE_DIR for nested repo, docs/docker-32gb-rtx5090.md
- Memory: clear_caches() after request, FISH_CACHE_MAX_SEQ_LEN/MAX_NEW_TOKENS_CAP, /v1/debug/memory
- E2E: scripts/e2e_smoke.sh, e2e_memory.sh; use reference_id from server when available
- References: preencode (scripts/preencode.sh), upload (upload_references.sh), add_encoded API, hash skip
- Makefile: run-server, e2e, e2e-memory, preencode, preencode-upload, upload-references, test
- Default voice refs dir: data/voice_references
- .gitignore: memory_metrics.jsonl, .pytest_cache, memory_snapshot_*.pickle
- Fix: views non-streaming TTS (engine.inference), add_reference finally indentation

Made-with: Cursor
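The memory_metrics.jsonl file mentioned above suggests one record per measurement; a consumer might summarize it like this. The field name `allocated_mb` is an assumption about the file's schema, not confirmed by the PR.

```python
import json


def peak_allocated(jsonl_text: str) -> float:
    """Return the peak 'allocated_mb' across JSONL memory records.

    Hypothetical reader for a file like memory_metrics.jsonl; the real
    schema written by e2e_memory.sh may use different field names.
    """
    peak = 0.0
    for line in jsonl_text.splitlines():
        if line.strip():
            peak = max(peak, float(json.loads(line).get("allocated_mb", 0.0)))
    return peak
```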
@konovalov-nk konovalov-nk changed the title 32GB VRAM Docker setup, E2E tests, reference preencode/upload 32GB VRAM Docker setup, E2E tests, reference preencode/upload, streaming support Mar 15, 2026
konovalov-nk and others added 2 commits March 15, 2026 08:04
…TFA metrics, inductor warning filter, warmup logs

- run_server_32gb.sh: --entrypoint for huggingface-cli download (avoid start_webui.sh/uv in container)
- e2e_smoke.sh: curl --compressed for JSON endpoints; jq parse fallback; ttfa_smoke.py for streaming + oneshot TTFA/total_s
- ttfa_smoke.py: one TTS request with timing (ttfa_s, ttfa_audio_s, total_s), supports --oneshot
- api_server.py: filter UserWarning from torch._inductor (Logical operators and/or deprecated)
- model_manager.py: clear warmup logs (torch.compile enabled / warmup finished) so compile progress is visible
- Makefile: pass COMPILE to run-server, remove e2e-compile target, update help

Made-with: Cursor
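The timing reported by ttfa_smoke.py (ttfa_s, total_s) can be captured with a small wrapper over any chunk iterator. This is a minimal sketch of the measurement idea only; the real script issues an HTTP TTS request and also reports ttfa_audio_s.

```python
import time


def measure_ttfa(chunks):
    """Time a streaed-audio iterator: time-to-first-chunk and total time.

    Sketch of the metric ttfa_smoke.py reportedly computes; the chunk
    source here is any iterable of bytes, not a live TTS stream.
    """
    t0 = time.perf_counter()
    ttfa_s = None
    total_bytes = 0
    for chunk in chunks:
        if ttfa_s is None:
            # First chunk observed: this latency is the TTFA.
            ttfa_s = time.perf_counter() - t0
        total_bytes += len(chunk)
    return {"ttfa_s": ttfa_s,
            "total_s": time.perf_counter() - t0,
            "bytes": total_bytes}
```

In the streaming case ttfa_s should be much smaller than total_s; in a one-shot request the two converge, which is what the --oneshot comparison surfaces.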