voice diarization integration #582

OCEISCO · 2025-09-24T17:43:38Z

OCEISCO
Sep 24, 2025

I want to start by saying thank you for this incredible project. I have never had such a smooth installation process, especially when it comes to hardware support. The GPU passthrough worked flawlessly with my new NVIDIA RTX 5060 and 5070 cards, and the API is unbelievably fast and stable.

My primary use case is integrating this as the main STT/TTS pipeline for my real-time voice assistant in Home Assistant. Because of the project's high performance, I can now handle the entire audio stream in a single container. To get it fully working with the Home Assistant server, I built a custom integration, which has been a great success.

As I continue to build on this powerful foundation, I have two suggestions that I believe would be a fantastic addition:

Enhanced STT with Speaker Diarization: It would be a game-changer to have speaker diarization capabilities. For multi-user environments like a smart home, distinguishing between speakers is essential. An integration of a model like whisperX would be a phenomenal feature.
A Pre-configured "Development Environment Image": This project's ability to correctly configure and leverage new hardware is one of its best features. It would be a massive help to have preconfigure docker contrainer, essentially a "Development Environment Image." This would provide a complete, out-of-the-box toolkit for users, especially those with new hardware that lacks established tooling.

Thank you again for all your hard work. This has already become an essential part of my smart home ecosystem.

rsxdalv · 2025-09-24T22:23:48Z

rsxdalv
Sep 24, 2025
Maintainer

Thanks for reaching out!
I understand #1 and we could integrate that within the OpenAI API /transcription endpoint, which is where you'd want it, I assume?
I'm not sure what #2 refers to, could you clarify that more? It's mostly confusing because if there is a lack of tooling (e.g pytorch), how would I provide it, and which platform? Or is the tooling the "TTS WebUI"? And mostly I'm interested in how that is different from the current Docker image. Is it more developer-focused, so should it include additional tools for development?

1 reply

OCEISCO Sep 25, 2025
Author

If it doesn't conflict with other packages, WhisperX would be a great addition. It would help recognize speaker IDs during speech-to-text (STT), a task I'm currently handling with the latest Whisper addition to the web UI by sending the ID in the pipeline payload. The LLM would then be able to contextualize its response based on who it's talking to. As I understand it, WhisperX wraps Whisper.

Regarding the TTS Webui Docker container, I find it rare that a setup runs smoothly with proper GPU/CUDA passthrough. When using new hardware, I more often than not find myself spending a substantial amount of time putting out fires caused by conflicts between Python, various packages, and other integrations, including the CUDA toolkit.

The TTS WebUI setup just works flawlessly right out of the box. Everything from its excellent UI/UX and functions to the process of installing and testing extensions works perfectly. It provides visually organized feedback with all the tooling I would expect from this type of development environment.

Since you have already built it, it functions as a skeleton for running extensions, and the environment, CUDA passthrough, and compatibility just work, I would love to see a version of the docker container, serve as a development environment for testing other associated integrations, that will allow me as a user to test/break/repeat all sort of STT, TTS , wake words, etc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

voice diarization integration #582

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

voice diarization integration #582

Uh oh!

OCEISCO Sep 24, 2025

Replies: 1 comment · 1 reply

Uh oh!

rsxdalv Sep 24, 2025 Maintainer

Uh oh!

Uh oh!

OCEISCO Sep 25, 2025 Author

OCEISCO
Sep 24, 2025

Replies: 1 comment 1 reply

rsxdalv
Sep 24, 2025
Maintainer

OCEISCO Sep 25, 2025
Author