Realtime transcription API and VAD #5377

Closed
richiejp opened this issue May 16, 2025 · 4 comments

Labels
enhancement New feature or request

Comments

@richiejp
Collaborator

Is your feature request related to a problem? Please describe.

When transcribing speech with VoxInput, I want to show partial transcriptions. Looking further ahead, I also want voice commands that don't require the user to press a button; this requires constant streaming and/or VAD.

Describe the solution you'd like

I could implement VAD in VoxInput and make regular requests to LocalAI using the regular transcription API.
I prototyped this without VAD and the results were pretty bad: richiejp/VoxInput#2
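For illustration, the client-side approach could be sketched roughly as below. This is not what VoxInput does and it does not use Silero: it swaps in a naive energy threshold as a stand-in VAD, and the names `frame_rms`, `speech_segments`, `threshold` and `hangover` are hypothetical. It assumes the microphone delivers fixed-size frames of little-endian 16-bit PCM. Each detected segment would then be sent as one regular transcription request.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """RMS level of one frame of little-endian 16-bit PCM, scaled to 0..1."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n) / 32768.0


def speech_segments(frames, threshold=0.02, hangover=2):
    """Return (start, end) frame index ranges, end exclusive, judged to be speech.

    Assumption: a plain energy threshold stands in for a real VAD model.
    A segment is closed after `hangover` consecutive quiet frames.
    """
    frames = list(frames)
    segments = []
    start = None  # index of the first frame of the open segment, if any
    quiet = 0     # consecutive quiet frames seen inside the open segment
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover:
                # Trim the trailing quiet frames off the segment.
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, len(frames) - quiet))
    return segments
```

The hangover keeps short pauses inside a sentence from splitting it into several requests, which is one of the reasons transcription without any VAD tends to give poor results.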

It would be nice to use the Silero VAD model in Whisper, and at that point we may as well implement the full streaming API:
https://platform.openai.com/docs/guides/realtime-transcription
https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer
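As a rough sketch of what a client would send under the linked API, the audio is streamed as JSON events over a WebSocket: `input_audio_buffer.append` carries base64-encoded audio and `input_audio_buffer.commit` closes the buffer. The helper names and the chunk size below are hypothetical, and the WebSocket connection itself is left out. With server-side VAD enabled, the linked docs indicate the server commits the buffer itself, so the explicit commit would not be needed.

```python
import base64
import json


def append_event(pcm_chunk: bytes) -> str:
    """Build an input_audio_buffer.append event carrying base64-encoded audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def commit_event() -> str:
    """Build an input_audio_buffer.commit event (manual, non-VAD mode)."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def chunk_events(pcm: bytes, chunk_bytes: int = 4800):
    """Yield append events for successive chunks, then a final commit.

    Assumption: chunk_bytes is an arbitrary illustrative size, not a value
    required by the API.
    """
    for offset in range(0, len(pcm), chunk_bytes):
        yield append_event(pcm[offset:offset + chunk_bytes])
    yield commit_event()
```

Each yielded string would be sent as one text message on the realtime connection.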

Describe alternatives you've considered

Implement VAD in VoxInput. However, there are use cases for this beyond VoxInput, and it would be nice to keep VoxInput simple because I can't distribute it in a container very easily. Alternatively, Silero could be used with a different backend in LocalAI.

Additional context

@richiejp added the enhancement label May 16, 2025
@richiejp
Collaborator Author

Ah I see there is already a VAD endpoint, so I could use that, but I would still need to do a bunch of work on the client side.

@mudler
Owner

mudler commented May 17, 2025

Yes, just for context, some work was already done in #3722. The only challenge left there is the audio formats: it seems I got the byte unpacking wrong when converting between types, and VAD doesn't seem to work correctly when testing end to end. Besides that, the PR should be a good starting point.
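The comment does not say which types are involved. As a hedged illustration only, assuming the client sends little-endian 16-bit PCM and the VAD model wants float32 samples in the range -1 to 1, the conversion could look like this (function names are hypothetical and this is not the code from the PR):

```python
import struct


def pcm16le_to_float32(data: bytes) -> list:
    """Convert little-endian signed 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    if len(data) % 2:
        raise ValueError("PCM16 data must contain an even number of bytes")
    n = len(data) // 2
    # "<" forces little-endian so the result does not depend on the host CPU.
    return [s / 32768.0 for s in struct.unpack(f"<{n}h", data)]


def float32_to_le_bytes(samples) -> bytes:
    """Pack float samples as little-endian IEEE 754 float32 bytes."""
    return struct.pack(f"<{len(samples)}f", *samples)
```

Typical mistakes in this step are reading the bytes with the wrong endianness or reinterpreting the int16 buffer as float32 without scaling, either of which would make a VAD model see noise.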

@mudler
Owner

mudler commented May 17, 2025

This should probably be considered a duplicate of #3714.

@richiejp
Collaborator Author

Right, this is just a subgoal of that. I'll take a look at the PR.

2 participants