Yes, just for context: some work was already done in #3722. The only challenges left there are the audio formats. It seems I got the byte unpacking and conversion between types wrong somehow, and VAD doesn't seem to work correctly when testing e2e. Besides that, that PR should be a good starting point.
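For reference, here is a minimal Go sketch of the kind of conversion being discussed. It assumes the incoming audio is signed 16-bit little-endian PCM and that the VAD model wants float32 samples in the range [-1, 1); neither assumption is confirmed in this thread, and `pcm16ToFloat32` is a hypothetical helper, not code from #3722.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// pcm16ToFloat32 converts signed 16-bit little-endian PCM bytes into
// float32 samples in the range [-1, 1).
// Assumption: the input format is PCM16 LE; adjust if the source differs.
func pcm16ToFloat32(b []byte) ([]float32, error) {
	if len(b)%2 != 0 {
		return nil, errors.New("pcm16: odd number of bytes")
	}
	out := make([]float32, len(b)/2)
	for i := range out {
		// Read two bytes as an unsigned value, then reinterpret as signed.
		s := int16(binary.LittleEndian.Uint16(b[2*i:]))
		out[i] = float32(s) / 32768
	}
	return out, nil
}
```

A common mistake in this kind of code is to skip the `int16` reinterpretation, which turns negative samples into large positive ones and can make a VAD see constant loud noise.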
Is your feature request related to a problem? Please describe.
When transcribing text with VoxInput, I want to show partial transcriptions. Looking further ahead, I also want voice commands that don't require the user to press a button, which requires constant streaming and/or VAD.
Describe the solution you'd like
I could implement VAD in VoxInput and make regular requests to LocalAI using the regular transcription API.
I prototyped this without VAD and it is pretty bad: richiejp/VoxInput#2
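To make the client-side option concrete, here is a rough Go sketch of splitting audio into speech segments before sending each one to the transcription API. It uses a crude RMS energy threshold as a stand-in for a real VAD; it is not Silero, and `Segment`, `rms`, and `segmentSpeech` are hypothetical names chosen for illustration.

```go
package main

import "math"

// Segment is a half-open range [Start, End) of sample indices judged to
// contain speech.
type Segment struct{ Start, End int }

// rms returns the root-mean-square level of a frame.
func rms(frame []float32) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		sum += float64(s) * float64(s)
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// segmentSpeech walks the samples in frames of frameLen, treats a frame as
// speech when its RMS is at least threshold, and closes a segment once
// hangover consecutive quiet frames have been seen. A trailing partial
// frame is ignored.
func segmentSpeech(samples []float32, frameLen int, threshold float64, hangover int) []Segment {
	if frameLen <= 0 {
		return nil
	}
	var segs []Segment
	start, quiet := -1, 0
	for off := 0; off+frameLen <= len(samples); off += frameLen {
		loud := rms(samples[off:off+frameLen]) >= threshold
		switch {
		case loud && start < 0:
			start, quiet = off, 0 // speech begins
		case loud:
			quiet = 0 // speech continues
		case start >= 0:
			quiet++
			if quiet >= hangover {
				// End the segment where the quiet run began.
				segs = append(segs, Segment{start, off + frameLen - quiet*frameLen})
				start, quiet = -1, 0
			}
		}
	}
	if start >= 0 {
		// Audio ended mid-segment; trim any trailing quiet frames.
		end := len(samples) - len(samples)%frameLen - quiet*frameLen
		segs = append(segs, Segment{start, end})
	}
	return segs
}
```

Each returned segment could then be encoded and posted as an ordinary transcription request. An energy threshold like this is easily fooled by background noise, which is part of why a trained model such as Silero is attractive.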
It would be nice to use the Silero VAD model in Whisper, and at that point we may as well implement the full streaming API:
https://platform.openai.com/docs/guides/realtime-transcription
https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer
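As a sketch of what a client would send under that API, the following Go code builds the `input_audio_buffer.append` and `input_audio_buffer.commit` client events as JSON. The event and field names follow my reading of the reference linked above and should be checked against it; the helper names `newAppendEvent` and `newCommitEvent` are hypothetical, and nothing here implies LocalAI accepts these events yet.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
)

// appendEvent mirrors the shape of an input_audio_buffer.append client
// event: a type tag plus a base64-encoded chunk of audio bytes.
type appendEvent struct {
	Type  string `json:"type"`
	Audio string `json:"audio"`
}

// newAppendEvent wraps a chunk of raw audio bytes in an append event.
func newAppendEvent(chunk []byte) ([]byte, error) {
	return json.Marshal(appendEvent{
		Type:  "input_audio_buffer.append",
		Audio: base64.StdEncoding.EncodeToString(chunk),
	})
}

// newCommitEvent builds the event that asks the server to commit the
// buffered audio. With server-side VAD enabled, the client would
// typically not need to send this itself.
func newCommitEvent() ([]byte, error) {
	return json.Marshal(struct {
		Type string `json:"type"`
	}{Type: "input_audio_buffer.commit"})
}
```

A client would write these messages to the realtime connection as audio is captured, so the server can run VAD and return partial transcriptions without waiting for a button press.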
Describe alternatives you've considered
Implement VAD in VoxInput. However, there are other use-cases for this beyond VoxInput, and it would be nice to keep VoxInput simple because I can't distribute it in a container very easily. We could also use Silero with a different backend in LocalAI.
Additional context