@longcw
Contributor

@longcw longcw commented Nov 21, 2025

Add an STTCapabilities.flush flag to indicate whether the STT supports flush (manual commit), and make stt.StreamAdapter work with streaming STT.

use cases:

  1. only send audio frames to STT when VAD detects user speech.
  2. support manual commit for ElevenLabs Scribe v2, which addresses the review on "feat(elevenlabs): add STTv2 with streaming support for Scribe v2" #3909 (review)

Related to #3881; should be merged to main once #4041 is done.
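
The first use case can be sketched roughly as follows. This is only an illustration under assumptions: `Capabilities`, `FakeStream`, and `gate_frames` are hypothetical stand-ins, not the livekit-agents API, and the speech flag would come from a VAD in a real pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Capabilities:
    # Hypothetical mirror of the proposed capability flag.
    streaming: bool = True
    flush: bool = False  # True if the STT accepts a manual commit


@dataclass
class FakeStream:
    # Stand-in for a streaming STT session; it only records what it receives.
    capabilities: Capabilities
    events: list = field(default_factory=list)

    def push_frame(self, frame: bytes) -> None:
        self.events.append(("frame", frame))

    def flush(self) -> None:
        self.events.append(("flush", None))


def gate_frames(stream: FakeStream, frames) -> None:
    """Forward only speech frames and commit when speech ends.

    `frames` is an iterable of (frame_bytes, is_speech) pairs.
    """
    was_speaking = False
    for frame, is_speech in frames:
        if is_speech:
            stream.push_frame(frame)
        elif was_speaking and stream.capabilities.flush:
            # Speech just ended: ask the STT to finalize the segment.
            stream.flush()
        was_speaking = is_speech
```

With a flush-capable stream, a run of speech frames followed by silence produces the frames and then a single commit; without the capability, the commit is simply skipped.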

Contributor

@chenghao-mou chenghao-mou left a comment


Something is off here. Whenever I take a pause longer than a few seconds, the connection throws an APIStatusError(message="ElevenLabs STT connection closed unexpectedly") with a WSMessage(type=<WSMsgType.CLOSE: 8>, data=1000, extra=''), but it doesn't happen with the non-wrapped version.

My hunch is that it doesn't like empty audio data when committing.

Even after I changed the ping time to 1s, it still throws the same error. Not sure if the issue is on our end.

@longcw
Contributor Author

longcw commented Nov 21, 2025

It seems the ElevenLabs STT has a timeout on audio input; we may need an option to always send audio to the STT.

Update: added a silence_mode: Literal["drop", "zeros", "passthrough"] option to send the original or zero-filled frames when VAD is negative. It's true that not every STT supports discontinuous audio frames.
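
The three modes can be sketched as below. This is an illustration under assumptions, not the adapter's actual code: `apply_silence_mode` is a hypothetical helper, and frames are assumed to be raw PCM bytes.

```python
from typing import Literal, Optional

SilenceMode = Literal["drop", "zeros", "passthrough"]


def apply_silence_mode(
    frame: bytes, is_speech: bool, mode: SilenceMode
) -> Optional[bytes]:
    """Return the bytes to send to the STT, or None to send nothing."""
    if is_speech or mode == "passthrough":
        return frame  # forward the original audio unchanged
    if mode == "zeros":
        return bytes(len(frame))  # same length, all-zero bytes
    if mode == "drop":
        return None  # skip the frame entirely
    raise ValueError(f"unknown silence_mode: {mode!r}")
```

"drop" sends the least data but leaves gaps in the stream, which is what an audio-input timeout like the one above would react to; "zeros" and "passthrough" keep the stream continuous.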

@longcw longcw closed this Nov 21, 2025
@longcw longcw reopened this Nov 21, 2025
@chenghao-mou
Contributor

Thanks for adding that option this quickly. However, I don't think it works well with 11labs: I am getting this:

    11:15:00 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}  
    11:15:03 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}  
    11:15:04 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}  
    11:15:07 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}  
    11:15:08 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'} 

with the zero silence.

@longcw
Contributor Author

longcw commented Nov 21, 2025

I think that's an ElevenLabs issue: even when the audio is passed through, it may generate either these tags or some random characters if there is slight background noise.

When interruption from interim transcripts is enabled, this actually breaks the agent playout, and for now I don't think there is a good solution. I would expect them to improve their VAD model or fix this.
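
As a partial stopgap only, tag-only interim transcripts could be ignored before they trigger an interruption. This is a sketch under assumptions, not something in this PR: `is_tag_only` and `should_interrupt` are hypothetical helpers, the tag format is assumed to match the `*static*` output in the log, and it does nothing about the random characters.

```python
import re

# Assumption: non-speech tags look like "*static*", as in the log output.
_TAG_ONLY = re.compile(r"^\s*(\*[^*]+\*\s*)+$")


def is_tag_only(text: str) -> bool:
    """True if the transcript consists solely of *tag* markers."""
    return bool(_TAG_ONLY.match(text))


def should_interrupt(interim_text: str) -> bool:
    """Ignore empty and tag-only interim transcripts."""
    return bool(interim_text.strip()) and not is_tag_only(interim_text)
```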

@chenghao-mou
Contributor

Yeah, I agree. Should we just add a warning somewhere in the example or readme? I think it is totally fine to have the implementation available.

Contributor

@chenghao-mou chenghao-mou left a comment


LGTM

@longcw
Contributor Author

longcw commented Nov 21, 2025

@chenghao-mou update: it seems the ElevenLabs STT works better when server VAD is disabled, but I still sometimes get random output from the STT, because it generates text whether the input is silent or just noise.

stt=stt.StreamAdapter(
    stt=elevenlabs.STT(
        use_realtime=True,
        server_vad=None,  # disable server-side VAD
        language_code="en",
    ),
    vad=ctx.proc.userdata["vad"],
    use_streaming=True,
),

You can test it with this example: https://github.com/livekit/agents/blob/longc/stream-stt-flush/examples/other/elevenlab_scribe_v2.py

@chenghao-mou
Contributor

Yes, that was how I tested. It just hallucinates a lot no matter what options I tried.

Base automatically changed from longc/11labs-stt-realtime to main November 22, 2025 00:52