Releases: pipecat-ai/pipecat

v1.0.0

14 Apr 19:11
457f55e

Migration guide: https://docs.pipecat.ai/pipecat/migration/migration-1.0

Added

  • Updated LemonSlice transport:

    • Added on_avatar_connected and on_avatar_disconnected events triggered when the avatar joins and leaves the room.
    • Added api_url parameter to LemonSliceNewSessionRequest to allow overriding the LemonSlice API endpoint.
    • Added support for passing arbitrary named parameters to the LemonSlice API endpoint.
      (PR #3995)
  • Added Inworld Realtime LLM service with WebSocket-based cascade STT/LLM/TTS, semantic VAD, function calling, and Router support.
    (PR #4140)

  • ⚠️ Added WebSocket-based OpenAIResponsesLLMService as the new default for the OpenAI Responses API. It maintains a persistent connection to wss://api.openai.com/v1/responses and automatically uses previous_response_id to send only incremental context, falling back to full context on reconnection or cache miss. The previous HTTP-based implementation is now available as OpenAIResponsesHttpLLMService.
    (PR #4141)

  • Added group_parallel_tools parameter to LLMService (default True). When True, all function calls from the same LLM response batch share a group ID and the LLM is triggered exactly once after the last call completes. Set to False to trigger inference independently for each function call result as it arrives.
    (PR #4217)
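    The difference between the two modes can be sketched in plain asyncio (illustrative only, not pipecat's internals; `tool` and `run_batch` are stand-ins):

```python
import asyncio

async def tool(name, delay):
    # Stand-in for one function call made by the LLM.
    await asyncio.sleep(delay)
    return f"{name}-result"

async def run_batch(calls, group_parallel_tools):
    triggers = []  # each entry represents one LLM inference trigger
    tasks = [asyncio.create_task(tool(n, d)) for n, d in calls]
    if group_parallel_tools:
        # All calls share one group: trigger once, after the last completes.
        triggers.append(list(await asyncio.gather(*tasks)))
    else:
        # Trigger inference independently for each result as it arrives.
        for fut in asyncio.as_completed(tasks):
            triggers.append([await fut])
    return triggers

grouped = asyncio.run(run_batch([("a", 0.02), ("b", 0.01)], True))
ungrouped = asyncio.run(run_batch([("a", 0.02), ("b", 0.01)], False))
```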

  • Added async function call support to register_function() and register_direct_function() via cancel_on_interruption=False. When set to False, the LLM continues the conversation immediately without waiting for the function result. The result is injected back into the context as a developer message once available, triggering a new LLM inference at that point.
    (PR #4217)
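    A minimal sketch of the deferred-result flow (plain Python; the names are illustrative, not pipecat's API):

```python
import asyncio

messages = []        # stand-in for the LLM context
inference_runs = []  # records each time inference is triggered

async def slow_lookup():
    await asyncio.sleep(0.01)
    return "42 degrees"

async def main():
    # The function call runs in the background; the LLM is not blocked on it.
    task = asyncio.create_task(slow_lookup())
    inference_runs.append("initial")       # conversation continues immediately
    result = await task                    # result arrives later...
    messages.append({"role": "developer", "content": result})
    inference_runs.append("after-result")  # ...and triggers a new inference

asyncio.run(main())
```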

  • Added enable_prompt_caching setting to AWSBedrockLLMService for Bedrock ConverseStream prompt caching.
    (PR #4219)

  • Added support for streaming intermediate results from async function calls. Call result_callback multiple times with properties=FunctionCallResultProperties(is_final=False) to push incremental updates, then call it once more (with is_final=True, the default) to deliver the final result. Only valid for functions registered with cancel_on_interruption=False.
    (PR #4230)
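    The callback pattern looks roughly like this (a sketch; `FunctionCallResultProperties` here is a local stand-in mirroring the name above, and `result_callback` simply records what it is given):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class FunctionCallResultProperties:
    # Local stand-in mirroring the property named in this release.
    is_final: bool = True

updates = []

async def result_callback(result, properties=None):
    props = properties or FunctionCallResultProperties()
    updates.append((result, props.is_final))

async def progress_reporter():
    # Incremental updates first, then one final result (is_final defaults to True).
    for pct in (25, 50, 75):
        await result_callback(f"{pct}% done",
                              FunctionCallResultProperties(is_final=False))
    await result_callback("complete")

asyncio.run(progress_reporter())
```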

  • Added LLMMessagesTransformFrame to facilitate programmatically editing context in a frame-based way.

    The previous approach required the caller to directly grab a reference to the context object, grab a "snapshot" of its messages at that point in time, transform the messages, and then push an LLMMessagesUpdateFrame with the transformed messages. This approach can lead to problems: what if there had already been a change to the context queued in the pipeline? The transformed messages would simply overwrite it without consideration.
    (PR #4231)

  • The development runner now exports a module-level app FastAPI instance (from pipecat.runner.run import app) so you can register custom routes before calling main().
    (PR #4234)

  • ToolsSchema now accepts custom_tools for OpenAI LLM services (OpenAILLMService, OpenAIResponsesLLMService, OpenAIResponsesHttpLLMService, and OpenAIRealtimeLLMService), letting you pass provider-specific tools like tool_search alongside standard function tools.
    (PR #4248)

  • Added enhancements to NvidiaTTSService:

    • Cross-sentence stitching: multiple sentences within an LLM turn are fed into a single SynthesizeOnline gRPC stream for seamless audio across sentence boundaries (requires Magpie TTS model v1.7.0+).
    • custom_dictionary and encoding parameters for IPA-based custom pronunciation and output audio encoding.
    • Metrics generation (can_generate_metrics returns True) and stop_all_metrics() when an audio context is interrupted.
    • gRPC error handling around synthesis config retrieval (GetRivaSynthesisConfig).
      (PR #4249)
  • Added MistralTTSService for streaming text-to-speech using Mistral's Voxtral TTS API (voxtral-mini-tts-2603). Supports SSE-based audio streaming with automatic resampling from the API's native 24kHz to any requested sample rate. Requires the mistral optional extra (pip install pipecat-ai[mistral]).
    (PR #4251)

  • Added truncate_large_values parameter to LLMContext.get_messages(). When True, returns compact deep copies of messages with binary data (base64 images, audio) replaced by short placeholders and long string values in LLM-specific messages recursively truncated. Useful for serialization, logging, and debugging tools.
    (PR #4272)
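    The effect is roughly the following (an illustrative helper, not pipecat's implementation; the placeholder text and length threshold are made up):

```python
import copy

PLACEHOLDER = "<...truncated...>"

def truncate_large_values(obj, max_len=64):
    """Compact deep copy: long strings become short placeholders."""
    if isinstance(obj, str):
        return obj if len(obj) <= max_len else obj[:16] + PLACEHOLDER
    if isinstance(obj, dict):
        return {k: truncate_large_values(v, max_len) for k, v in obj.items()}
    if isinstance(obj, list):
        return [truncate_large_values(v, max_len) for v in obj]
    return copy.deepcopy(obj)

msg = {"role": "user",
       "content": [{"type": "image", "data": "iVBOR" + "A" * 500}]}
compact = truncate_large_values(msg)  # safe to log; original is untouched
```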

  • CartesiaSTTService now supports runtime settings updates (e.g. changing language or model via STTUpdateSettingsFrame). The service automatically reconnects with the new parameters. Previously, settings updates were silently ignored.
    (PR #4282)

  • Added pcm_32000 and pcm_48000 sample rate support to ElevenLabs TTS services.
    (PR #4293)

  • Added enable_logging parameter to ElevenLabsHttpTTSService. Set to False to enable zero retention mode (enterprise only).
    (PR #4293)

Changed

  • Updated onnxruntime from 1.23.2 to 1.24.3, adding support for Python 3.14.
    (PR #3984)

  • MCPClient now requires async with MCPClient(...) as mcp: or explicit start()/close() calls to manage the connection lifecycle.
    (PR #4034)
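    The lifecycle contract can be sketched with a generic async context manager (illustrative; not MCPClient's actual internals):

```python
import asyncio

class Client:
    """Sketch of the contract: explicit start()/close(), or `async with`."""

    def __init__(self):
        self.connected = False

    async def start(self):
        self.connected = True   # open the connection here
        return self

    async def close(self):
        self.connected = False  # tear the connection down

    async def __aenter__(self):
        return await self.start()

    async def __aexit__(self, *exc):
        await self.close()

states = []

async def main():
    async with Client() as c:
        states.append(c.connected)  # connected inside the block
    states.append(c.connected)      # closed on exit

asyncio.run(main())
```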

  • ⚠️ Updated langchain extra to require langchain 1.x (from 0.3.x), langchain-community 0.4.x (from 0.3.x), and langchain-openai 1.x (from 0.3.x). If you pin these packages in your project, update your pins accordingly.
    (PR #4192)

  • WebsocketService reconnection errors are now non-fatal. When a websocket service exhausts its reconnection attempts (either via exponential backoff or quick failure detection), it emits a non-fatal ErrorFrame instead of a fatal one. This allows application-level failover (e.g. ServiceSwitcher) to handle the failure instead of killing the entire pipeline.
    (PR #4201)

  • Changed GrokLLMService default model from grok-3-beta to grok-3, now that the model is generally available.
    (PR #4209)

  • GoogleImageGenService now defaults to imagen-4.0-generate-001 (previously imagen-3.0-generate-002).
    (PR #4213)

  • ⚠️ BaseOpenAILLMService.get_chat_completions() now accepts an LLMContext instead of OpenAILLMInvocationParams. If you override this method, update your signature accordingly.
    (PR #4215)

  • When multiple function calls are returned in a single LLM response, by default (when group_parallel_tools=True) the LLM is now triggered exactly once after the last call in the batch completes, rather than waiting for all function calls.
    (PR #4217)

  • ⚠️ LLMService.function_call_timeout_secs now defaults to None instead of 10.0. Deferred function calls will run indefinitely unless a timeout is explicitly set at the service level or per-call. If you relied on the previous 10-second default, pass function_call_timeout_secs=10.0 explicitly.
    (PR #4224)

  • Updated NvidiaTTSService:

    • Made api_key optional for local NIM deployments.
    • Voice, language, and quality can be updated without reconnecting the gRPC client; new values take effect on the next synthesis turn, not for the current turn's in-flight requests.
    • Replaced per-sentence synchronous synthesize_online calls with async queue-backed gRPC streaming.
    • Streaming now uses asyncio tasks with explicit gRPC cancellation on interruption and stale-response filtering when a stream is aborted or replaced.
    • Renamed Riva references to Nemotron Speech in docs and messages.
    • Disabled automatic TTS start frames at the service level (push_start_frame=False) and emit TTSStartedFrame when a stitched synthesis stream is started for a context.
      (PR #4249)

Removed

  • ⚠️ Removed OpenPipeLLMService and the openpipe extra. OpenPipe was acquired by CoreWeave and the package is no longer maintained. If you were using openpipe as an LLM provider, switch to the underlying provider directly (e.g. openai). The OpenPipe interface can still be used with OpenAILLMService by specifying a base_url.
    (PR #4191)

  • ⚠️ Removed NoisereduceFilter. Use system-level noise reduction or a service-based alternative instead.
    (PR #4204)

  • ⚠️ Removed deprecated vad_enabled and vad_audio_passthrough transport params.
    (PR #4204)

  • ⚠️ Removed deprecated camera_in_enabled, camera_in_is_live, camera_in_width, camera_in_height, `came...

v0.0.108

28 Mar 04:48
a84c698

Added

  • Added SarvamLLMService with support for sarvam-30b, sarvam-30b-16k, sarvam-105b and sarvam-105b-32k.
    (PR #3978)

  • Added on_turn_context_created(context_id) hook to TTSService. Override this to perform provider-specific setup (e.g. eagerly opening a server-side context) before text starts flowing. Called each time a new turn context ID is created.
    (PR #4013)

  • Added XAIHttpTTSService for text-to-speech using xAI's HTTP TTS API.
    (PR #4031)

  • Added support for "developer" role messages in conversation context across all LLM adapters. For non-OpenAI services (Anthropic, Google, AWS Bedrock), "developer" messages are converted to "user" messages (use system_instruction to set the system instruction). For OpenAI services, "developer" messages pass through in conversation history. For the Responses API, they are kept as "developer" role (matching the existing "system" → "developer" conversion).
    (PR #4089)

  • Added SmallestTTSService, a WebSocket-based TTS service integration with Smallest AI's Waves API. Supports the Lightning v2 and v3.1 models with configurable voice, language, speed, consistency, similarity, and enhancement settings.
    (PR #4092)

  • Added warnings in turn stop strategies when VADParams.stop_secs differs from the recommended default (0.2s) or when stop_secs >= STT p99 latency, which collapses the STT wait timeout to 0s and may cause delayed turn detection. The warnings guide developers to re-run the stt-benchmark with their VAD settings.
    (PR #4115)

  • Added domain parameter to AssemblyAISTTSettings for specialized recognition modes such as Medical Mode (domain="medical-v1").
    (PR #4117)

  • Added NovitaLLMService for using Novita AI's LLM models via their OpenAI-compatible API.
    (PR #4119)

  • Added cleanup() method to VADAnalyzer and VADController so VAD analyzer resources are properly released when no longer needed. Custom VADAnalyzer subclasses can override cleanup() to free any held resources.
    (PR #4120)

  • Added on_end_of_turn event handler to AssemblyAISTTService. This fires after the final transcript is pushed, providing a reliable hook for end-of-turn logic that doesn't race with TranscriptionFrame. Works in both Pipecat and AssemblyAI turn detection modes.
    (PR #4128)

  • Added DeepgramFluxSageMakerSTTService for running Deepgram Flux speech-to-text on AWS SageMaker endpoints. Use with ExternalUserTurnStrategies to take advantage of Flux's turn detection.
    (PR #4143)

  • Added Mem0MemoryService.get_memories() convenience method for retrieving all stored memories outside the pipeline (e.g. to build a personalized greeting at connection time). This avoids the need to manually handle client type branching, filter construction, and async wrapping.
    (PR #4156)

Changed

  • Added context prewarming path for InworldTTSService to improve first audio latency.
    (PR #4013)

  • Added KrispVivaVadAnalyzer for Voice Activity Detection using the Krisp VIVA SDK (requires krisp_audio).
    (PR #4022)

  • Modified InworldTTSService to close context at end of turn instead of relying on idle timeout. (PR #4028)

  • Added Gemini 3 support to the Gemini Live service.
    (PR #4078)

  • TTSService: the default stop_frame_timeout_s (idle time before an automatic TTSStoppedFrame is pushed when push_stop_frames=True) has changed from 2.0 to 3.0 seconds.
    (PR #4084)

  • ⚠️ GeminiLLMAdapter now only treats messages[0] as the initial system message, matching all other adapters. Previously it searched for the first "system" message anywhere in the conversation history. A "system" message appearing later in the list will now be converted to "user" instead of being extracted as the system instruction.
    (PR #4089)

  • Fixed InworldTTSService to fall back to full text when TTS timestamps are not received.
    (PR #4113)

  • ⚠️ Realtime services (Gemini Live, OpenAI Realtime, Grok Realtime, Nova Sonic) now prefer system_instruction from service settings over an initial system message in the LLM context, matching the behavior of non-realtime services. Previously, context-provided system instructions took precedence. A warning is now logged when both are set.
    (PR #4130)

  • Bumped nvidia-riva-client minimum version to >=2.25.1.
    (PR #4136)

  • Upgraded protobuf from 5.x to 6.x (>=6.31.1,<7).
    (PR #4136)

  • Unrecognized language strings (e.g. Deepgram's "multi") no longer produce a warning at startup. The log message has been downgraded to debug level since these are valid service-specific values that are passed through correctly.
    (PR #4137)

  • GrokLLMService and GrokRealtimeLLMService now live in the pipecat.services.xai module alongside XAIHttpTTSService, since all three use the same xAI API. Update imports from pipecat.services.grok.* to pipecat.services.xai.* (e.g. from pipecat.services.xai.llm import GrokLLMService).
    (PR #4142)

  • ⚠️ Bumped mem0ai dependency from ~=0.1.94 to >=1.0.8,<2. Users of the mem0 extra will need to update their mem0ai package.
    (PR #4156)

Deprecated

  • pipecat.services.grok.llm, pipecat.services.grok.realtime.llm, and
    pipecat.services.grok.realtime.events are deprecated. The old import paths
    still work but emit a DeprecationWarning; use pipecat.services.xai.llm,
    pipecat.services.xai.realtime.llm, and
    pipecat.services.xai.realtime.events instead.
    (PR #4142)

Removed

  • ⚠️ TTSService.add_word_timestamps() no longer supports the "Reset" and "TTSStoppedFrame" sentinel strings. If you have a custom TTS service that called await self.add_word_timestamps([("Reset", 0)]) or await self.add_word_timestamps([("TTSStoppedFrame", 0), ("Reset", 0)], ctx_id), replace them with await self.append_to_audio_context(ctx_id, TTSStoppedFrame(context_id=ctx_id)) and let _handle_audio_context manage the word-timestamp reset automatically.
    (PR #4145)

  • Removed SambaNovaSTTService. SambaNova no longer offers speech-to-text audio models. Use another STT provider instead.
    (PR #4154)

Fixed

  • Fixed Gemini Live (GoogleGeminiLiveLLMService) not honoring settings.system_instruction. The system instruction was being read from a deprecated constructor parameter instead of the settings object, causing it to be silently ignored.
    (PR #4089)

  • Fixed AWSBedrockLLMAdapter sending an empty message list to the API when the only message in context was a system message. The lone system message is now converted to "user" role instead of being extracted, matching the existing Anthropic adapter behavior.
    (PR #4089)

  • Fixed Gemini Live pipeline hanging indefinitely when an EndFrame was deferred while waiting for the bot to finish responding and turn_complete never arrived. As a possible root-cause fix, turn_complete messages are now handled even if they lack usage_metadata. As a fallback, the deferred EndFrame now has a 30-second safety timeout.
    (PR #4125)

  • Fixed ElevenLabs WebSocket disconnections (1008 "Maximum simultaneous contexts exceeded") caused by rapid user interruptions. When interruptions arrived before any TTS text was generated, phantom contexts were created on the ElevenLabs server that were never closed, eventually exceeding the 5-context limit.
    (PR #4126)

  • Fixed the final sentence being dropped from the conversation context when using RTVI text input with non-word-timestamp TTS services. The LLMFullResponseEndFrame was racing ahead of the last TTSTextFrame, causing the LLMAssistantAggregator to finalize the context before the final sentence arrived.
    (PR #4127)

  • Fixed audio crackling and popping in recordings when both user and bot are speaking. AudioBufferProcessor no longer injects silence into a track's buffer while that track is actively producing audio, preventing mid-utterance interruptions in the recorded output.
    (PR #4135)

  • Fixed websocket TTS word timestamps so interrupted contexts cannot leak stale words or backward PTS values into later turns.
    (PR #4145)
    ...

v0.0.107

24 Mar 03:18
7414b30

Added

  • Added frame_order parameter to SyncParallelPipeline. Set frame_order=FrameOrder.PIPELINE to push synchronized output frames in pipeline definition order (all frames from the first pipeline, then the second, etc.) instead of the default arrival order.
    (PR #4029)

  • Added sync_with_audio field to OutputImageRawFrame. When set to True, the output transport queues image frames with audio so they are displayed only after all preceding audio has been sent, enabling synchronized audio/image playback.
    (PR #4029)

  • Added OpenAIResponsesLLMService, a new LLM service that uses the OpenAI Responses API. Supports streaming text, function calling, usage metrics, and out-of-band inference. Works with the universal LLMContext and LLMContextAggregatorPair. See examples/foundational/07-interruptible-openai-responses.py and 14-function-calling-openai-responses.py.
    (PR #4074)

  • Added audio_out_auto_silence parameter to TransportParams (defaults to True). When set to False, the transport waits for audio data instead of inserting silence when the output queue is empty, which is useful for scenarios that require uninterrupted audio playback without artificial gaps.
    (PR #4104)

Changed

  • Renamed tracing span attributes to align with OpenTelemetry GenAI semantic conventions: gen_ai.system to gen_ai.provider.name, system to gen_ai.system_instructions, gen_ai.usage.cache_read_input_tokens to gen_ai.usage.cache_read.input_tokens, and gen_ai.usage.cache_creation_input_tokens to gen_ai.usage.cache_creation.input_tokens.
    (PR #3449)

  • DeepgramSageMakerTTSService now correctly routes audio through the base TTSService audio context queue. Audio frames are delivered via append_to_audio_context() instead of being pushed directly, enabling proper ordering, interruption handling, and start/stop frame lifecycle management. Interruptions now trigger a Clear message to Deepgram (flushing its text buffer) at the right time via on_audio_context_interrupted.
    (PR #4083)

  • GradiumTTSService now sends a per-context setup message with client_req_id before the first text message for each TTS context, following Gradium's multiplexing protocol. Previously, a single setup message was sent at connection time without a client_req_id, which prevented Gradium from associating requests with their sessions when using close_ws_on_eos=False.
    (PR #4091)

Fixed

  • Fixed stale system_instruction in LLM tracing spans by reading from _settings.system_instruction instead of the removed _system_instruction attribute.
    (PR #3449)

  • Fixed SyncParallelPipeline breaking the Whisker debugger.
    (PR #4029)

  • Fixed SyncParallelPipeline race condition where concurrent SystemFrame processing (e.g. from RTVI) could corrupt sink queues and cause deadlocks. SystemFrames now take a fast path that passes them through without draining queued output.
    (PR #4029)

  • Fixed TTS frame ordering so that non-system frames always arrive in correct order relative to the TTSStartedFrame/TTSAudioRawFrame/TTSStoppedFrame sequence. Previously these frames could race ahead of or behind audio context frames, producing out-of-order output downstream.
    (PR #4075)

  • Fixed SarvamTTSService so audio and error frames route through append_to_audio_context() instead of push_frame(), ensuring correct behavior with audio contexts and interruptions.
    (PR #4082)

  • Fixed audio frame ordering and interruption handling in Fish Audio, LMNT, Neuphonic, and Rime NonJson TTS services. These services were bypassing the base TTSService audio context serialization queue by pushing audio frames directly, which could cause out-of-order frames and broken interruptions during speech.
    (PR #4090)

  • Fixed Genesys AudioHook serializer to always include the parameters field in protocol messages. The AudioHook protocol requires every message to carry a parameters object (even if empty), but _create_message omitted it when no parameters were provided. This caused clients that validate message structure (including the Genesys reference implementation) to reject pong and parameter-less closed responses, breaking server sequence tracking and preventing outputVariables from reaching the Architect flow.
    (PR #4093)

v0.0.106

19 Mar 06:43
8750c26

Added

  • Added optional service field to ServiceUpdateSettingsFrame (and its subclasses LLMUpdateSettingsFrame, TTSUpdateSettingsFrame, STTUpdateSettingsFrame) to target a specific service instance. When service is set, only the matching service applies the settings; others forward the frame unchanged. This enables updating a single service when multiple services of the same type exist in the pipeline.
    (PR #4004)
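    The routing rule can be sketched like this (an illustrative stand-in, not pipecat's frame classes):

```python
from dataclasses import dataclass

@dataclass
class UpdateSettingsFrame:
    # Illustrative stand-in for the frames named above.
    settings: dict
    service: object = None  # None = applies to all services of this type

class Service:
    def __init__(self, name):
        self.name = name
        self.settings = {}

    def process(self, frame):
        # Apply only if untargeted, or targeted at this exact instance;
        # otherwise the frame would be forwarded unchanged.
        if frame.service is None or frame.service is self:
            self.settings.update(frame.settings)

tts_a, tts_b = Service("a"), Service("b")
frame = UpdateSettingsFrame({"voice": "nova"}, service=tts_b)
for svc in (tts_a, tts_b):
    svc.process(frame)
```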

  • Added sip_provider and room_geo parameters to configure() in the Daily runner. These convenience parameters let callers specify a SIP provider name and geographic region directly without manually constructing DailyRoomProperties and DailyRoomSipParams.
    (PR #4005)

  • Added PerplexityLLMAdapter that automatically transforms conversation messages to satisfy Perplexity's stricter API constraints (strict role alternation, no non-initial system messages, last message must be user/tool). Previously, certain conversation histories could cause Perplexity API errors that didn't occur with OpenAI (PerplexityLLMService subclasses OpenAILLMService since Perplexity uses an OpenAI-compatible API).
    (PR #4009)

  • Added DTMF input event support to the Daily transport. Incoming DTMF tones are now received via Daily's on_dtmf_event callback and pushed into the pipeline as InputDTMFFrame, enabling bots to react to keypad presses from phone callers.
    (PR #4047)

  • Added WakePhraseUserTurnStartStrategy for triggering user turns based on wake phrases, with support for single_activation mode. Deprecates WakeCheckFilter.
    (PR #4064)

  • Added default_user_turn_start_strategies() and default_user_turn_stop_strategies() helper functions for composing custom strategy lists.
    (PR #4064)

Changed

  • Changed tool result JSON serialization to use ensure_ascii=False, preserving UTF-8 characters instead of escaping them. This reduces context size and token usage for non-English languages.
    (PR #3457)
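    For example, with the standard library:

```python
import json

tool_result = {"answer": "día soleado, 25°C"}

escaped = json.dumps(tool_result)                      # default: \uXXXX escapes
compact = json.dumps(tool_result, ensure_ascii=False)  # UTF-8 preserved

# The escaped form spends extra characters (and tokens) on every
# non-ASCII code point.
assert len(compact) < len(escaped)
```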

  • OpenAIRealtimeSTTService's noise_reduction parameter is now part of OpenAIRealtimeSTTSettings, making it runtime-updatable via STTUpdateSettingsFrame. The direct noise_reduction init argument is deprecated as of 0.0.106.
    (PR #3991)

  • Updated sarvamai dependency from 0.1.26a2 (alpha) to 0.1.26 (stable release).
    (PR #3997)

  • SimliVideoService now extends AIService instead of FrameProcessor, aligning it with the HeyGen and Tavus video services. It supports SimliVideoService.Settings(...) for configuration and uses start()/stop()/cancel() lifecycle methods. Existing constructor usage (api_key, face_id, etc.) remains unchanged.
    (PR #4001)

  • Updated pipecat-ai-small-webrtc-prebuilt to 2.4.0.
    (PR #4023)

  • Nova Sonic assistant text transcripts are now delivered in real-time using speculative text events instead of delayed final text events. Previously, assistant text only arrived after all audio had finished playing, causing laggy transcripts in client UIs. Speculative text arrives before each audio chunk, providing text synchronized with what the bot is saying. This also simplifies the internal text handling by removing the interruption re-push hack and assistant text buffer.
    (PR #4042)

  • Updated daily-python dependency to 0.25.0.
    (PR #4047)

  • Added enable_dialout parameter to configure() in pipecat.runner.daily to support dial-out rooms. Also narrowed misleading Optional type hints and deduplicated token expiry calculation.
    (PR #4048)

  • Extended ProcessFrameResult to stop strategies, allowing a stop strategy to short-circuit evaluation of subsequent strategies by returning STOP.
    (PR #4064)

  • GradiumSTTService now takes both encoding and sample_rate constructor arguments, which are assembled internally to form the input_format. PCM accepts 8000, 16000, and 24000 Hz sample rates.
    (PR #4066)

  • Improved GradiumSTTService transcription accuracy by reworking how text fragments are accumulated and finalized. Previously, trailing words could be dropped when the server's flushed response arrived before all text tokens were delivered. The service now uses a short aggregation delay after flush to capture trailing tokens, producing complete utterances.
    (PR #4066)

Deprecated

  • SimliVideoService.InputParams is deprecated. Use the direct constructor parameters max_session_length, max_idle_time, and enable_logging instead.
    (PR #4001)

  • Deprecated LocalSmartTurnAnalyzerV2 and LocalCoreMLSmartTurnAnalyzer. Use LocalSmartTurnAnalyzerV3 instead. Instantiating these analyzers will now emit a DeprecationWarning.
    (PR #4012)

  • Deprecated WakeCheckFilter in favor of WakePhraseUserTurnStartStrategy.
    (PR #4064)

Fixed

  • Fixed an issue where the default model for OpenAILLMService and AzureLLMService was mistakenly reverted to gpt-4o. The defaults are now restored to gpt-4.1.
    (PR #4000)

  • Fixed a race condition where EndTaskFrame could cause the pipeline to shut down before in-flight frames (e.g. LLM function call responses) finished processing. EndTaskFrame and StopTaskFrame now flow through the pipeline as ControlFrames, ensuring all pending work is flushed before shutdown begins. CancelTaskFrame and InterruptionTaskFrame remain immediate (SystemFrame).
    (PR #4006)

  • Fixed ParallelPipeline dropping or misordering frames during lifecycle synchronization. Buffered frames are now flushed in the correct order relative to synchronization frames (StartFrame goes first, EndFrame/CancelFrame go after), and frames added to the buffer during flush are also drained.
    (PR #4007)

  • Fixed TTSService potentially canceling in-flight audio during shutdown. The stop sequence now waits for all queued audio contexts to finish processing before canceling the stop frame task.
    (PR #4007)

  • Fixed Language enum values (e.g. Language.ES) not being converted to service-specific codes when passed via settings=Service.Settings(language=Language.ES) at init time. This caused API errors (e.g. 400 from Rime) because the raw enum was sent instead of the expected language code (e.g. "spa"). Runtime updates via UpdateSettingsFrame were unaffected. The fix centralizes conversion in the base TTSService and STTService classes so all services handle this consistently.
    (PR #4024)

  • Fixed DeepgramSTTService ignoring the base_url scheme when using ws:// or http://. Previously these were silently overwritten with wss:// / https://, breaking air-gapped or private deployments that don't use TLS. All scheme choices (wss://, https://, ws://, http://, or bare hostname) are now respected.
    (PR #4026)

  • Fixed LLMSwitcher.register_function() and register_direct_function() not accepting or forwarding the timeout_secs parameter.
    (PR #4037)

  • Fixed empty user transcriptions in Nova Sonic causing spurious interruptions. Previously, an empty transcription could trigger an interruption of the assistant's response even though the user hadn't actually spoken.
    (PR #4042)

  • Fixed SonioxSTTService and OpenAIRealtimeSTTService crash when language parameters contain plain strings instead of Language enum values.
    (PR #4046)

  • Fixed premature user turn stops caused by late transcriptions arriving between turns. A stale transcript from the previous turn could persist into the next turn and trigger a stop before the current turn's real transcript arrived. Stop strategies are now reset at both turn start and turn stop to prevent state from leaking across turn boundaries.
    (PR #4057)

  • Fixed raw language strings like "de-DE" silently failing when passed to TTS/STT services (e.g. ElevenLabs producing no audio). Raw strings now go through the same Language enum resolution as enum values, so regional codes like "de-DE" are properly converted to service-expected formats like "de". Unrecognized strings log a warning instead of failing silently.
    (PR #4058)
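    The resolution order is roughly the following (a sketch with a made-up supported set, not any service's real language table):

```python
SUPPORTED = {"de", "en", "es"}  # hypothetical codes a service accepts

def resolve_language(value):
    """Resolve raw strings like 'de-DE' to a service-expected base code."""
    code = value.lower()
    if code in SUPPORTED:
        return code
    base = code.split("-")[0]  # regional code -> base language
    if base in SUPPORTED:
        return base
    # Unrecognized: warn and pass through instead of failing silently.
    print(f"warning: unrecognized language {value!r}, passing through")
    return value

resolved = resolve_language("de-DE")  # -> "de"
```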

  • Fixed Deepgram STT list-type settings (keyterm, keywords, search, redact, replace) being stringified instead of passed as lists to the SDK, which caused them to be sent as literal strings (e.g. "['pipecat']") in the WebSocket query params.
    (PR #4063)

  • ...

v0.0.105

11 Mar 01:01
7e88b13

Added

  • Added concurrent audio context support: CartesiaTTSService can now synthesize the next sentence while the previous one is still playing, by setting pause_frame_processing=False and routing each sentence through its own audio context queue.
    (PR #3804)

  • Added custom video track support to Daily transport. Use video_out_destinations in DailyParams to publish multiple video tracks simultaneously, mirroring the existing audio_out_destinations feature.
    (PR #3831)

  • Added ServiceSwitcherStrategyFailover that automatically switches to the next service when the active service reports a non-fatal error. Recovery policies can be implemented via the on_service_switched event handler.
    (PR #3861)

  • Added optional timeout_secs parameter to register_function() and register_direct_function() for per-tool function call timeout control, overriding the global function_call_timeout_secs default.
    (PR #3915)
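    The override semantics can be sketched like so (illustrative; call_with_timeout and GLOBAL_TIMEOUT are stand-ins, not pipecat's code):

```python
import asyncio

GLOBAL_TIMEOUT = 5.0  # stands in for the service-level function_call_timeout_secs

async def call_with_timeout(fn, *, timeout_secs=None):
    # A per-call timeout_secs overrides the service-level default.
    effective = timeout_secs if timeout_secs is not None else GLOBAL_TIMEOUT
    return await asyncio.wait_for(fn(), timeout=effective)

async def slow_tool():
    await asyncio.sleep(0.05)
    return "ok"

try:
    # A tight per-call timeout trumps the generous global default.
    asyncio.run(call_with_timeout(slow_tool, timeout_secs=0.01))
    outcome = "completed"
except asyncio.TimeoutError:
    outcome = "timed out"
```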

  • Added cloud-audio-only recording option to Daily transport's enable_recording property.
    (PR #3916)

  • Wired up system_instruction in BaseOpenAILLMService, AnthropicLLMService, and AWSBedrockLLMService so it works as a default system prompt, matching the behavior of the Google services. This enables sharing a single LLMContext across multiple LLM services, where each service provides its own system instruction independently.

    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        system_instruction="You are a helpful assistant.",
    )
    
    context = LLMContext()
    
    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        context.add_message({"role": "user", "content": "Please introduce yourself."})
        await task.queue_frames([LLMRunFrame()])

    (PR #3918)

  • Added vad_threshold parameter to AssemblyAIConnectionParams for configuring voice activity detection sensitivity in U3 Pro. Aligning this with external VAD thresholds (e.g., Silero VAD) prevents the "dead zone" where AssemblyAI transcribes speech that VAD hasn't detected yet.
    (PR #3927)

  • Added push_empty_transcripts parameter to BaseWhisperSTTService and OpenAISTTService to allow empty transcripts to be pushed downstream as TranscriptionFrame instead of discarding them (the default behavior). This is intended for situations where VAD fires even though the user did not speak. In these cases, it is useful to know that nothing was transcribed so that the agent can resume speaking, instead of waiting longer for a transcription.
    (PR #3930)

  • LLM services (BaseOpenAILLMService, AnthropicLLMService, AWSBedrockLLMService) now log a warning when both system_instruction and a system message in the context are set. The constructor's system_instruction takes precedence.
    (PR #3932)

  • Runtime settings updates (via STTUpdateSettingsFrame) now work for AWS Transcribe, Azure, Cartesia, Deepgram, ElevenLabs Realtime, Gradium, and Soniox STT services. Previously, changing settings at runtime only stored the new values without reconnecting.
    (PR #3946)

  • Exposed on_summary_applied event on LLMAssistantAggregator, allowing users to listen for context summarization events without accessing private members.
    (PR #3947)

  • Deepgram Flux STT settings (keyterm, eot_threshold, eager_eot_threshold, eot_timeout_ms) can now be updated mid-stream via STTUpdateSettingsFrame without triggering a reconnect. The new values are sent to Deepgram as a Configure WebSocket message on the existing connection.
    (PR #3953)

  • Added system_instruction parameter to run_inference across all LLM services, allowing callers to override the system prompt for one-shot inference calls. Used by _generate_summary to pass the summarization prompt cleanly.
    (PR #3968)

Changed

  • Audio context management (previously in AudioContextTTSService) is now built into TTSService. All WebSocket providers (cartesia, elevenlabs, asyncai, inworld, rime, gradium, resembleai) now inherit from WebsocketTTSService directly. Word-timestamp baseline is set automatically on the first audio chunk of each context instead of requiring each provider to call start_word_timestamps() in their receive loop.
    (PR #3804)

  • Daily transport now uses CustomVideoSource/CustomVideoTrack instead of VirtualCameraDevice for the default camera output, mirroring how audio already works with CustomAudioSource/CustomAudioTrack.
    (PR #3831)

  • ⚠️ Updated DeepgramSTTService to use deepgram-sdk v6. The LiveOptions class was removed from the SDK and is now provided by pipecat directly; import it from pipecat.services.deepgram.stt instead of deepgram.
    (PR #3848)

  • ServiceSwitcherStrategy base class now provides a handle_error() hook for subclasses to implement error-based switching. ServiceSwitcher defaults to ServiceSwitcherStrategyManual and strategy_type is now optional.
    (PR #3861)

  • Support for Voice Focus 2.0 models.

    • Updated aic-sdk to ~=2.1.0 to support Voice Focus 2.0 models.
    • Cleaned up unused ParameterFixedError exception handling in AICFilter
      parameter setup.
      (PR #3889)
  • max_context_tokens and max_unsummarized_messages in LLMAutoContextSummarizationConfig (and deprecated LLMContextSummarizationConfig) can now be set to None independently to disable that summarization threshold. At least one must remain set.
    (PR #3914)
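
The "at least one must remain set" rule can be sketched as a simple validation step. This is illustrative only (the function name is hypothetical; the actual config class may enforce the constraint differently):

```python
def check_summarization_thresholds(max_context_tokens, max_unsummarized_messages):
    """Illustrative check: at least one summarization threshold must be set."""
    if max_context_tokens is None and max_unsummarized_messages is None:
        raise ValueError(
            "Set max_context_tokens and/or max_unsummarized_messages; "
            "both cannot be None."
        )

# Disabling only the message-count threshold is fine:
check_summarization_thresholds(8000, None)
```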

  • ⚠️ Removed formatted_finals and word_finalization_max_wait_time from AssemblyAIConnectionParams as these were v2 API parameters not supported in v3. Clarified that format_turns only applies to Universal-Streaming models; U3 Pro has automatic formatting built-in.
    (PR #3927)

  • Changed DeepgramTTSService to send a Clear message on interruption instead of disconnecting and reconnecting the WebSocket, allowing the connection to persist throughout the session.
    (PR #3958)

  • Re-added enhancement_level support to AICFilter with runtime FilterEnableFrame control, applying ProcessorParameter.Bypass and ProcessorParameter.EnhancementLevel together.
    (PR #3961)

  • Updated daily-python dependency from ~=0.23.0 to ~=0.24.0.
    (PR #3970)

  • Updated FishAudioTTSService default model from s1 to s2-pro, matching Fish Audio's latest recommended model for improved quality and speed.
    (PR #3973)

  • AzureSTTService region parameter is now optional when private_endpoint is provided. A ValueError is raised if neither is given, and a warning is logged if both are provided (private_endpoint takes priority).
    (PR #3974)

Deprecated

  • Deprecated AudioContextTTSService and AudioContextWordTTSService. Subclass WebsocketTTSService directly instead; audio context management is now part of the base TTSService.

    • Deprecated WordTTSService, WebsocketWordTTSService, and InterruptibleWordTTSService. Word timestamp logic is now always active in TTSService and no longer needs to be opted into via a subclass.
      (PR #3804)
  • Deprecated pipecat.services.google.llm_vertex, pipecat.services.google.llm_openai, and pipecat.services.google.gemini_live.llm_vertex modules. Use pipecat.services.google.vertex.llm, pipecat.services.google.openai.llm, and pipecat.services.google.gemini_live.vertex.llm instead. The old import paths still work but will emit a DeprecationWarning.
    (PR #3980)

Removed

  • ⚠️ Removed supports_word_timestamps parameter from TTSService.__init__(). Word timestamp logic is now always active. Remove this argument from any custom subclass super().__init__() calls.
    (PR #3804)

Fixed

  • Fixed DeepgramSTTService keepalive ping timeout disconnections. The deepgram-sdk v6 removed automatic keepalive; pipecat now sends explicit KeepAlive messages every 5 seconds, within the recommended 3–5 second interval before Deepgram's 10-second inactivity timeout.
    (PR #3848)
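
A generic version of this fixed-interval keepalive pattern (a sketch, not pipecat's actual implementation) looks like:

```python
import asyncio

async def keepalive_loop(send, interval: float, stop: asyncio.Event):
    # Periodically invoke `send` until `stop` is set, mirroring the
    # "send a KeepAlive every N seconds" pattern described above.
    while not stop.is_set():
        await send()
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # interval elapsed with no stop request; send again

async def demo():
    sent = []
    stop = asyncio.Event()

    async def send():
        sent.append("KeepAlive")  # in practice: a JSON message on the socket
        if len(sent) >= 3:
            stop.set()

    await keepalive_loop(send, interval=0.01, stop=stop)
    return sent

messages = asyncio.run(demo())
```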

  • Fixed BufferError: Existing exports of data: object cannot be re-sized in AICFilter caused by holding a memoryview on the mutable audio buffer across async yield points.
    (PR #3889)
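
The failure mode is easy to reproduce in isolation: resizing a bytearray while a memoryview still references it raises exactly this error.

```python
buf = bytearray(b"\x00" * 4)   # stand-in for the mutable audio buffer
view = memoryview(buf)         # an "export" of the buffer's data

try:
    buf.extend(b"\x00" * 4)    # resize attempt while the view is alive
    resized_while_exported = True
except BufferError:            # "Existing exports of data: object cannot be re-sized"
    resized_while_exported = False

view.release()                 # the fix: drop the view before any resize
buf.extend(b"\x00" * 4)        # now resizing succeeds
```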

  • Fixed TTS context not being appended to the assistant message history when using TTSSpeakFrame with append_to_context=True with some TTS providers.
    (PR #3936)

Read more

v0.0.104

03 Mar 05:25
5940731


Added

  • Added TextAggregationMetricsData metric measuring the time from the first LLM token to the first complete sentence, representing the latency cost of sentence aggregation in the TTS pipeline.
    (PR #3696)

  • Added support for using strongly-typed objects instead of dicts for updating service settings at runtime.

    Instead of, say:

    await task.queue_frame(
        STTUpdateSettingsFrame(settings={"language": Language.ES})
    )

    you'd do:

    await task.queue_frame(
        STTUpdateSettingsFrame(delta=DeepgramSTTSettings(language=Language.ES))
    )

    Each service now vends strongly-typed classes like DeepgramSTTSettings representing the service's runtime-updatable settings.
    (PR #3714)
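
As a rough sketch of the pattern (STTSettingsSketch is illustrative; the real classes like DeepgramSTTSettings have more fields, and given_fields is shown here with an assumed shape):

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class STTSettingsSketch:
    # None means "not part of this delta"; only set fields are applied.
    language: Optional[str] = None
    model: Optional[str] = None

    def given_fields(self) -> dict:
        # Assumed behavior: return only the explicitly provided fields.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

delta = STTSettingsSketch(language="es")
```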

  • Added support for specifying private endpoints for Azure Speech-to-Text, enabling use in private networks behind firewalls.
    (PR #3764)

  • Added LemonSliceTransport and LemonSliceApi to support adding real-time LemonSlice Avatars to any Daily room.
    (PR #3791)

  • Added output_medium parameter to AgentInputParams and OneShotInputParams in Ultravox service to control initial output medium (text or voice) at call creation time.
    (PR #3806)

  • Added TurnMetricsData as a generic metrics class for turn detection, with e2e processing time measurement. KrispVivaTurn now emits TurnMetricsData with e2e_processing_time_ms tracking the interval from VAD speech-to-silence transition to turn completion.
    (PR #3809)

  • Added on_audio_context_interrupted() and on_audio_context_completed() callbacks to AudioContextTTSService. Subclasses can override these to perform provider-specific cleanup instead of overriding _handle_interruption().
    (PR #3814)

  • Added on_summary_applied event to LLMContextSummarizer for observability, providing message counts before and after context summarization.
    (PR #3855)

  • Added summary_message_template to LLMContextSummarizationConfig for customizing how summaries are formatted when injected into context (e.g., wrapping in XML tags).
    (PR #3855)

  • Added summarization_timeout to LLMContextSummarizationConfig (default 120s) to prevent hung LLM calls from permanently blocking future summarizations.
    (PR #3855)

  • Added optional llm field to LLMContextSummarizationConfig for routing summarization to a dedicated LLM service (e.g., a cheaper/faster model) instead of the pipeline's primary model.
    (PR #3855)

  • Added AssemblyAI u3-rt-pro model support with a built-in turn detection mode.
    (PR #3856)

  • Added LLMSummarizeContextFrame to trigger on-demand context summarization from anywhere in the pipeline (e.g. a function call tool). Accepts an optional config: LLMContextSummaryConfig to override summary generation settings per request.
    (PR #3863)

  • Added LLMContextSummaryConfig (summary generation params: target_context_tokens, min_messages_after_summary, summarization_prompt) and LLMAutoContextSummarizationConfig (auto-trigger thresholds: max_context_tokens, max_unsummarized_messages, plus a nested summary_config). These replace the monolithic LLMContextSummarizationConfig.
    (PR #3863)

  • Added support for the speed_alpha parameter to the arcana model in RimeTTSService.
    (PR #3873)

  • Added ClientConnectedFrame, a new SystemFrame pushed by all transports (Daily, LiveKit, FastAPI WebSocket, WebSocket Server, SmallWebRTC, HeyGen, Tavus) when a client connects. Enables observers to track transport readiness timing.
    (PR #3881)

  • Added StartupTimingObserver for measuring how long each processor's start() method takes during pipeline startup. Also measures transport readiness — the time from StartFrame to first client connection — via the on_transport_timing_report event.
    (PR #3881)

  • Added BotConnectedFrame for SFU transports and on_transport_timing_report event to StartupTimingObserver with bot and client connection timing.
    (PR #3881)

  • Added optional direction parameter to PipelineTask.queue_frame() and PipelineTask.queue_frames(), allowing frames to be pushed upstream from the end of the pipeline.
    (PR #3883)

  • Added on_latency_breakdown event to UserBotLatencyObserver providing per-service TTFB, text aggregation, user turn duration, and function call latency metrics for each user-to-bot response cycle.
    (PR #3885)

  • Added on_first_bot_speech_latency event to UserBotLatencyObserver measuring the time from client connection to first bot speech. An on_latency_breakdown is also emitted for this first speech event.
    (PR #3885)

  • Added broadcast_interruption() to FrameProcessor. This method pushes an InterruptionFrame both upstream and downstream directly from the calling processor, avoiding the round-trip through the pipeline task that push_interruption_task_frame_and_wait() required.
    (PR #3896)

Changed

  • Added text_aggregation_mode parameter to TTSService and all TTS subclasses with a new TextAggregationMode enum (SENTENCE, TOKEN). All text now flows through text aggregators regardless of mode, enabling pattern detection and tag handling in TOKEN mode.
    (PR #3696)

  • ⚠️ Refactored runtime-updatable service settings to use strongly-typed classes (TTSSettings, STTSettings, LLMSettings, and service-specific subclasses) instead of plain dicts. Each service's _settings now holds these strongly-typed objects. For service maintainers, see changes in COMMUNITY_INTEGRATIONS.md.
    (PR #3714)

  • Word timestamp support has been moved from WordTTSService into TTSService via a new supports_word_timestamps parameter. Services that previously extended WordTTSService, AudioContextWordTTSService, or WebsocketWordTTSService now pass supports_word_timestamps=True to their parent __init__ instead.
    (PR #3786)

  • Improved Ultravox TTFB measurement accuracy by using VAD speech end time instead of UserStoppedSpeakingFrame timing.
    (PR #3806)

  • Aligned UltravoxRealtimeLLMService frame handling with OpenAI/Gemini realtime services: added InterruptionFrame handling with metrics cleanup, processing metrics at response boundaries, and improved agent transcript handling for both voice and text output modalities.
    (PR #3806)

  • Updated OpenAIRealtimeLLMService default model to gpt-realtime-1.5.
    (PR #3807)

  • Added api_key parameter to KrispVivaSDKManager, KrispVivaTurn, and KrispVivaFilter for Krisp SDK v1.6.1+ licensing. Falls back to KRISP_VIVA_API_KEY environment variable.
    (PR #3809)

  • Bumped nltk minimum version from 3.9.1 to 3.9.3 to resolve a security vulnerability.
    (PR #3811)

  • ServiceSettingsUpdateFrames are now UninterruptibleFrames. Generally speaking, you don't want a user interruption to prevent a service setting change from going into effect. Note that you usually don't use ServiceSettingsUpdateFrame directly; instead, you use one of its subclasses:

    • LLMUpdateSettingsFrame
    • TTSUpdateSettingsFrame
    • STTUpdateSettingsFrame
      (PR #3819)
  • Updated context summarization to use user role instead of assistant for summary messages.
    (PR #3855)

  • Renamed the AssemblyAISTTService parameter min_end_of_turn_silence_when_confident to min_turn_silence (the old name is still supported with a deprecation warning).
    (PR #3856)

  • ⚠️ Renamed LLMAssistantAggregatorParams fields: enable_context_summarization → enable_auto_context_summarization and context_summarization_config → auto_context_summarization_config (now accepts LLMAutoContextSummarizationConfig). The old names still work with a DeprecationWarning for one release cycle.
    (PR #3863)

  • ElevenLabsRealtimeSTTService now sets TranscriptionFrame.finalized to True when using CommitStrategy.MANUAL.
    (PR #3865)

  • Relaxed the numba version pin from an exact pin (==) to >=0.61.2.
    (PR #3868)

  • Updated tracing code to use ServiceSettings dataclass API (given_fields(), attribute access) instead of dict-style access (.items(), in, subscript).
    (PR [...

Read more

v0.0.103

21 Feb 00:47
b67af19


Added

  • Added "timestampTransportStrategy": "ASYNC" to InworldAITTSService. This allows timestamp information to trail audio chunk arrival, resulting in much better first-audio-chunk latency.
    (PR #3625)

  • Added model-specific InputParams to RimeTTSService: arcana params (repetition_penalty, temperature, top_p) and mistv2 params (no_text_normalization, save_oovs, segment). Model, voice, and param changes now trigger WebSocket reconnection.
    (PR #3642)

  • Added write_transport_frame() hook to BaseOutputTransport allowing transport subclasses to handle custom frame types that flow through the audio queue.
    (PR #3719)

  • Added DailySIPTransferFrame and DailySIPReferFrame to the Daily transport. These frames queue SIP transfer and SIP REFER operations with audio, so the operation executes only after the bot finishes its current utterance.
    (PR #3719)

  • Added keepalive support to SarvamSTTService to prevent idle connection timeouts (e.g. when used behind a ServiceSwitcher).
    (PR #3730)

  • Added UserIdleTimeoutUpdateFrame to enable or disable user idle detection at runtime by updating the timeout dynamically.
    (PR #3748)

  • Added broadcast_sibling_id field to the base Frame class. This field is automatically set by broadcast_frame() and broadcast_frame_instance() to the ID of the paired frame pushed in the opposite direction, allowing receivers to identify broadcast pairs.
    (PR #3774)

  • Added ignored_sources parameter to RTVIObserverParams and add_ignored_source()/remove_ignored_source() methods to RTVIObserver to suppress RTVI messages from specific pipeline processors (e.g. a silent evaluation LLM).
    (PR #3779)

  • Added DeepgramSageMakerTTSService for running Deepgram TTS models deployed on AWS SageMaker endpoints via HTTP/2 bidirectional streaming. Supports the Deepgram TTS protocol (Speak, Flush, Clear, Close), interruption handling, and per-turn TTFB metrics.
    (PR #3785)

Changed

  • ⚠️ RimeTTSService now defaults to model="arcana" and the wss://users-ws.rime.ai/ws3 endpoint. InputParams defaults changed from mistv2-specific values to None — only explicitly-set params are sent as query params.
    (PR #3642)

  • AICFilter now shares read-only AIC models via a singleton AICModelManager
    in aic_filter.py.

    • Multiple filters using the same model path or (model_id, model_download_dir) share one loaded model, with reference counting and concurrent load deduplication.
    • Model file I/O runs off the event loop so the filter does not block.
      (PR #3684)
  • Added X-User-Agent and X-Request-Id headers to InworldTTSService for better traceability.
    (PR #3706)

  • DailyUpdateRemoteParticipantsFrame is no longer deprecated and is now queued with audio like other transport frames.
    (PR #3719)

  • Bumped Pillow dependency upper bound from <12 to <13 to allow Pillow 12.x.
    (PR #3728)

  • Moved STT keepalive mechanism from WebsocketSTTService to the STTService base class, allowing any STT service (not just websocket-based ones) to use idle-connection keepalive via the keepalive_timeout and keepalive_interval parameters.
    (PR #3730)

  • Improved audio context management in AudioContextTTSService by moving context ID tracking to the base class and adding reuse_context_id_within_turn parameter to control concurrent TTS request handling.

    • Added helper methods: has_active_audio_context(), get_active_audio_context_id(), remove_active_audio_context(), reset_active_audio_context()
    • Simplified Cartesia, ElevenLabs, Inworld, Rime, AsyncAI, and Gradium TTS implementations by removing duplicate context management code
      (PR #3732)
  • UserIdleController is now always created with a default timeout of 0 (disabled). The user_idle_timeout parameter changed from Optional[float] = None to float = 0 in UserTurnProcessor, LLMUserAggregatorParams, and UserIdleController.
    (PR #3748)

  • Changed the version specifier from >=0.2.8 to ~=0.2.8 for the speechmatics-voice package to ensure compatibility with future patch versions.
    (PR #3761)

  • Updated InworldTTSService and InworldHttpTTSService to use the ASYNC timestamp transport strategy by default.
    (PR #3765)

  • Added start_time and end_time parameters to start_ttfb_metrics(), stop_ttfb_metrics(), start_processing_metrics(), and stop_processing_metrics() in FrameProcessor and FrameProcessorMetrics, allowing custom timestamps for metrics measurement. STTService now uses these instead of custom TTFB tracking.
    (PR #3776)

  • Updated default Anthropic model from claude-sonnet-4-5-20250929 to claude-sonnet-4-6.
    (PR #3792)

Deprecated

  • Deprecated unused Traceable, @traceable, @traced, and AttachmentStrategy in pipecat.utils.tracing.class_decorators. This module will be removed in a future release.
    (PR #3733)

Fixed

  • Fixed a race condition where RTVIObserver could send messages before the DailyTransport join completed. Outbound messages are now queued and delivered after the transport is ready.
    (PR #3615)

  • Fixed async generator cleanup in OpenAI LLM streaming to prevent AttributeError with uvloop on Python 3.12+ (MagicStack/uvloop#699).
    (PR #3698)

  • Fixed SmallWebRTCTransport input audio resampling to properly handle all sample rates, including 8kHz audio.
    (PR #3713)

  • Fixed a race condition in RTVIObserver where bot output messages could be sent before the bot-started-speaking event.
    (PR #3718)

  • Fixed Grok Realtime session.updated event parsing failure caused by the API returning prefixed voice names (e.g. "human_Ara" instead of "Ara").
    (PR #3720)

  • Fixed context ID reuse issue in ElevenLabsTTSService, InworldTTSService, RimeTTSService, CartesiaTTSService, AsyncAITTSService, and PlayHTTTSService. Services now properly reuse the same context ID across multiple run_tts() invocations within a single LLM turn, preventing context tracking issues and incorrect lifecycle signaling.
    (PR #3729)

  • Fixed word timestamp interleaving issue in ElevenLabsTTSService when processing multiple sentences within a single LLM turn.
    (PR #3729)

  • Fixed tracing service decorators executing the wrapped function twice when the function itself raised an exception (e.g., LLM rate limit, TTS timeout).
    (PR #3735)

  • Fixed LLMUserAggregator broadcasting mute events before StartFrame reaches downstream processors.
    (PR #3737)

  • Fixed UserIdleController false idle triggers caused by gaps between user and bot activity frames. The idle timer now starts only after BotStoppedSpeakingFrame and is suppressed during active user turns and function calls.
    (PR #3744)

  • Fixed incorrect sample_rate assignment in TavusInputTransport._on_participant_audio_data (was using audio.audio_frames instead of audio.sample_rate).
    (PR #3768)

  • Fixed RTVIObserver not processing upstream-only frames. Previously, all upstream frames were filtered out to avoid duplicate messages from broadcasted frames. Now only upstream copies of broadcasted frames are skipped.
    (PR #3774)

  • Fixed mutable default arguments in LLMContextAggregatorPair.__init__() that could cause shared state across instances.
    (PR #3782)
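
This is the classic Python pitfall: a mutable default is evaluated once at function definition time and then shared across every call. A minimal reproduction (names are illustrative, not pipecat API):

```python
def append_shared(item, bucket=[]):   # BUG: one list shared by every call
    bucket.append(item)
    return bucket

def append_fresh(item, bucket=None):  # Fix: create a new list per call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

a = append_shared(1)
b = append_shared(2)   # a and b are the SAME list: [1, 2]
c = append_fresh(1)    # [1]
d = append_fresh(2)    # [2], independent of c
```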

  • Fixed DeepgramSageMakerSTTService to properly track finalize lifecycle using request_finalize() / confirm_finalize() and use is_final (instead of is_final and speech_final) for final transcription detection, matching DeepgramSTTService behavior.
    (PR #3784)

  • Fixed a race condition in AudioContextTTSService where the audio context could time out between consecutive TTS requests within the same turn, causing audio to be discarded.
    (PR #3787)

  • Fixed push_interruption_task_frame_and_wait() hanging indefinitely when the InterruptionFrame does not reach the pipeline sink within the timeout. Added a timeout keyword argument to customize the wait duration.
    (PR #3789)

Read more

v0.0.102

11 Feb 02:40
640940a


Added

  • Added ResembleAITTSService for text-to-speech using Resemble AI's streaming WebSocket API with word-level timestamps and jitter buffering for smooth audio playback.
    (PR #3134)

  • Added UserBotLatencyObserver for tracking user-to-bot response latency. When tracing is enabled, latency measurements are automatically recorded as turn.user_bot_latency_seconds attributes on OpenTelemetry turn spans.
    (PR #3355)

  • Added append_to_context parameter to TTSSpeakFrame for conditional LLM context addition.

    • Allows fine-grained control over whether text should be added to conversation context
    • Defaults to True to maintain backward compatibility
      (PR #3584)
  • Added TTS context tracking system with context_id field to trace audio generation through the pipeline.

    • TTSAudioRawFrame, TTSStartedFrame, TTSStoppedFrame now include context_id
    • AggregatedTextFrame and TTSTextFrame now include context_id
    • Enables tracking which TTS request generated specific audio chunks
      (PR #3584)
  • Added support for Inworld TTS Websocket Auto Mode for improved latency
    (PR #3593)

  • Added new frames for context summarization: LLMContextSummaryRequestFrame and LLMContextSummaryResultFrame.
    (PR #3621)

  • Added context summarization feature to automatically compress conversation history when conversation length limits (by token or message count) are reached, enabling efficient long-running conversations.

    • Configure via enable_context_summarization=True in LLMAssistantAggregatorParams
    • Customize behavior with LLMContextSummarizationConfig (max tokens, thresholds, etc.)
    • Automatically preserves incomplete function call sequences during summarization
    • See new examples:
      examples/foundational/54-context-summarization-openai.py and
      examples/foundational/54a-context-summarization-google.py
      (PR #3621)
  • Added RTVI function call lifecycle events (llm-function-call-started, llm-function-call-in-progress, llm-function-call-stopped) with configurable security levels via RTVIObserverParams.function_call_report_level. Supports per-function control over what information is exposed (DISABLED, NONE, NAME, or FULL).
    (PR #3630)

  • Added RequestMetadataFrame and metadata handling for ServiceSwitcher to ensure STT services correctly emit STTMetadataFrame when switching between services. Only the active service's metadata is propagated downstream, switching services triggers the newly active service to re-emit its metadata, and proper frame ordering is maintained at startup.
    (PR #3637)

  • Added STTMetadataFrame to broadcast STT service latency information at pipeline start.

    • STT services broadcast P99 time-to-final-segment (ttfs_p99_latency) to downstream processors
    • Turn stop strategies automatically configure their STT timeout from this metadata
    • Developers can override ttfs_p99_latency via constructor argument for custom deployments
    • Added measured P99 values for STT providers.
    • See stt-benchmark to measure latency for your configuration
      (PR #3637)
  • Added support for is_sandbox parameter in LiveAvatarNewSessionRequest to enable sandbox mode for HeyGen LiveAvatar sessions.
    (PR #3653)

  • Added support for video_settings parameter in LiveAvatarNewSessionRequest to configure video encoding (H264/VP8) and quality levels.
    (PR #3653)

  • Added OpenAIRealtimeSTTService for real-time streaming speech-to-text using OpenAI's Realtime API WebSocket transcription sessions. Supports local VAD and server-side VAD modes, noise reduction, and automatic reconnection.
    (PR #3656)

  • Added bulbul:v3-beta TTS model support for Sarvam AI with temperature control and 25 new speaker voices.
    (PR #3671)

  • Added saaras:v3 STT model support for Sarvam AI with new mode parameter (transcribe, translate, verbatim, translit, codemix) and prompt support.
    (PR #3671)

  • Added new OpenAI TTS voice options marin and cedar.
    (PR #3682)

  • Added UserMuteStartedFrame and UserMuteStoppedFrame system frames, and corresponding user-mute-started / user-mute-stopped RTVI messages, so clients can observe when mute strategies activate or deactivate.
    (PR #3687)

Changed

  • Updated all 30+ TTS service implementations to support context tracking with context_id.

    • Services now generate and propagate context IDs through TTS frames
    • Enables end-to-end tracing of TTS requests through the pipeline
      (PR #3584)
  • ⚠️ TTSService.run_tts() now requires a context_id parameter for context tracking.

    • Custom TTS service implementations must update their run_tts() signature
    • Before: async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
    • After: async def run_tts(self, text: str, context_id: str) -> AsyncGenerator[Frame, None]:
      (PR #3584)
  • Simplified context aggregators to use frame.append_to_context flag instead of tracking internal state.

    • Cleaner logic in LLMResponseAggregator and LLMResponseUniversalAggregator
    • More consistent behavior across aggregator implementations
      (PR #3584)
  • Updated timestamps to be cumulative within an agent turn, using the flushCompleted message as the indication of when timestamps from the server reset to 0.
    (PR #3593)

  • Changed KokoroTTSService to use kokoro-onnx instead of kokoro as the underlying TTS engine.
    (PR #3612)

  • Improved user turn stop timing in TranscriptionUserTurnStopStrategy and TurnAnalyzerUserTurnStopStrategy.

    • Timeout now starts on VADUserStoppedSpeakingFrame for tighter, more predictable timing
    • Added support for finalized transcripts (TranscriptionFrame.finalized=True) to trigger earlier
    • Added fallback timeout for edge cases where transcripts arrive without VAD events
    • Removed InterimTranscriptionFrame handling (no longer affects timing)
      (PR #3637)
  • Improved the accuracy of the UserBotLatencyObserver and UserBotLatencyLogObserver by measuring from the time when the user actually starts speaking.
    (PR #3637)

  • ⚠️ Renamed timeout parameter to user_speech_timeout in TranscriptionUserTurnStopStrategy.
    (PR #3637)

  • Updated the VADUserStartedSpeakingFrame to include start_secs and timestamp and VADUserStoppedSpeakingFrame to include stop_secs and timestamp, removing the need to separately handle the SpeechControlParamsFrame for VADParams values.
    (PR #3637)

  • ⚠️ Renamed TranscriptionUserTurnStopStrategy to SpeechTimeoutUserTurnStopStrategy. The old name is deprecated and will be removed in a future release.
    (PR #3637)

  • AssemblyAISTTService now automatically configures optimal settings for manual turn detection when vad_force_turn_endpoint=True. This sets end_of_turn_confidence_threshold=1.0 and max_turn_silence=2000 by default, which disables model-based turn detection and reduces latency by relying on external VAD for turn endpoints. Warnings are logged if conflicting settings are detected.
    (PR #3644)

  • Upgraded the pipecat-ai-small-webrtc-prebuilt package to v2.1.0.
    (PR #3652)

  • Changed default session mode from "CUSTOM" to "LITE" in HeyGen LiveAvatar integration, with VP8 as the default video encoding.
    (PR #3653)

  • ⚠️ The VADParams stop_secs default is changing from 0.8 seconds to 0.2 seconds. This change both simplifies the developer experience and improves the performance of STT services. With a shorter stop_secs value, STT services using a local VAD can finalize sooner, resulting in faster transcription.

    • SpeechTimeoutUserTurnStopStrategy: control how long to wait for additional user speech using user_speech_timeout (default: 0.6 sec).
    • TurnAnalyzerUserTurnStopStrategy: the turn analyzer automatically adjusts the user wait time based on the audio input.
      (PR #3659)
  • Moved interruption wait event from per-processor instance state to InterruptionFrame itself. Added InterruptionFrame.complete() to signal when the interruption has fully traversed the pipeline. Custom processors that block or consume an InterruptionFrame before it reaches the pipeline sink must call frame.complete() to avoid stalling `push_interruption_...

Read more

v0.0.101

31 Jan 07:01
7853e5c


Added

  • Additions for AICFilter and AICVADAnalyzer:

    • Added model downloading support to AICFilter with model_id and model_download_dir parameters.
    • Added model_path parameter to AICFilter for loading local .aicmodel files.
    • Added unit tests for AICFilter and AICVADAnalyzer.
      (PR #3408)
  • Added handling for server_content.interrupted signal in the Gemini Live service for faster interruption response in the case where there isn't already turn tracking in the pipeline, e.g. local VAD + context aggregators. When there is already turn tracking in the pipeline, the additional interruption does no harm.
    (PR #3429)

  • Added new GenesysFrameSerializer for the Genesys AudioHook WebSocket protocol, enabling bidirectional audio streaming between Pipecat pipelines and Genesys Cloud contact center.
    (PR #3500)

  • Added reached_upstream_types and reached_downstream_types read-only properties to PipelineTask for inspecting current frame filters.
    (PR #3510)

  • Added add_reached_upstream_filter() and add_reached_downstream_filter() methods to PipelineTask for appending frame types.
    (PR #3510)

  • Added UserTurnCompletionLLMServiceMixin for LLM services to detect and filter incomplete user turns. When enabled via filter_incomplete_user_turns in LLMUserAggregatorParams, the LLM outputs a turn completion marker at the start of each response: ✓ (complete), ○ (incomplete short), or ◐ (incomplete long). Incomplete turns are suppressed, and configurable timeouts automatically re-prompt the user.
    (PR #3518)
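
On the receiving side, stripping the leading completion marker can be sketched like this (split_turn_marker is a hypothetical helper for illustration, not pipecat API):

```python
MARKERS = {
    "✓": "complete",
    "○": "incomplete_short",
    "◐": "incomplete_long",
}

def split_turn_marker(response: str):
    # Hypothetical helper: read a leading completion marker, if present,
    # and return (status, remaining text).
    if response and response[0] in MARKERS:
        return MARKERS[response[0]], response[1:].lstrip()
    return None, response

status, text = split_turn_marker("✓ Sure, I can help with that.")
```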

  • Added FrameProcessor.broadcast_frame_instance(frame) method to broadcast a frame instance by extracting its fields and creating new instances for each direction.
    (PR #3519)

  • PipelineTask now automatically adds RTVIProcessor and registers RTVIObserver when enable_rtvi=True (default), simplifying pipeline setup.
    (PR #3519)

  • Added RTVIProcessor.create_rtvi_observer() factory method for creating RTVI observers.
    (PR #3519)

  • Added video_out_codec parameter to TransportParams allowing configuration of the preferred video codec (e.g., "VP8", "H264", "H265") for video output in DailyTransport.
    (PR #3520)

  • Added location parameter to Google TTS services (GoogleHttpTTSService, GoogleTTSService, GeminiTTSService) for regional endpoint support.
    (PR #3523)

  • Added a new PIPECAT_SMART_TURN_LOG_DATA environment variable, which causes Smart Turn input data to be saved to disk.
    (PR #3525)

  • Added result_callback parameter to UserImageRequestFrame to support deferred function call results.
    (PR #3571)

  • Added function_call_timeout_secs parameter to LLMService to configure timeout for deferred function calls (defaults to 10.0 seconds).
    (PR #3571)

  • Added vad_analyzer parameter to LLMUserAggregatorParams. VAD analysis is now handled inside the LLMUserAggregator rather than in the transport, keeping voice activity detection closer to where it is consumed. The vad_analyzer on BaseInputTransport is now deprecated.

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    (PR #3583)

  • Added VADProcessor for detecting speech in audio streams within a pipeline. Pushes VADUserStartedSpeakingFrame, VADUserStoppedSpeakingFrame, and UserSpeakingFrame downstream based on VAD state changes.
    (PR #3583)

  • Added VADController for managing voice activity detection state and emitting speech events independently of transport or pipeline processors.
    (PR #3583)

  • Added local PiperTTSService for offline text-to-speech using Piper voice models. The existing HTTP-based service has been renamed to PiperHttpTTSService.
    (PR #3585)

  • main() in pipecat.runner.run now accepts an optional argparse.ArgumentParser, allowing bots to define custom CLI arguments accessible via runner_args.cli_args.
    (PR #3590)
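    A minimal sketch of defining such custom arguments (the --persona flag is invented for illustration; only the argparse side is shown here, since the entry states the parser is passed to main() and the values surface on runner_args.cli_args):

```python
import argparse

# Build a parser with bot-specific flags. Per the entry above, this
# parser can be handed to pipecat.runner.run.main(), and the parsed
# values then appear on runner_args.cli_args. The --persona flag is
# purely an example.
parser = argparse.ArgumentParser(description="My bot")
parser.add_argument("--persona", default="friendly", help="Bot persona")

args = parser.parse_args(["--persona", "formal"])
# args.persona == "formal"
```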

  • Added KokoroTTSService for local text-to-speech synthesis using the Kokoro-82M model.
    (PR #3595)

Changed

  • Updated AICFilter and AICVADAnalyzer to use aic-sdk ~= 2.0.1.
    (PR #3408)

  • Improved the STT TTFB (Time To First Byte) measurement, which now reports the delay between when the user stops speaking and when the final transcription is received. Unlike traditional TTFB, which is measured from a discrete request, STT services receive continuous audio input, so we measure from speech end to final transcript; this captures the latency that matters for voice AI applications. In support of this change, added a finalized field to TranscriptionFrame to indicate when a transcript is the final result for an utterance.
    (PR #3495)
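    The measurement described above can be sketched as a tiny standalone meter (illustrative only, not Pipecat's implementation; interim transcripts are ignored and only a finalized one closes the measurement):

```python
import time
from typing import Optional

class SttTtfbMeter:
    """Sketch: TTFB = delay between the user stopping speaking and
    the final transcript arriving. Not Pipecat's actual code."""

    def __init__(self):
        self._speech_end: Optional[float] = None

    def on_user_stopped_speaking(self, now=None):
        self._speech_end = time.monotonic() if now is None else now

    def on_transcription(self, finalized, now=None):
        if not finalized or self._speech_end is None:
            return None  # interim results don't close the measurement
        now = time.monotonic() if now is None else now
        ttfb, self._speech_end = now - self._speech_end, None
        return ttfb

meter = SttTtfbMeter()
meter.on_user_stopped_speaking(now=10.0)
meter.on_transcription(finalized=False, now=10.2)  # interim: ignored
latency = meter.on_transcription(finalized=True, now=10.5)  # 0.5s
```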

  • SarvamSTTService now defaults vad_signals and high_vad_sensitivity to None (omitted from connection parameters), improving latency by ~300ms compared to the previous defaults.
    (PR #3495)

  • Changed frame filter storage from tuples to sets in PipelineTask.
    (PR #3510)

  • Changed default Inworld TTS model from inworld-tts-1 to inworld-tts-1.5-max.
    (PR #3531)

  • FrameSerializer now subclasses from BaseObject to enable event support.
    (PR #3560)

  • Added support for TTFS in SpeechmaticsSTTService and set the default mode to EXTERNAL to support Pipecat-controlled VAD.

    • Changed dependency to speechmatics-voice[smart]>=0.2.8
      (PR #3562)
  • ⚠️ Changed function call handling to use timeout-based completion instead of immediate callback execution.

    • Function calls that defer their results (e.g., UserImageRequestFrame) now use a timeout mechanism
    • The result_callback is invoked automatically when the deferred operation completes or after timeout
    • This change affects examples using UserImageRequestFrame - the result_callback should now be passed to the frame instead of being called immediately
      (PR #3571)
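    The timeout mechanism can be pictured roughly like this (a standalone asyncio sketch; the function names and the fallback payload are illustrative, not Pipecat internals):

```python
import asyncio

async def run_deferred_call(result_future, timeout=10.0):
    """Wait for a deferred function-call result; on timeout, complete
    with a fallback. Mirrors the behavior described above, where the
    result callback fires either when the deferred operation completes
    or after the timeout elapses."""
    try:
        return await asyncio.wait_for(asyncio.shield(result_future), timeout)
    except asyncio.TimeoutError:
        return {"error": "function call timed out"}

async def demo():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    # Simulate the deferred operation (e.g., an image capture)
    # completing shortly after the call is made.
    loop.call_later(0.01, fut.set_result, {"image": "frame.jpg"})
    return await run_deferred_call(fut, timeout=1.0)

result = asyncio.run(demo())
# result == {"image": "frame.jpg"}
```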
  • Pipecat runner now uses DAILY_ROOM_URL instead of DAILY_SAMPLE_ROOM_URL.
    (PR #3582)

  • Updates to GradiumSTTService:

    • Now flushes pending transcriptions when VAD detects the user stopped speaking, improving response latency.
    • GradiumSTTService now supports InputParams for configuring language and delay_in_frames settings.
      (PR #3587)

Deprecated

  • ⚠️ Deprecated vad_analyzer parameter on BaseInputTransport. Pass vad_analyzer to LLMUserAggregatorParams instead or use VADProcessor in the pipeline.
    (PR #3583)

Removed

  • Removed deprecated AICFilter parameters: enhancement_level, voice_gain, noise_gate_enable.
    (PR #3408)

Fixed

  • Fixed an issue where OpenRouterLLMService, when used with a Gemini model, wouldn't handle multiple "system" messages as expected (and as GoogleLLMService does), namely by converting subsequent ones into "user" messages. Instead, the latest "system" message would overwrite the previous ones.
    (PR #3406)

  • Transports now properly broadcast InputTransportMessageFrame frames both upstream and downstream instead of only pushing downstream.
    (PR #3519)

  • Fixed FrameProcessor.broadcast_frame() to deep copy kwargs, preventing shared mutable references between the downstream and upstream frame instances.
    (PR #3519)

  • Fixed OpenAI LLM services to emit ErrorFrame on completion timeout, enabling proper error handling and LLMSwitcher failover.
    (PR #3529)

  • Fixed a logging issue where non-ASCII characters (e.g., Japanese, Chinese, etc.) were being unnecessarily escaped to Unicode sequences when a function call occurred.
    (PR #3536)

  • Fixed audio track synchronization inside the AudioBufferProcessor, resolving timing issues where silence and audio were misaligned between user and bot buffers.
    (PR #3541)

  • Fixed race condition in OpenAIRealtimeBetaLLMService that could cause an error when truncating the conversation....


v0.0.100

21 Jan 03:37
768d395

Added

  • Added Hathora service to support Hathora-hosted TTS and STT models (only non-streaming)
    (PR #3169)

  • Added CambTTSService, using Camb.ai's TTS integration with MARS models (mars-flash, mars-pro, mars-instruct) for high-quality text-to-speech synthesis.
    (PR #3349)

  • Added the additional_headers param to WebsocketClientParams, allowing WebsocketClientTransport to send custom headers on connect, for cases such as authentication.
    (PR #3461)

  • Added UserIdleController for detecting user idle state, integrated into LLMUserAggregator and UserTurnProcessor via optional user_idle_timeout parameter. Emits on_user_turn_idle event for application-level handling. Deprecated UserIdleProcessor in favor of the new compositional approach.
    (PR #3482)

  • Added on_user_mute_started and on_user_mute_stopped event handlers to LLMUserAggregator for tracking user mute state changes.
    (PR #3490)

Changed

  • Enhanced interruption handling in AsyncAITTSService by supporting multi-context WebSocket sessions for more robust context management.
    (PR #3287)

  • Throttled UserSpeakingFrame to broadcast at most once every 200ms instead of on every audio chunk, reducing frame processing overhead during user speech.
    (PR #3483)
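    The throttling pattern is simple to sketch on its own (illustrative, not Pipecat's implementation): an event passes through only if at least the interval has elapsed since the last one that did.

```python
class Throttle:
    """Sketch of the throttling described above: let events through at
    most once per interval (200 ms here). Not Pipecat code."""

    def __init__(self, interval=0.2):
        self._interval = interval
        self._last = None

    def allow(self, now):
        if self._last is None or now - self._last >= self._interval:
            self._last = now
            return True
        return False

throttle = Throttle(interval=0.2)
# Timestamps of incoming audio-chunk events (seconds):
emitted = [t for t in (0.0, 0.05, 0.1, 0.21, 0.3, 0.45) if throttle.allow(t)]
# emitted == [0.0, 0.21, 0.45]
```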

Deprecated

  • For consistency with other package names, deprecated pipecat.turns.mute (introduced in Pipecat 0.0.99) in favor of pipecat.turns.user_mute.
    (PR #3479)

Fixed

  • Corrected TTFB metric calculation in AsyncAIHttpTTSService.
    (PR #3287)

  • Fixed an issue where the "bot-llm-text" RTVI event would not fire for realtime (speech-to-speech) services:

    • AWSNovaSonicLLMService
    • GeminiLiveLLMService
    • OpenAIRealtimeLLMService
    • GrokRealtimeLLMService

    The issue was that these services weren't pushing LLMTextFrames. Now they do.
    (PR #3446)

  • Fixed an issue where on_user_turn_stop_timeout could fire while a user is talking when using ExternalUserTurnStrategies.
    (PR #3454)

  • Fixed an issue where user turn start strategies were not being reset after a user turn started, causing incorrect strategy behavior.
    (PR #3455)

  • Fixed MinWordsUserTurnStartStrategy to not aggregate transcriptions, preventing incorrect turn starts when words are spoken with pauses between them.
    (PR #3462)

  • Fixed an issue where Grok Realtime would error out when running with SmallWebRTC transport.
    (PR #3480)

  • Fixed a Mem0MemoryService issue where passing async_mode: true was causing an error. See https://docs.mem0.ai/platform/features/async-mode-default-change.
    (PR #3484)

  • Fixed AWSNovaSonicLLMService.reset_conversation(), which would previously error out. Now it successfully reconnects and "rehydrates" from the context object.
    (PR #3486)

  • Fixed AzureTTSService transcript formatting issues:

    • Punctuation now appears without extra spaces (e.g., "Hello!" instead of "Hello !")
    • CJK languages (Chinese, Japanese, Korean) no longer have unwanted spaces between characters
      (PR #3489)
  • Fixed an issue where UninterruptibleFrame frames would not be preserved in some cases.
    (PR #3494)

  • Fixed memory leak in LiveKitTransport when video_in_enabled is False.
    (PR #3499)

  • Fixed an issue in AIService where unhandled exceptions in start(), stop(), or cancel() implementations would prevent process_frame() from continuing, and therefore StartFrame, EndFrame, or CancelFrame from being pushed downstream, causing the pipeline to not start or stop properly.
    (PR #3503)

  • Moved NVIDIATTSService and NVIDIASTTService client initialization from constructor to start() for better error handling.
    (PR #3504)

  • Optimized NVIDIATTSService to process incoming audio frames immediately.
    (PR #3509)

  • Optimized NVIDIASTTService by removing unnecessary queue and task.
    (PR #3509)

  • Fixed a CambTTSService issue where client was being initialized in the constructor which wouldn't allow for proper Pipeline error handling.
    (PR #3511)