Stream LLM output incrementally to the UI #730

@dsapandora

Description

Problem Statement

Today, when the product calls an LLM and the model produces a long or slow reply, users do not see any of that text until the full response is finished. There is no way in the UI to read the model's output as it is being generated — only the final, complete result is surfaced.

That creates a poor experience for longer answers: the interface looks idle or stuck while work is happening, and users cannot start reading or scanning partial results early, which is what they expect from modern LLM-powered interfaces.

Proposed Solution

Stream LLM output incrementally to the UI so users can read tokens as they are generated, instead of waiting for the full response.

This is a substantial cross-module change, not a localized fix. Any solution will have to address:

  • All LLM nodes under nodes/src/nodes/llm_* (16+ providers: llm_anthropic, llm_openai, llm_openai_api, llm_vertex, llm_gemini, llm_mistral, llm_vision_mistral, llm_perplexity, llm_xai, llm_ollama, llm_vision_ollama, llm_bedrock, llm_deepseek, llm_qwen, llm_gmi_cloud, llm_ibm_watson). Each currently returns a single complete result; each will need to support partial output.
  • A streaming execution pattern for non-agent nodes, which does not exist today. Generic nodes return once at end-of-execution; only agents emit partial state. This is a new node-execution capability (a minimal sketch of one possible pattern appears after the sendSSE note below).
  • The event/notification contract between server and clients (client-typescript, client-python, client-mcp) needs a new event shape for incremental output (chunk identity, ordering, completion, error semantics); one possible shape is sketched right after this list.
  • chat-ui and vscode message rendering, which today renders messages atomically — incremental rendering, reconciliation, and cancellation behavior are new UX work.
  • Engine throughput considerations — partial-output notifications happen at much higher frequency than current event types, so backpressure and ordering guarantees need to be validated.
  • Pipeline contract — a decision is needed about how partial output relates to lane output consumed by downstream nodes, and that decision must be documented in ROCKETRIDE_PIPELINE_RULES.md and ROCKETRIDE_COMPONENT_REFERENCE.md so pipeline authors aren't surprised.
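
To make the contract question in the third bullet concrete, below is a minimal sketch of what an incremental-output event could look like. It is written in Python purely for illustration and assumes the notification payloads are JSON-serializable; the event name and every field are placeholders, not a decided contract.

```python
from typing import Literal, Optional, TypedDict


class LlmOutputChunkEvent(TypedDict):
    """One incremental-output notification (hypothetical shape)."""

    type: Literal["llm_output_chunk"]  # distinguishes it from existing event types
    run_id: str          # identifies the pipeline execution
    node_id: str         # which llm_* node produced the chunk
    seq: int             # monotonically increasing per node; lets clients detect gaps or reordering
    text: str            # the incremental text delta
    done: bool           # True on the final event for this stream
    error: Optional[str]  # set instead of further text if the stream aborts
```

Whatever shape is ultimately chosen, the seq/done/error trio is the important part: clients need enough information to render in order, to know when a message is complete, and to reconcile or discard partial text on failure or cancellation.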

The sendSSE() notification path used today by agents (agent_crewai, agent_deepagent, agent_langchain) is a useful reference point but is not a drop-in solution: it was designed for low-frequency agent state events (thinking, acting), not per-token streams from generic nodes.
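
To make the per-node change concrete, here is a minimal sketch of the streaming pattern for a single provider. It assumes the llm_* nodes are Python and uses the public OpenAI SDK streaming API; emit_partial is a hypothetical callback standing in for whatever notification hook the engine ends up exposing to non-agent nodes (it is not sendSSE), and the model name is illustrative.

```python
import time
from typing import Callable

from openai import OpenAI


def run_llm_openai_streaming(
    prompt: str,
    emit_partial: Callable[[str, int, bool], None],  # (text_delta, seq, done) -- hypothetical hook
    flush_interval_s: float = 0.05,
) -> str:
    """Stream a completion, forwarding coalesced chunks as partial output.

    Deltas are buffered and flushed at most every `flush_interval_s` seconds so
    the notification path is not hit once per token (the backpressure concern
    listed above). The joined text is still returned as the node's final,
    canonical output, matching the "Out of Scope" note on chunk persistence.
    """
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    full_text: list[str] = []
    buffer: list[str] = []
    seq = 0
    last_flush = time.monotonic()

    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content or ""
        if not delta:
            continue
        full_text.append(delta)
        buffer.append(delta)
        if time.monotonic() - last_flush >= flush_interval_s:
            emit_partial("".join(buffer), seq, False)  # partial, not final
            buffer.clear()
            seq += 1
            last_flush = time.monotonic()

    emit_partial("".join(buffer), seq, True)  # final flush marks completion
    return "".join(full_text)
```

Each of the 16+ providers would need an equivalent adaptation against its own SDK (Anthropic, Vertex, Bedrock, Ollama, and so on), which is why the per-provider work dominates the node-side effort.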

Out of Scope

  • Streaming tool-call deltas (tool argument streaming).
  • Persisting chunk-level history — the final node output remains the canonical record.
  • Adding a new HTTP transport endpoint parallel to existing notifications.

Alternatives Considered

No response

Affected Modules

  • server (C++ engine)
  • client-typescript
  • client-python
  • client-mcp
  • nodes (pipeline)
  • ai
  • chat-ui
  • dropper-ui
  • vscode
  • tika

Metadata

Labels

feature (New feature or enhancement)
