How should we stream LLM tokens to the UI? Evaluating three designs for #730 #752
Alternative A — Reuse the existing SSE channel
Alternative B — New
Alternative C — Dedicated token-bus outside the lane graph

Streaming bus parallel to the pipeline, indexed by `runId`+`nodeId`:

```mermaid
flowchart TB
  subgraph Engine
    Pipe["Pipeline (C++)"]
    Bus[("TokenBus<br/>runId+nodeId<br/>ring buffer")]
  end
  LLM["llm_* providers"] -->|"emit chunk (seq, text)"| Bus
  LLM -->|"lane output (full_text)"| Pipe
  Pipe --> Down[downstream nodes]
  Bus -->|"websocket stream (runId)"| Client
  Client --> UI
```
```mermaid
sequenceDiagram
  participant UI
  participant Client
  participant WS as Engine stream WS
  participant Engine as Engine pipeline
  participant Bus as TokenBus
  participant Node
  participant Provider
  UI->>Client: subscribeStream(runId)
  Client->>WS: open WS stream (runId)
  Node->>Provider: chat (stream)
  loop deltas
    Provider-->>Node: delta
    Node->>Bus: publish(runId, nodeId, seq, delta)
    Bus->>WS: push
    WS->>Client: frame
    Client->>UI: append
  end
  Node-->>Engine: lane output (canonical)
  Bus->>WS: end(nodeId)
  WS->>Client: end
  Client->>UI: done
```
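To make the bus concrete, here is a minimal TypeScript sketch of what a TokenBus could look like, assuming a fixed-capacity ring buffer keyed by `(runId, nodeId)` as in the diagram. The class and its API are illustrative only; the real bus would live in the C++ engine.

```typescript
// Minimal TokenBus sketch (illustrative only -- the real bus would live in
// the C++ engine). Chunks are kept in a fixed-capacity ring buffer per
// (runId, nodeId) so a late-joining subscriber can replay recent deltas.
type Chunk = { runId: string; nodeId: string; seq: number; text: string };

class TokenBus {
  private buffers = new Map<string, Chunk[]>();                     // key: `${runId}/${nodeId}`
  private subscribers = new Map<string, Set<(c: Chunk) => void>>(); // key: runId

  constructor(private capacity = 1024) {}

  publish(chunk: Chunk): void {
    const key = `${chunk.runId}/${chunk.nodeId}`;
    const buf = this.buffers.get(key) ?? [];
    buf.push(chunk);
    if (buf.length > this.capacity) buf.shift(); // drop oldest on overflow
    this.buffers.set(key, buf);
    for (const fn of this.subscribers.get(chunk.runId) ?? []) fn(chunk);
  }

  // Replays buffered chunks, then delivers live ones; returns an unsubscribe fn.
  subscribe(runId: string, onChunk: (c: Chunk) => void): () => void {
    for (const [key, buf] of this.buffers) {
      if (key.startsWith(`${runId}/`)) buf.forEach((c) => onChunk(c));
    }
    const subs = this.subscribers.get(runId) ?? new Set();
    subs.add(onChunk);
    this.subscribers.set(runId, subs);
    return () => subs.delete(onChunk);
  }
}
```

The ring buffer is the design point that distinguishes C from A and B: a WebSocket subscriber that connects mid-run can replay recent deltas instead of missing the start of the stream.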
Event shape (shared by A and B):

```jsonc
{
  "type": "llm.chunk",       // discriminator
  "runId": "…",
  "nodeId": "llm_anthropic#7",
  "seq": 42,                 // monotonic per (runId, nodeId)
  "text": "fragment",        // delta, not cumulative
  "finishReason": null,      // null | "stop" | "length" | "error"
  "ts": 1714857600123
}
```

Rules: `seq` is strictly monotonic per `(runId, nodeId)`; `text` carries only the delta, never the cumulative text; `finishReason` stays `null` until the final chunk.
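On the client SDK side, the same shape could be expressed as a TypeScript type, with a small accumulator enforcing the monotonic-`seq` rule. This is a sketch; the type and class names are mine, not anything in the codebase.

```typescript
// Illustrative typing of the shared A/B event shape, plus a client-side
// accumulator that enforces the monotonic-seq rule per (runId, nodeId).
interface LlmChunkEvent {
  type: "llm.chunk";
  runId: string;
  nodeId: string;
  seq: number;                                 // monotonic per (runId, nodeId)
  text: string;                                // delta, not cumulative
  finishReason: null | "stop" | "length" | "error";
  ts: number;                                  // epoch millis
}

class ChunkAccumulator {
  private lastSeq = new Map<string, number>(); // key: `${runId}/${nodeId}`
  private texts = new Map<string, string>();

  // Returns the cumulative text so far, or null if the chunk is stale/duplicate.
  accept(ev: LlmChunkEvent): string | null {
    const key = `${ev.runId}/${ev.nodeId}`;
    const last = this.lastSeq.get(key) ?? -1;
    if (ev.seq <= last) return null;           // drop out-of-order or duplicate
    this.lastSeq.set(key, ev.seq);
    const text = (this.texts.get(key) ?? "") + ev.text;
    this.texts.set(key, text);
    return text;
  }
}
```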
Alternative C is the ideal way to go, but it is time-consuming and too much of a commitment. That said, I do not know how soon Rocket Ride wants to ship this feature.
I think option A is the best option. SSE was designed to be low impact using the websocket interface. It is already assigned per pipeline, so SSE events work correctly even if you have 100 concurrent requests in flight. As for backpressure, I don't think it will be much of a problem, as the SSE "packets" will be relatively small and sent at the token-generation rate, which is not that fast. Currently, we are sending whole video frames back to the client via SSE, so text will not be a problem. Keep in mind that even though it is called SSE and is fundamentally doing the same thing as HTTP SSE, the mechanism for how it ends up back on the client is very different. As for ordering, we are pretty much guaranteed the output arrives in the order the node output the data. So the node may have to do interlocks to ensure the SSE calls go out in the correct order, but once output, they will be received exactly as sent.
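The interlock mentioned above could be as simple as a per-node promise chain that serializes the sends. A sketch, assuming `sendSSE` returns a promise (its real signature may differ):

```typescript
// Sketch of the "interlock" idea: serialize sendSSE calls through a promise
// chain so deltas leave the node in generation order even when callers are
// async. sendSSE's real signature is an assumption for illustration.
declare function sendSSE(event: string, payload: unknown): Promise<void>;

function makeSerializedSender() {
  let tail: Promise<void> = Promise.resolve();
  return (event: string, payload: unknown): Promise<void> => {
    const result = tail.then(() => sendSSE(event, payload));
    tail = result.catch(() => {}); // keep the queue alive if one send fails
    return result;
  };
}

// Usage inside a node: every call is queued behind the previous one.
// const send = makeSerializedSender();
// for await (const delta of providerStream) send("llm.chunk", { text: delta });
```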
Today, the `llm_*` nodes (16+ providers under `nodes/src/nodes/llm_*`) call the provider synchronously and only emit their output once the full response is available. Only `agent_*` nodes emit partial events, and they do so via `sendSSE('thinking', ...)` — a channel designed for low-frequency state signals, not per-token streams.

This document presents three architectural alternatives, their diagrams, and the tradeoffs of each. The intent is to pick one before starting the cross-module implementation.
2. Current architecture (before)
```mermaid
sequenceDiagram
  participant UI as chat-ui / vscode
  participant Client as client SDK
  participant Engine as Engine C++ Pipeline
  participant Node as llm_anthropic
  participant Provider as Anthropic API
  UI->>Client: ask question
  Client->>Engine: submit pipeline run
  Engine->>Node: writeQuestions
  Node->>Provider: chat invoke (blocking)
  Note over Provider: model generates<br/>tokens internally
  Provider-->>Node: full response
  Node-->>Engine: emit answers lane (single write)
  Engine-->>Client: notification lane output
  Client-->>UI: render message (atomic)
```

Key property: the UI receives nothing until the node finishes. Only `agent_*` emits `sendSSE('thinking', ...)` during execution, but those events are state signals, not tokens.
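For contrast, here is a hypothetical sketch of the blocking call today versus the streaming variant any of the three alternatives would need in the `llm_*` nodes. The `provider.chat` / `provider.chatStream` names are invented for illustration and do not correspond to any real provider API in the repo.

```typescript
// Hypothetical shape of the current blocking call vs. the streaming variant
// the alternatives above would require. Provider names are invented.
declare const provider: {
  chat(prompt: string): Promise<string>;             // blocking: full text
  chatStream(prompt: string): AsyncIterable<string>; // streaming: deltas
};

// Today: the lane output is written once, after the provider returns.
async function runBlocking(prompt: string): Promise<string> {
  return provider.chat(prompt);                      // UI sees nothing until here
}

// After: deltas are published as they arrive; the lane still gets full text.
async function runStreaming(
  prompt: string,
  publish: (delta: string) => void
): Promise<string> {
  let full = "";
  for await (const delta of provider.chatStream(prompt)) {
    publish(delta);                                  // per-token UI update
    full += delta;
  }
  return full;                                       // canonical lane output
}
```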