How should we stream LLM tokens to the UI? Evaluating three designs for #730 #752
Alternative A — Reuse the existing SSE channel
Alternative B — New
Alternative C — Dedicated token-bus outside the lane graph

Streaming bus parallel to the pipeline, indexed by `runId`+`nodeId`:

```mermaid
flowchart TB
  subgraph Engine
    Pipe["Pipeline (C++)"]
    Bus[("TokenBus<br/>runId+nodeId<br/>ring buffer")]
  end
  LLM["llm_* providers"] -->|"emit chunk (seq, text)"| Bus
  LLM -->|"lane output (full_text)"| Pipe
  Pipe --> Down[downstream nodes]
  Bus -->|"websocket stream (runId)"| Client
  Client --> UI
```
```mermaid
sequenceDiagram
  participant UI
  participant Client
  participant WS as Engine stream WS
  participant Engine as Engine pipeline
  participant Bus as TokenBus
  participant Node
  participant Provider
  UI->>Client: subscribeStream(runId)
  Client->>WS: open WS stream (runId)
  Node->>Provider: chat (stream)
  loop deltas
    Provider-->>Node: delta
    Node->>Bus: publish(runId, nodeId, seq, delta)
    Bus->>WS: push
    WS->>Client: frame
    Client->>UI: append
  end
  Node-->>Engine: lane output (canonical)
  Bus->>WS: end(nodeId)
  WS->>Client: end
  Client->>UI: done
```
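To make the bus concrete, here is a minimal TypeScript sketch of what a TokenBus could look like, assuming a fixed-capacity ring buffer keyed by `(runId, nodeId)` as in the diagram. The class and its API are illustrative only; the real bus would live in the C++ engine.

```typescript
// Minimal TokenBus sketch (illustrative only -- the real bus would live in
// the C++ engine). Chunks are kept in a fixed-capacity ring buffer per
// (runId, nodeId) so a late-joining subscriber can replay recent deltas.
type Chunk = { runId: string; nodeId: string; seq: number; text: string };

class TokenBus {
  private buffers = new Map<string, Chunk[]>();                     // key: `${runId}/${nodeId}`
  private subscribers = new Map<string, Set<(c: Chunk) => void>>(); // key: runId

  constructor(private capacity = 1024) {}

  publish(chunk: Chunk): void {
    const key = `${chunk.runId}/${chunk.nodeId}`;
    const buf = this.buffers.get(key) ?? [];
    buf.push(chunk);
    if (buf.length > this.capacity) buf.shift(); // drop oldest on overflow
    this.buffers.set(key, buf);
    for (const fn of this.subscribers.get(chunk.runId) ?? []) fn(chunk);
  }

  // Replays buffered chunks, then delivers live ones; returns an unsubscribe fn.
  subscribe(runId: string, onChunk: (c: Chunk) => void): () => void {
    for (const [key, buf] of this.buffers) {
      if (key.startsWith(`${runId}/`)) buf.forEach((c) => onChunk(c));
    }
    const subs = this.subscribers.get(runId) ?? new Set();
    subs.add(onChunk);
    this.subscribers.set(runId, subs);
    return () => subs.delete(onChunk);
  }
}
```

The ring buffer is the design point that distinguishes C from A and B: a WebSocket subscriber that connects mid-run can replay recent deltas instead of missing the start of the stream.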
Event shape (shared by A and B):

```jsonc
{
  "type": "llm.chunk",       // discriminator
  "runId": "…",
  "nodeId": "llm_anthropic#7",
  "seq": 42,                 // monotonic per (runId, nodeId)
  "text": "fragment",        // delta, not cumulative
  "finishReason": null,      // null | "stop" | "length" | "error"
  "ts": 1714857600123
}
```

Rules: `seq` is strictly monotonic per `(runId, nodeId)`; `text` carries only the delta, never the cumulative text; `finishReason` stays `null` until the final chunk.
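On the client SDK side, the same shape could be expressed as a TypeScript type, with a small accumulator enforcing the monotonic-`seq` rule. This is a sketch; the type and class names are mine, not anything in the codebase.

```typescript
// Illustrative typing of the shared A/B event shape, plus a client-side
// accumulator that enforces the monotonic-seq rule per (runId, nodeId).
interface LlmChunkEvent {
  type: "llm.chunk";
  runId: string;
  nodeId: string;
  seq: number;                                 // monotonic per (runId, nodeId)
  text: string;                                // delta, not cumulative
  finishReason: null | "stop" | "length" | "error";
  ts: number;                                  // epoch millis
}

class ChunkAccumulator {
  private lastSeq = new Map<string, number>(); // key: `${runId}/${nodeId}`
  private texts = new Map<string, string>();

  // Returns the cumulative text so far, or null if the chunk is stale/duplicate.
  accept(ev: LlmChunkEvent): string | null {
    const key = `${ev.runId}/${ev.nodeId}`;
    const last = this.lastSeq.get(key) ?? -1;
    if (ev.seq <= last) return null;           // drop out-of-order or duplicate
    this.lastSeq.set(key, ev.seq);
    const text = (this.texts.get(key) ?? "") + ev.text;
    this.texts.set(key, text);
    return text;
  }
}
```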
Alternative C is the ideal way to go, but it is time-consuming and too much of a commitment. That said, I do not know how soon Rocket Ride wants to ship this feature.
I think option A is the best option. SSE was designed to be low impact using the websocket interface. It is already assigned per pipeline, so SSE events work correctly even if you have 100 concurrent requests in flight. As for backpressure, I don't think it will be much of a problem, as the SSE "packets" will be relatively small and sent at the token-generation rate, which is not that fast. Currently, we are sending whole video frames back to the client via SSE, so text will not be a problem. Keep in mind that even though it is called SSE and is fundamentally doing the same thing as HTTP SSE, the mechanism for how it ends up back on the client is very different. As for ordering, we are pretty much guaranteed the output arrives in the order the node output the data. So the node may have to do interlocks to ensure the SSE calls go out in the correct order, but once output, they will be received exactly as sent.
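The interlock mentioned above could be as simple as a per-node promise chain that serializes the sends. A sketch, assuming `sendSSE` returns a promise (its real signature may differ):

```typescript
// Sketch of the "interlock" idea: serialize sendSSE calls through a promise
// chain so deltas leave the node in generation order even when callers are
// async. sendSSE's real signature is an assumption for illustration.
declare function sendSSE(event: string, payload: unknown): Promise<void>;

function makeSerializedSender() {
  let tail: Promise<void> = Promise.resolve();
  return (event: string, payload: unknown): Promise<void> => {
    const result = tail.then(() => sendSSE(event, payload));
    tail = result.catch(() => {}); // keep the queue alive if one send fails
    return result;
  };
}

// Usage inside a node: every call is queued behind the previous one.
// const send = makeSerializedSender();
// for await (const delta of providerStream) send("llm.chunk", { text: delta });
```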
Today, the `llm_*` nodes (16+ providers under `nodes/src/nodes/llm_*`) call the provider synchronously and only emit their output once the full response is available. Only `agent_*` nodes emit partial events, and they do so via `sendSSE('thinking', ...)` — a channel designed for low-frequency state signals, not per-token streams.

This document presents three architectural alternatives, their diagrams, and the tradeoffs of each. The intent is to pick one before starting the cross-module implementation.
2. Current architecture (before)
```mermaid
sequenceDiagram
  participant UI as chat-ui / vscode
  participant Client as client SDK
  participant Engine as Engine C++ Pipeline
  participant Node as llm_anthropic
  participant Provider as Anthropic API
  UI->>Client: ask question
  Client->>Engine: submit pipeline run
  Engine->>Node: writeQuestions
  Node->>Provider: chat invoke (blocking)
  Note over Provider: model generates<br/>tokens internally
  Provider-->>Node: full response
  Node-->>Engine: emit answers lane (single write)
  Engine-->>Client: notification lane output
  Client-->>UI: render message (atomic)
```

Key property: the UI receives nothing until the node finishes. Only `agent_*` emits `sendSSE('thinking', ...)` during execution, but those events are state signals, not tokens.
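For contrast, here is a hypothetical sketch of the blocking call today versus the streaming variant any of the three alternatives would need in the `llm_*` nodes. The `provider.chat` / `provider.chatStream` names are invented for illustration and do not correspond to any real provider API in the repo.

```typescript
// Hypothetical shape of the current blocking call vs. the streaming variant
// the alternatives above would require. Provider names are invented.
declare const provider: {
  chat(prompt: string): Promise<string>;             // blocking: full text
  chatStream(prompt: string): AsyncIterable<string>; // streaming: deltas
};

// Today: the lane output is written once, after the provider returns.
async function runBlocking(prompt: string): Promise<string> {
  return provider.chat(prompt);                      // UI sees nothing until here
}

// After: deltas are published as they arrive; the lane still gets full text.
async function runStreaming(
  prompt: string,
  publish: (delta: string) => void
): Promise<string> {
  let full = "";
  for await (const delta of provider.chatStream(prompt)) {
    publish(delta);                                  // per-token UI update
    full += delta;
  }
  return full;                                       // canonical lane output
}
```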