Skip to content

fix(workflow): reject non-JoinNode fan-in at validation time#1041

Open
wolo-lab wants to merge 2 commits into
v2from
wolo/v2-h6-fanin-validation
Open

fix(workflow): reject non-JoinNode fan-in at validation time#1041
wolo-lab wants to merge 2 commits into
v2from
wolo/v2-h6-fanin-validation

Conversation

@wolo-lab

Copy link
Copy Markdown
Contributor

Problem

In a diamond like Start → A, A → B, A → C, B → D, C → D where D is a plain node (not a JoinNode), D has two predecessors. When B and C each complete, the scheduler schedules D twice.
The scheduler keys its per-node bookkeeping (runsByName / runCancels) by node name and has no "already running" guard, so the second activation of D overwrites the first while it's still running:

  • the first activation's cancel func is lost (leak / uncancellable),
  • both activations accumulate into the same per-name run, mixing outputs → ErrMultipleOutputs / wrong output.
    JoinNode fan-in is safe (it's barrier-gated — it waits for all predecessors and gets the aggregated input), but nothing stopped a plain node from being used as a fan-in target, so the corruption happened silently.

How adk-python handles it (for context)

adk-python's v2 graph engine allows plain fan-in: it buffers a trigger per predecessor and serializes per-node execution (_schedule_ready_nodes skips a node already RUNNING), so a plain fan-in node runs once per trigger, never concurrently. adk-go has no such per-node serialization yet.

Solution

Rather than serialization, reject the unsupported graph at construction with a clear, honest error (ErrUnsupportedFanIn: "non-JoinNode fan-in is not yet supported ... use a JoinNode to converge branches"). This turns silent state corruption into a fail-fast build error, and the message signals it's a temporary limitation. When per-node serialization lands later, the check can be removed.

The check counts only unconditional (Route == nil) incoming edges, so it rejects exactly the diamond case while leaving legitimate graphs alone:

  • loop-back (a conditional back-edge gives a node a 2nd predecessor, but they don't fire together) — allowed,
  • conditional fan-in (only one branch routes in at a time) — allowed,
  • JoinNode fan-in — exempt (barrier-gated).

A plain (non-Join) node with two or more unconditional incoming edges is
activated once per completed predecessor. The scheduler keys bookkeeping
(runsByName / runCancels) by node name with no already-running guard, so
the second activation overwrites the first while it is still running:
mixed outputs (ErrMultipleOutputs / wrong output) and a lost cancel func.

adk-python serializes per-node executions (skip-if-RUNNING) and allows
plain fan-in, but adk-go has no such serialization yet. Until it does,
reject the graph at construction with a clear "not yet supported" error
instead of corrupting the run.

Only unconditional (Route == nil) incoming edges are counted, so
conditional fan-in and loop-back back-edges — whose predecessors don't
all fire together — are not rejected. JoinNode (barrier-gated) fan-in is
exempt.

Add validation tests (diamond rejected; JoinNode diamond, loop-back, and
conditional fan-in allowed) plus an e2e check that New() surfaces it.
@wolo-lab wolo-lab requested a review from hanorik June 15, 2026 21:27
@wolo-lab wolo-lab marked this pull request as ready for review June 15, 2026 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants