FIFO workflow completion, durable replay queue, and failPendingTasks (#211)
sethconvex wants to merge 1 commit into executor-mode-core
Core executor performance improvements:

- `CLAIM_LIMIT=50` (matches `MAX_CONCURRENCY`), so executors re-query the task queue frequently, picking up later steps for earlier workflows instead of always grabbing step-0 tasks for newer workflows.
- Task queue indexed by `[shard, workflowCreatedAt]` (ascending), so tasks for earlier-created workflows are always claimed first. This gives FIFO completion ordering: the first 2k of 20k workflows finish in ~4 min median, while the last 2k take ~20 min.
- 50 concurrency × 100 shards = 5,000 concurrent slots. This works because real-world throughput is gated by the LLM API (Anthropic), not local compute. Higher per-shard concurrency wastes V8 memory (64 MB limit) holding idle HTTP connections.
- Durable replay queue: replaces the fire-and-forget `scheduler.runAfter` safety net with a persistent `replayQueue` table. Entries are inserted atomically with result recording and deleted only after a successful replay. Eliminates permanently stuck workflows.
- `failPendingTasks` mutation: force-fails all queued tasks in a shard, marks steps as failed, and inserts replay entries so workflows complete (as failures) rather than getting stuck forever.
- `bumpEpoch` mutation: stops running executors without starting new ones. Executors detect the stale epoch and drain gracefully.
- `clearReplayQueue`/`clearTaskQueue`: shard-indexed bulk cleanup for operational recovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
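The FIFO claim ordering described above can be modeled with a small in-memory sketch. This is not the PR's actual Convex code: the `Task` shape and `claimBatch` function are illustrative, while `CLAIM_LIMIT=50` and the `[shard, workflowCreatedAt]` ascending ordering come from the description.

```typescript
// In-memory model of task claiming. The real system queries a Convex
// index on [shard, workflowCreatedAt]; here a filter + sort stands in.

const CLAIM_LIMIT = 50; // matches MAX_CONCURRENCY per the PR description

interface Task {
  shard: number;
  workflowCreatedAt: number; // creation time of the owning workflow
  step: number;
}

// Mimics an ascending index scan over [shard, workflowCreatedAt],
// taking at most CLAIM_LIMIT tasks per batch.
function claimBatch(queue: Task[], shard: number): Task[] {
  return queue
    .filter((t) => t.shard === shard)
    .sort((a, b) => a.workflowCreatedAt - b.workflowCreatedAt)
    .slice(0, CLAIM_LIMIT);
}
```

Because earlier-created workflows sort first, a step-3 task for an old workflow is claimed before a step-0 task for a newly created one, which is exactly what produces the FIFO completion ordering.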
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. This stack of pull requests is managed by Graphite.

Summary
Core executor performance improvements for high-throughput workflow processing:
- FIFO completion ordering: Task queue indexed by `[shard, workflowCreatedAt]` so earlier-created workflows complete first. `CLAIM_LIMIT` matches `MAX_CONCURRENCY` (both 50), so executors re-query frequently, picking up later steps for earlier workflows instead of always grabbing step-0 tasks for newer ones.
- Concurrency tuning: 50 concurrency × 100 shards = 5,000 concurrent slots. This works because real-world throughput is gated by the LLM API (Anthropic), not local compute; higher per-shard concurrency wastes V8 memory (64 MB limit per action) holding idle HTTP connections and response buffers.
- Durable replay queue: Replaces the fire-and-forget `scheduler.runAfter` safety net with a persistent `replayQueue` table. Entries are inserted atomically with result recording and deleted only after successful replay. Invariant: a workflow with `runResult = null` and no in-progress steps always has a row in `replayQueue`. Eliminates permanently stuck workflows.
- `failPendingTasks`: Force-fails all queued tasks in a shard, marks steps as failed, inserts replay entries so workflows complete (as failures) rather than getting stuck. Useful for operational cleanup after crashes.
- `bumpEpoch`: Stops running executors without starting new ones. Executors detect the stale epoch, drain in-flight tasks, and exit gracefully.
- `clearReplayQueue`/`clearTaskQueue`: Shard-indexed bulk cleanup for operational recovery.

Benchmark results (20k real Claude Haiku workflows)
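The replay-queue invariant described in the summary can be sketched as a small in-memory model. The class and method names here are illustrative, not the PR's actual Convex tables or mutations; only the invariant itself (a workflow with `runResult = null` and no in-progress steps always has a replay row) comes from the description.

```typescript
// In-memory model of the durable replay queue. In the real system,
// workflows and replayQueue are Convex tables and the updates below
// run inside atomic mutations; plain Maps stand in here.

type WorkflowId = string;

interface WorkflowState {
  runResult: "success" | "failure" | null;
  inProgressSteps: number;
}

class ReplayModel {
  workflows = new Map<WorkflowId, WorkflowState>();
  replayQueue = new Set<WorkflowId>();

  // Result recording and replay-queue insertion happen together,
  // mirroring the atomic mutation described in the PR.
  recordStepResult(id: WorkflowId): void {
    const wf = this.workflows.get(id)!;
    wf.inProgressSteps -= 1;
    if (wf.runResult === null && wf.inProgressSteps === 0) {
      this.replayQueue.add(id); // inserted atomically with the result
    }
  }

  // Replay entries are deleted only after the replay succeeds.
  replay(id: WorkflowId, outcome: "success" | "failure"): void {
    const wf = this.workflows.get(id)!;
    wf.runResult = outcome;
    this.replayQueue.delete(id);
  }

  // Invariant check: runResult = null with no in-progress steps
  // implies a row in replayQueue.
  checkInvariant(): boolean {
    for (const [id, wf] of this.workflows) {
      if (wf.runResult === null && wf.inProgressSteps === 0 && !this.replayQueue.has(id)) {
        return false;
      }
    }
    return true;
  }
}
```

Because the replay entry is created in the same transaction that records the last step result and removed only after a successful replay, there is no window in which a completed-but-unfinalized workflow can be silently dropped, unlike a fire-and-forget scheduled call.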
Test plan
- `failPendingTasks` tested for operational cleanup of stale queues
- `bumpEpoch` tested for stopping executors without starting new ones

🤖 Generated with Claude Code
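The `bumpEpoch` drain behavior can be modeled with a short sketch. All names below are assumptions for illustration, not the PR's actual code; only the behavior (bumping the epoch stops executors from claiming, without starting new ones) comes from the description.

```typescript
// Illustrative model of epoch-based draining: bumpEpoch advances a
// stored epoch, and an executor stops claiming new work once the
// stored epoch no longer matches the epoch it started with.
// In-flight tasks are allowed to finish.

interface EpochStore {
  value: number;
}

// Stops running executors; deliberately does not start new ones.
function bumpEpoch(store: EpochStore): void {
  store.value += 1;
}

class Executor {
  inFlight = 0;

  constructor(private readonly startEpoch: number) {}

  // Checked between claim batches; a stale epoch means "drain and exit".
  isStale(store: EpochStore): boolean {
    return store.value !== this.startEpoch;
  }

  claim(store: EpochStore, tasks: string[]): string[] {
    if (this.isStale(store)) return []; // stop claiming, drain gracefully
    const batch = tasks.splice(0, 50);
    this.inFlight += batch.length;
    return batch;
  }
}
```

Since each executor compares the shared epoch against the value it started with, bumping the epoch is a one-shot broadcast: every running executor sees the mismatch on its next claim cycle and winds down without any per-executor signaling.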