FIFO workflow completion, durable replay queue, and failPendingTasks #211

Open

sethconvex wants to merge 1 commit into executor-mode-core from executor-perf-impl

Conversation


sethconvex commented Feb 26, 2026

Summary

Core executor performance improvements for high-throughput workflow processing:

  • FIFO completion ordering: Task queue indexed by [shard, workflowCreatedAt] so earlier-created workflows complete first. CLAIM_LIMIT matches MAX_CONCURRENCY (both 50) so executors re-query frequently, picking up later steps for earlier workflows instead of always grabbing step-0 tasks for newer ones.
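The claim behavior described above can be modeled as a minimal in-memory sketch. The `Task` shape and `claimTasks` helper here are illustrative assumptions, not the actual Convex implementation, which uses a database index on `[shard, workflowCreatedAt]`:

```typescript
// Illustrative in-memory model of the claim query; the real code queries
// a Convex index on [shard, workflowCreatedAt] instead of sorting in memory.
interface Task {
  shard: number;
  workflowCreatedAt: number; // creation time of the owning workflow
  step: number;
}

const CLAIM_LIMIT = 50; // matches MAX_CONCURRENCY per the PR description

// Claim up to CLAIM_LIMIT tasks from one shard, earliest workflow first.
// Later steps of old workflows therefore beat step-0 tasks of new ones.
function claimTasks(queue: Task[], shard: number): Task[] {
  return queue
    .filter((t) => t.shard === shard)
    .sort((a, b) => a.workflowCreatedAt - b.workflowCreatedAt)
    .slice(0, CLAIM_LIMIT);
}
```

Because each claim batch is small relative to the queue, executors re-run this query often, which is what produces the FIFO completion ordering seen in the benchmark deciles below.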

  • Concurrency tuning: 50 concurrency × 100 shards = 5,000 concurrent slots. This works because real-world throughput is gated by the LLM API (Anthropic), not local compute — higher per-shard concurrency wastes V8 memory (64 MB limit per action) holding idle HTTP connections and response buffers.

  • Durable replay queue: Replaces fire-and-forget scheduler.runAfter safety net with a persistent replayQueue table. Entries are inserted atomically with result recording and only deleted after successful replay. Invariant: a workflow with runResult=null and no in-progress steps always has a row in replayQueue. Eliminates permanently stuck workflows.
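The stated invariant can be expressed as a small checkable predicate. Field names here (`runResult`, `inProgressSteps`) are assumptions for the sketch, not the actual schema:

```typescript
// Sketch of the replay-queue invariant with assumed field names.
interface WorkflowState {
  id: string;
  runResult: string | null;  // null => workflow has not finished
  inProgressSteps: number;   // count of steps currently executing
}

// Invariant: an unfinished workflow with no in-progress steps must have a
// replayQueue row, otherwise nothing will ever drive it to completion.
function isStuck(wf: WorkflowState, replayQueue: ReadonlySet<string>): boolean {
  return wf.runResult === null && wf.inProgressSteps === 0 && !replayQueue.has(wf.id);
}
```

Inserting the replay entry in the same transaction that records a step result is what keeps `isStuck` false at every point in time.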

  • failPendingTasks: Force-fails all queued tasks in a shard, marks steps as failed, inserts replay entries so workflows complete (as failures) rather than getting stuck. Useful for operational cleanup after crashes.
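A rough in-memory model of what this cleanup computes (shapes and the return type are assumptions, not the mutation's actual signature):

```typescript
// Illustrative model of failPendingTasks for one shard.
interface QueuedTask {
  shard: number;
  workflowId: string;
}

interface CleanupResult {
  failedWorkflowIds: string[]; // workflows whose steps get marked failed
  replayEntries: string[];     // replay rows inserted so workflows still complete
  remainingQueue: QueuedTask[];
}

// Drop every queued task in the shard, record the affected workflows as
// failed, and enqueue one replay entry per workflow so each completes
// (as a failure) instead of getting stuck.
function failPendingTasks(queue: QueuedTask[], shard: number): CleanupResult {
  const hit = queue.filter((t) => t.shard === shard);
  const workflowIds = Array.from(new Set(hit.map((t) => t.workflowId)));
  return {
    failedWorkflowIds: workflowIds,
    replayEntries: workflowIds,
    remainingQueue: queue.filter((t) => t.shard !== shard),
  };
}
```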

  • bumpEpoch: Stops running executors without starting new ones. Executors detect the stale epoch, drain in-flight tasks, and exit gracefully.
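The graceful-drain behavior can be sketched as an executor loop that re-reads the epoch between claim batches (names and the loop shape are assumed for illustration):

```typescript
// Sketch of an executor's drain loop. bumpEpoch increments the stored
// epoch; an executor that sees a newer epoch stops claiming new batches,
// while work inside the current claimBatch call still runs to completion.
function runExecutor(
  executorEpoch: number,
  readEpoch: () => number,
  claimBatch: () => number, // processes one batch, returns tasks handled
): number {
  let total = 0;
  while (readEpoch() === executorEpoch) {
    const processed = claimBatch();
    if (processed === 0) break; // queue drained
    total += processed;
  }
  return total; // stale epoch or empty queue: exit without claiming more
}
```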

  • clearReplayQueue/clearTaskQueue: Shard-indexed bulk cleanup for operational recovery.

Benchmark results (20k real Claude Haiku workflows)

19,886 completed | 114 failed (0.57%) | 0 stuck

Timing:
  p50  = 12.1 min
  p90  = 20.8 min
  p99  = 22.1 min
  slow = 26.6 min

Priority ordering (creation-time deciles):
  Earliest 2k workflows: median 4.3 min to complete
  Latest 2k workflows:   median 20.1 min to complete

Test plan

  • 10k real Claude Haiku benchmark: 9,968 completed, 32 failed, 0 stuck
  • 20k real Claude Haiku benchmark: 19,886 completed, 114 failed, 0 stuck
  • Priority analysis confirms monotonically increasing completion time by creation order
  • failPendingTasks tested for operational cleanup of stale queues
  • bumpEpoch tested for stopping executors without starting new ones

🤖 Generated with Claude Code

Core executor performance improvements:

- CLAIM_LIMIT=50 (matches MAX_CONCURRENCY) so executors re-query the
  task queue frequently, picking up later steps for earlier workflows
  instead of always grabbing step-0 tasks for newer workflows.

- Task queue indexed by [shard, workflowCreatedAt] (ascending) so
  tasks for earlier-created workflows are always claimed first.
  This gives FIFO completion ordering: the first 2k of 20k workflows
  finish in ~4 min median while the last 2k take ~20 min.

- 50 concurrency × 100 shards = 5000 concurrent slots. This works
  because real-world throughput is gated by the LLM API (Anthropic),
  not local compute. Higher per-shard concurrency wastes V8 memory
  (64 MB limit) holding idle HTTP connections.

- Durable replay queue: replaces fire-and-forget scheduler.runAfter
  safety net with a persistent replayQueue table. Entries are inserted
  atomically with result recording and only deleted after successful
  replay. Eliminates permanently stuck workflows.

- failPendingTasks mutation: force-fails all queued tasks in a shard,
  marks steps as failed, and inserts replay entries so workflows
  complete (as failures) rather than getting stuck forever.

- bumpEpoch mutation: stops running executors without starting new
  ones. Executors detect the stale epoch and drain gracefully.

- clearReplayQueue/clearTaskQueue: shard-indexed bulk cleanup for
  operational recovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sethconvex commented Feb 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.


pkg-pr-new bot commented Feb 26, 2026


npm i https://pkg.pr.new/get-convex/workflow/@convex-dev/workflow@211

commit: a634363

coderabbitai bot commented Feb 26, 2026

Important

Review skipped: auto reviews are disabled on base/target branches other than the default branch. Check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

