Session concurrency overwhelms sandbox provisioning — no admission control #143

@rita-aga

Problem

When multiple CurationJobs are submitted simultaneously, all sessions spawn and attempt to provision E2B sandboxes in parallel. The sandbox provisioner has no backpressure, concurrency limiting, or queuing — it fires HTTP POSTs to the E2B API directly from the WASM action handler. Combined with no admission control on the CurationJob entity, this causes resource starvation and liveness timeouts.

Evidence

10 regeneration jobs submitted simultaneously on Railway production:

| Session | Seq Nr | Turns | Sandbox | Result |
|---|---|---|---|---|
| Chapterglow | 96 | 14 | sb-WtnVkEEzNNuhdt6NbQ6J07 | Completed |
| Flat Illustration | 131 | 20 | sb-27eKYqeST9o6qfDbhzzYrX | Completed |
| City Pop | 102 | 15 | sb-UUHuQRgBUFitlZOxS45g9a | Completed |
| WhimsiCollage | 53 | 7 | sb-huQpisC7RbhEIezftyhl0n | Completed |
| Neumorphic | 5 | 0 | NONE | Failed (30m timeout) |
| Watercolor | 5 | 0 | NONE | Failed (30m timeout) |
| Anime Scenic | 10 | 0 | NONE | Failed (30m timeout) |
| Neo Kawaii | 14 | 1 | NONE | Failed (30m timeout) |
| Kawaii Watercolor | 19 | 2 | NONE | Failed (30m timeout) |
| Shibuya | 5 | 0 | NONE | Failed (race condition, see #142) |

Pattern

  • 4 sessions got sandboxes and completed. These were the first to reach E2B.
  • 6 sessions never got sandboxes. 5 of them appear to have been starved at the E2B API level; the sixth (Shibuya) failed on the race condition tracked in #142.
  • Progressive degradation among the 5 starved sessions: 2 never prepared context (seq=5), 1 prepared context but never got a turn (seq=10), and 2 got 1-2 turns but couldn't proceed without a sandbox (seq=14, 19).
  • All 5 starved sessions hit the CurationJob Running state liveness timeout (30 minutes, reset_on = ["RecordProgress"]).

Root Cause Analysis

Three missing controls compound:

1. No admission control on CurationJob

curation_job.ioa.toml has no [admission] block. All 10 Submit actions dispatch immediately and spawn sessions in parallel. No queuing, no concurrency cap.

File: katagami-curation/specs/curation_job.ioa.toml

2. No sandbox provisioner backpressure

sandbox_provisioner/src/lib.rs makes a direct HTTP POST to E2B /sandboxes from within the WASM action handler. No local rate limiting, no retry with backoff, no concurrency semaphore. When 10 sessions hit E2B simultaneously, the API rate-limits or times out on later requests.

File: temper-agent/wasm/sandbox_provisioner/src/lib.rs
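
As a rough illustration of the missing "retry with backoff" piece, here is a minimal sketch in plain Rust. The names `backoff_delay` and `retry_with_backoff`, the 500 ms base delay, and the 30 s cap are all assumptions made for this example and are not taken from the provisioner. The `sleep` callback is injected because the way a WASM action handler should wait is host-specific.

```rust
use std::time::Duration;

/// Exponential backoff delay for retry `attempt` (0-based), capped at `max`.
/// Hypothetical helper, not part of the existing provisioner.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    // If the multiplication overflows, fall back to the cap.
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Calls `op` up to `max_attempts` times, sleeping between failures.
/// `op` stands in for the HTTP POST to E2B /sandboxes.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                // Assumed defaults: 500 ms base delay, 30 s cap.
                sleep(backoff_delay(
                    attempt,
                    Duration::from_millis(500),
                    Duration::from_secs(30),
                ));
                attempt += 1;
            }
        }
    }
}
```

A production version would probably also add jitter and retry only on rate-limit or timeout responses, which this sketch does not model.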

3. Liveness timeout doesn't account for queued/waiting sessions

The CurationJob Running state has after_seconds = 1800 with reset_on = ["RecordProgress"]. Sessions that are waiting for sandbox provisioning or LLM API capacity never call RecordProgress, so they're indistinguishable from truly stuck sessions. The liveness timeout kills them.
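
To make the failure mode concrete, here is a small model of a liveness timer with `reset_on` semantics. It is a sketch of the behaviour described above, not Temper's actual implementation: the `LivenessTimer` type and the `WaitingForSandbox` action name used in the usage note are hypothetical.

```rust
/// Hypothetical model of a state liveness timeout (not Temper's actual API).
/// Times are plain seconds since the state was entered.
struct LivenessTimer {
    after_seconds: u64,
    reset_on: Vec<&'static str>,
    last_reset: u64,
}

impl LivenessTimer {
    fn new(after_seconds: u64, reset_on: &[&'static str]) -> Self {
        Self {
            after_seconds,
            reset_on: reset_on.to_vec(),
            last_reset: 0,
        }
    }

    /// Only actions listed in `reset_on` restart the clock.
    fn observe(&mut self, action: &str, now: u64) {
        if self.reset_on.iter().any(|a| *a == action) {
            self.last_reset = now;
        }
    }

    /// True once `after_seconds` have passed since the last reset.
    fn expired(&self, now: u64) -> bool {
        now.saturating_sub(self.last_reset) >= self.after_seconds
    }
}
```

With `reset_on = ["RecordProgress"]`, a session that only emits a hypothetical `WaitingForSandbox` action at t=1700 still expires at t=1800, while a session that calls `RecordProgress` at t=1700 stays alive until t=3500.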

Proposed Fix

Short-term: Admission control on CurationJob

Add an [admission] block to gate the Submit action:

```toml
[admission]
max_concurrent_actions = { "Submit" = 4 }
queue_timeout_seconds = 1800
```

This caps concurrency at the Temper level: at most 4 sessions spawn at a time, and the remaining jobs queue until capacity frees up.
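
The intended behaviour of such a gate can be sketched as follows. This is an illustrative model only: `AdmissionGate` and its methods are hypothetical names, the queue is assumed to be FIFO, and the `queue_timeout_seconds` expiry is left out for brevity.

```rust
use std::collections::VecDeque;

/// Hypothetical admission gate: at most `max_concurrent` jobs run,
/// the rest wait in FIFO order. Not Temper's actual implementation.
struct AdmissionGate {
    max_concurrent: usize,
    running: Vec<String>,
    queued: VecDeque<String>,
}

impl AdmissionGate {
    fn new(max_concurrent: usize) -> Self {
        Self {
            max_concurrent,
            running: Vec::new(),
            queued: VecDeque::new(),
        }
    }

    /// Returns true if the job was admitted immediately, false if it was queued.
    fn submit(&mut self, job: &str) -> bool {
        if self.running.len() < self.max_concurrent {
            self.running.push(job.to_string());
            true
        } else {
            self.queued.push_back(job.to_string());
            false
        }
    }

    /// Marks a running job as finished and admits the next queued job, if any.
    /// Returns the name of the newly admitted job.
    fn complete(&mut self, job: &str) -> Option<String> {
        let idx = self.running.iter().position(|j| j.as_str() == job)?;
        self.running.swap_remove(idx);
        let next = self.queued.pop_front()?;
        self.running.push(next.clone());
        Some(next)
    }
}
```

Replaying the incident against this model with a limit of 4: submitting 10 jobs admits 4 and queues 6, and each completion admits exactly one queued job.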

Medium-term: Sandbox provisioner concurrency control

Add a per-tenant semaphore in sandbox_provisioner WASM:

  • Track active sandbox creation requests
  • Cap at 4-5 concurrent E2B API calls
  • Queue additional requests with backoff
  • Report queuing status so liveness timeouts can be extended
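
One way to sketch the concurrency cap is a counting semaphore built from standard-library primitives. This assumes the provisioner has state shared across invocations and a threading model where blocking is acceptable, neither of which this issue establishes for the WASM handler. The `Semaphore` and `Permit` types are hypothetical. A per-tenant version would keep one semaphore per tenant ID.

```rust
use std::sync::{Condvar, Mutex};

/// Counting semaphore built from std primitives (hypothetical helper).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

/// Held while an E2B creation request is in flight; releases on drop.
struct Permit<'a> {
    sem: &'a Semaphore,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self {
            permits: Mutex::new(permits),
            available: Condvar::new(),
        }
    }

    /// Blocks until a permit is free.
    fn acquire(&self) -> Permit<'_> {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
        Permit { sem: self }
    }

    /// Non-blocking variant: `None` means the caller should queue the
    /// request and report a "waiting" status instead of calling E2B.
    fn try_acquire(&self) -> Option<Permit<'_>> {
        let mut permits = self.permits.lock().unwrap();
        if *permits == 0 {
            return None;
        }
        *permits -= 1;
        Some(Permit { sem: self })
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        // Return the permit and wake one waiter, if any.
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.available.notify_one();
    }
}
```

The `try_acquire` path is the one that lines up with the last bullet: a `None` result is the point at which the provisioner could report that the session is queued rather than stuck.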

Long-term: Session scheduling with resource awareness

The session scheduler should track:

  • Available sandbox capacity (E2B account limits)
  • LLM API rate limits (Anthropic tier)
  • Active session count per tenant

It should then gate session Start / ProcessToolCalls based on available capacity.
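
A capacity-aware gate along these lines might look like the sketch below. Everything in it is an assumption made for illustration: the `Capacity` and `Decision` types, the field names, and the idea that each session start consumes one sandbox slot and one unit of LLM budget.

```rust
use std::collections::HashMap;

/// Hypothetical snapshot of the resources the scheduler tracks.
struct Capacity {
    sandbox_slots_free: u32,
    llm_requests_free: u32,
    max_sessions_per_tenant: u32,
    active_sessions: HashMap<String, u32>,
}

/// Outcome of an admission check; `Wait` carries the reason.
#[derive(Debug, PartialEq)]
enum Decision {
    Start,
    Wait(&'static str),
}

impl Capacity {
    /// Admits a session only if every tracked resource has headroom.
    fn try_start(&mut self, tenant: &str) -> Decision {
        let active = self.active_sessions.get(tenant).copied().unwrap_or(0);
        if active >= self.max_sessions_per_tenant {
            return Decision::Wait("tenant session limit");
        }
        if self.sandbox_slots_free == 0 {
            return Decision::Wait("no sandbox capacity");
        }
        if self.llm_requests_free == 0 {
            return Decision::Wait("LLM rate limit");
        }
        // Reserve the resources before the session starts.
        self.sandbox_slots_free -= 1;
        self.llm_requests_free -= 1;
        *self.active_sessions.entry(tenant.to_string()).or_insert(0) += 1;
        Decision::Start
    }
}
```

Returning a reason with `Wait` gives the caller something to report, so a waiting session can be told apart from a stuck one.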

Workaround

Submit jobs in batches of 3-4, wait for completion, then submit the next batch. In the previous session, jobs were submitted manually one at a time and all 11 completed.
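
If the workaround is scripted rather than done by hand, the batching logic is small. The sketch below is generic on purpose: `run_in_batches` is a hypothetical helper, and `run_batch` stands in for whatever submits a batch of jobs and blocks until they all finish.

```rust
/// Runs `jobs` in consecutive batches of at most `batch_size`, waiting for
/// each batch to finish before starting the next. Hypothetical helper.
fn run_in_batches<J, R>(
    jobs: &[J],
    batch_size: usize,
    mut run_batch: impl FnMut(&[J]) -> Vec<R>,
) -> Vec<R> {
    let mut results = Vec::with_capacity(jobs.len());
    // `chunks` panics on a size of 0, so clamp to at least 1.
    for batch in jobs.chunks(batch_size.max(1)) {
        results.extend(run_batch(batch));
    }
    results
}
```

With 10 jobs and a batch size of 4, `run_batch` is called with batches of 4, 4, and 2 jobs.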
