Problem
When multiple CurationJobs are submitted simultaneously, all sessions spawn and attempt to provision E2B sandboxes in parallel. The sandbox provisioner has no backpressure, concurrency limiting, or queuing — it fires HTTP POSTs to the E2B API directly from the WASM action handler. Combined with no admission control on the CurationJob entity, this causes resource starvation and liveness timeouts.
Evidence
10 regeneration jobs submitted simultaneously on Railway production:
| Session | Seq Nr | Turns | Sandbox? | Result |
|---|---|---|---|---|
| Chapterglow | 96 | 14 | sb-WtnVkEEzNNuhdt6NbQ6J07 | Completed |
| Flat Illustration | 131 | 20 | sb-27eKYqeST9o6qfDbhzzYrX | Completed |
| City Pop | 102 | 15 | sb-UUHuQRgBUFitlZOxS45g9a | Completed |
| WhimsiCollage | 53 | 7 | sb-huQpisC7RbhEIezftyhl0n | Completed |
| Neumorphic | 5 | 0 | NONE | Failed (30m timeout) |
| Watercolor | 5 | 0 | NONE | Failed (30m timeout) |
| Anime Scenic | 10 | 0 | NONE | Failed (30m timeout) |
| Neo Kawaii | 14 | 1 | NONE | Failed (30m timeout) |
| Kawaii Watercolor | 19 | 2 | NONE | Failed (30m timeout) |
| Shibuya | 5 | 0 | NONE | Failed (race condition, see #142) |
Pattern
- 4 sessions got sandboxes and completed — these were first to reach E2B
- 6 sessions never got sandboxes — starved at the E2B API level
- Progressive degradation: 2 never prepared context (seq=5), 1 prepared context but never got a turn (seq=10), 2 got 1-2 turns but couldn't proceed without sandbox (seq=14, 19)
- All 6 hit the CurationJob Running state liveness timeout (30 minutes, reset_on = ["RecordProgress"])
Root Cause Analysis
Three missing controls compound:
1. No admission control on CurationJob
curation_job.ioa.toml has no [admission] block. All 10 Submit actions dispatch immediately and spawn sessions in parallel. No queuing, no concurrency cap.
File: katagami-curation/specs/curation_job.ioa.toml
2. No sandbox provisioner backpressure
sandbox_provisioner/src/lib.rs makes a direct HTTP POST to E2B /sandboxes from within the WASM action handler. No local rate limiting, no retry with backoff, no concurrency semaphore. When 10 sessions hit E2B simultaneously, the API rate-limits or times out on later requests.
File: temper-agent/wasm/sandbox_provisioner/src/lib.rs
3. Liveness timeout doesn't account for queued/waiting sessions
The CurationJob Running state has after_seconds = 1800 with reset_on = ["RecordProgress"]. Sessions that are waiting for sandbox provisioning or LLM API capacity never call RecordProgress, so they're indistinguishable from truly stuck sessions. The liveness timeout kills them.
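The timer behaviour described above can be modelled in a few lines. This is a toy sketch, not Temper's implementation: the `Event` enum, the `Waiting` variant, and `timeout_fires_at` are hypothetical names introduced only for illustration. It assumes the timer starts when the job enters Running and that only RecordProgress resets it.

```rust
/// Toy model of a liveness timer with a single reset event.
/// All names are illustrative; this is not Temper's code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    /// Corresponds to the RecordProgress action listed in reset_on.
    RecordProgress,
    /// Hypothetical: the session is alive but waiting for a sandbox or LLM capacity.
    Waiting,
}

/// Returns the time (seconds since entering Running) at which the timeout
/// fires, or None if it does not fire within `horizon` seconds.
/// `events` must be sorted by timestamp.
fn timeout_fires_at(after_seconds: u64, events: &[(u64, Event)], horizon: u64) -> Option<u64> {
    let mut deadline = after_seconds;
    for &(t, ev) in events {
        if t >= deadline {
            break; // the timer already fired before this event arrived
        }
        if ev == Event::RecordProgress {
            deadline = t + after_seconds; // only RecordProgress pushes the deadline out
        }
        // Event::Waiting is ignored: the timer cannot tell waiting from stuck.
    }
    if deadline <= horizon {
        Some(deadline)
    } else {
        None
    }
}
```

In this model a session that only ever emits `Waiting` is killed at exactly 1800 seconds, the same as one that emits nothing. A session that records progress at 900 and 2000 seconds survives a 3600-second horizon.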
Proposed Fix
Short-term: Admission control on CurationJob
Add an [admission] block to gate the Submit action:
[admission]
max_concurrent_actions = { "Submit" = 4 }
queue_timeout_seconds = 1800
This caps concurrency at the Temper level: only 4 sessions spawn at a time, and the remaining jobs queue until capacity frees up.
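To make the intended queuing behaviour concrete, here is a minimal model of an admission gate. It is a sketch under stated assumptions, not how Temper implements [admission]: the `AdmissionGate` type and its methods are hypothetical, the queue is assumed to be FIFO, and queue_timeout_seconds is assumed to be measured from the moment a job is submitted.

```rust
use std::collections::VecDeque;

/// Illustrative admission gate: at most `max_concurrent` jobs run at once,
/// the rest wait in FIFO order. Hypothetical type, not Temper's code.
struct AdmissionGate {
    max_concurrent: usize,
    queue_timeout_seconds: u64,
    running: Vec<String>,
    queued: VecDeque<(String, u64)>, // (job name, time it was queued)
}

impl AdmissionGate {
    fn new(max_concurrent: usize, queue_timeout_seconds: u64) -> Self {
        Self {
            max_concurrent,
            queue_timeout_seconds,
            running: Vec::new(),
            queued: VecDeque::new(),
        }
    }

    /// Submit a job at time `now`; returns true if it starts immediately.
    fn submit(&mut self, job: &str, now: u64) -> bool {
        if self.running.len() < self.max_concurrent {
            self.running.push(job.to_string());
            true
        } else {
            self.queued.push_back((job.to_string(), now));
            false
        }
    }

    /// Mark `job` finished at time `now`, then admit queued jobs into the
    /// freed slots. Jobs that waited longer than the queue timeout are dropped.
    /// Returns the names of the jobs admitted by this call.
    fn complete(&mut self, job: &str, now: u64) -> Vec<String> {
        self.running.retain(|j| j != job);
        let mut admitted = Vec::new();
        while self.running.len() < self.max_concurrent {
            match self.queued.pop_front() {
                Some((next, queued_at)) => {
                    if now.saturating_sub(queued_at) > self.queue_timeout_seconds {
                        continue; // expired while waiting in the queue
                    }
                    self.running.push(next.clone());
                    admitted.push(next);
                }
                None => break,
            }
        }
        admitted
    }
}
```

With a cap of 4 and 10 jobs submitted at once, 4 start and 6 queue. Under the submission-time assumption, a queued job whose predecessors take longer than 1800 seconds would expire before it is admitted, so the queue timeout may need to exceed the expected wait for the last wave.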
Medium-term: Sandbox provisioner concurrency control
Add a per-tenant semaphore in sandbox_provisioner WASM:
- Track active sandbox creation requests
- Cap at 4-5 concurrent E2B API calls
- Queue additional requests with backoff
- Report queuing status so liveness timeouts can be extended
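The four points above could look roughly like the following. This is a minimal sketch, not the provisioner's actual code: `ProvisionLimiter`, `Admission`, and the 1-second base and 60-second maximum delay are all assumptions made for illustration. Where the counter would live across WASM invocations is left open.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Outcome of asking for permission to call the E2B API.
#[derive(Debug, PartialEq)]
enum Admission {
    Granted,
    /// The caller should wait `retry_after` and report that it is queued,
    /// so a liveness timer can treat it as waiting rather than stuck.
    Queued { retry_after: Duration },
}

/// Hypothetical per-tenant cap on in-flight sandbox creation requests.
struct ProvisionLimiter {
    max_in_flight: usize,
    base_delay: Duration, // assumed value, tune against real E2B limits
    max_delay: Duration,  // assumed value
    in_flight: HashMap<String, usize>,
}

impl ProvisionLimiter {
    fn new(max_in_flight: usize) -> Self {
        Self {
            max_in_flight,
            base_delay: Duration::from_secs(1),
            max_delay: Duration::from_secs(60),
            in_flight: HashMap::new(),
        }
    }

    /// Exponential backoff: base_delay * 2^attempt, capped at max_delay.
    fn backoff(&self, attempt: u32) -> Duration {
        let factor = 2u32.saturating_pow(attempt);
        self.base_delay.saturating_mul(factor).min(self.max_delay)
    }

    /// Try to reserve a slot for `tenant`. `attempt` is the number of
    /// previous failed tries and only affects the suggested retry delay.
    fn try_acquire(&mut self, tenant: &str, attempt: u32) -> Admission {
        let retry_after = self.backoff(attempt);
        let count = self.in_flight.entry(tenant.to_string()).or_insert(0);
        if *count < self.max_in_flight {
            *count += 1;
            Admission::Granted
        } else {
            Admission::Queued { retry_after }
        }
    }

    /// Release a slot once the sandbox creation call has returned.
    fn release(&mut self, tenant: &str) {
        if let Some(count) = self.in_flight.get_mut(tenant) {
            *count = (*count).saturating_sub(1);
        }
    }
}
```

A caller would wrap the HTTP POST to E2B between `try_acquire` and `release`, and surface the `Queued` outcome as a status update instead of staying silent.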
Long-term: Session scheduling with resource awareness
The session scheduler should track:
- Available sandbox capacity (E2B account limits)
- LLM API rate limits (Anthropic tier)
- Active session count per tenant
It should then gate session Start / ProcessToolCalls based on available capacity.
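As a rough illustration of such a gate, the sketch below checks the three tracked resources before letting a session proceed. The `Capacity` struct, its field names, and `gate_start` are hypothetical. How the scheduler would obtain these numbers from E2B or Anthropic is not covered here.

```rust
/// Hypothetical snapshot of the resources the scheduler would track.
struct Capacity {
    sandbox_slots_free: u32,     // remaining E2B sandbox capacity
    llm_requests_free: u32,      // remaining LLM API budget in the current window
    tenant_active_sessions: u32, // sessions this tenant is already running
    tenant_session_cap: u32,     // per-tenant limit (assumed to exist)
}

/// Decision for a Start or ProcessToolCalls request.
#[derive(Debug, PartialEq)]
enum Gate {
    Proceed,
    /// Hold the request and report why, so the wait is visible to liveness checks.
    Wait(&'static str),
}

/// Check the cheapest, most local limit first, then the external ones.
fn gate_start(c: &Capacity) -> Gate {
    if c.tenant_active_sessions >= c.tenant_session_cap {
        Gate::Wait("tenant session cap reached")
    } else if c.sandbox_slots_free == 0 {
        Gate::Wait("no sandbox capacity")
    } else if c.llm_requests_free == 0 {
        Gate::Wait("LLM rate limit exhausted")
    } else {
        Gate::Proceed
    }
}
```

Returning a reason with `Wait` keeps the decision observable, which matters because silent waiting is what the current liveness timeout cannot distinguish from a hang.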
Workaround
Submit jobs in batches of 3-4, wait for completion, then submit the next batch. The previous session did this manually in its most conservative form (one job at a time), and all 11 jobs completed.