Session concurrency overwhelms sandbox provisioning — no admission control #143

## Problem

When multiple CurationJobs are submitted simultaneously, all sessions spawn and attempt to provision E2B sandboxes in parallel. The sandbox provisioner has no backpressure, concurrency limiting, or queuing — it fires HTTP POSTs to the E2B API directly from the WASM action handler. Combined with no admission control on the CurationJob entity, this causes resource starvation and liveness timeouts.

## Evidence

10 regeneration jobs submitted simultaneously on Railway production:

| Session | Seq Nr | Turns | Sandbox? | Result |
|---------|--------|-------|----------|--------|
| Chapterglow | 96 | 14 | `sb-WtnVkEEzNNuhdt6NbQ6J07` | **Completed** |
| Flat Illustration | 131 | 20 | `sb-27eKYqeST9o6qfDbhzzYrX` | **Completed** |
| City Pop | 102 | 15 | `sb-UUHuQRgBUFitlZOxS45g9a` | **Completed** |
| WhimsiCollage | 53 | 7 | `sb-huQpisC7RbhEIezftyhl0n` | **Completed** |
| Neumorphic | 5 | 0 | NONE | **Failed** (30m timeout) |
| Watercolor | 5 | 0 | NONE | **Failed** (30m timeout) |
| Anime Scenic | 10 | 0 | NONE | **Failed** (30m timeout) |
| Neo Kawaii | 14 | 1 | NONE | **Failed** (30m timeout) |
| Kawaii Watercolor | 19 | 2 | NONE | **Failed** (30m timeout) |
| Shibuya | 5 | 0 | NONE | **Failed** (race condition, see #142) |

### Pattern

- **4 sessions got sandboxes and completed**: these were first to reach E2B
- **6 sessions never got sandboxes**: 5 were starved at the E2B API level, and 1 (Shibuya) failed on a separate race condition (see #142)
- Progressive degradation among the 5 starved sessions: 2 never prepared context (seq=5), 1 prepared context but never got a turn (seq=10), 2 got 1-2 turns but couldn't proceed without a sandbox (seq=14, 19)
- All 5 starved sessions hit the CurationJob `Running` state liveness timeout (30 minutes, `reset_on = ["RecordProgress"]`)

## Root Cause Analysis

Three missing controls compound:

### 1. No admission control on CurationJob

`curation_job.ioa.toml` has no `[admission]` block. All 10 `Submit` actions dispatch immediately and spawn sessions in parallel. No queuing, no concurrency cap.

**File**: `katagami-curation/specs/curation_job.ioa.toml`

### 2. No sandbox provisioner backpressure

`sandbox_provisioner/src/lib.rs` makes a direct HTTP POST to E2B `/sandboxes` from within the WASM action handler. No local rate limiting, no retry with backoff, no concurrency semaphore. When 10 sessions hit E2B simultaneously, the API rate-limits or times out on later requests.

**File**: `temper-agent/wasm/sandbox_provisioner/src/lib.rs`

### 3. Liveness timeout doesn't account for queued/waiting sessions

The CurationJob `Running` state has `after_seconds = 1800` with `reset_on = ["RecordProgress"]`. Sessions that are waiting for sandbox provisioning or LLM API capacity never call `RecordProgress`, so they're indistinguishable from truly stuck sessions. The liveness timeout kills them.
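To make the failure mode concrete, here is a toy model of the timeout semantics described above. It is a sketch of the behaviour as this issue describes it, not Temper's implementation; `LivenessTimer`, `observe`, `expired`, and the `WaitingForSandbox` action name are all hypothetical.

```rust
/// Toy model of a state liveness timeout (hypothetical type, not Temper's).
pub struct LivenessTimer {
    after_seconds: u64,
    reset_on: Vec<&'static str>,
    /// Timestamp (in seconds) of state entry or of the last resetting action.
    last_reset_at: u64,
}

impl LivenessTimer {
    pub fn new(after_seconds: u64, reset_on: &[&'static str], entered_at: u64) -> Self {
        Self {
            after_seconds,
            reset_on: reset_on.to_vec(),
            last_reset_at: entered_at,
        }
    }

    /// Record an action. Only actions listed in `reset_on` push the deadline out.
    pub fn observe(&mut self, action: &str, now: u64) {
        if self.reset_on.iter().any(|a| *a == action) {
            self.last_reset_at = now;
        }
    }

    /// True once `after_seconds` have elapsed since the last reset.
    pub fn expired(&self, now: u64) -> bool {
        now.saturating_sub(self.last_reset_at) >= self.after_seconds
    }
}
```

With `after_seconds = 1800` and `reset_on = ["RecordProgress"]`, a session that only ever emits a waiting-style action expires at the 30-minute mark, exactly like a session that emitted nothing at all.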

## Proposed Fix

### Short-term: Admission control on CurationJob

Add an `[admission]` block to gate the `Submit` action:

```toml
[admission]
max_concurrent_actions = { "Submit" = 4 }
queue_timeout_seconds = 1800
```

This caps concurrency at the Temper level: only 4 sessions spawn at a time, and the remaining jobs queue until capacity frees up.

### Medium-term: Sandbox provisioner concurrency control

Add a per-tenant semaphore in `sandbox_provisioner` WASM:
- Track active sandbox creation requests
- Cap at 4-5 concurrent E2B API calls
- Queue additional requests with backoff
- Report queuing status so liveness timeouts can be extended
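The steps above could be sketched roughly as follows. This is a minimal, single-threaded model under stated assumptions: `ProvisionGate`, `Admission`, and `backoff_delay` are hypothetical names, and where the state would live (inside the WASM module or on the host side) is left open here. It only shows the bookkeeping for a FIFO concurrency cap plus capped exponential backoff.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Result of asking for permission to call the E2B API (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Admission {
    /// A slot was free; the caller may issue the POST now.
    Granted,
    /// All slots are busy; the caller waits at this queue position (0 = next).
    Queued { position: usize },
}

/// Hypothetical per-tenant gate capping in-flight sandbox creations.
pub struct ProvisionGate {
    max_in_flight: usize,
    in_flight: usize,
    waiting: VecDeque<String>, // session ids in FIFO order
}

impl ProvisionGate {
    pub fn new(max_in_flight: usize) -> Self {
        Self { max_in_flight, in_flight: 0, waiting: VecDeque::new() }
    }

    pub fn request(&mut self, session_id: &str) -> Admission {
        if self.in_flight < self.max_in_flight {
            self.in_flight += 1;
            Admission::Granted
        } else {
            self.waiting.push_back(session_id.to_string());
            Admission::Queued { position: self.waiting.len() - 1 }
        }
    }

    /// Call when a creation finishes, whether it succeeded or failed.
    /// If someone is waiting, the slot is handed to them and their id is returned.
    pub fn release(&mut self) -> Option<String> {
        match self.waiting.pop_front() {
            Some(next) => Some(next), // slot transfers, so in_flight is unchanged
            None => {
                self.in_flight = self.in_flight.saturating_sub(1);
                None
            }
        }
    }

    pub fn in_flight(&self) -> usize {
        self.in_flight
    }

    /// Queue depth, which could be reported so liveness timeouts can be extended.
    pub fn queued(&self) -> usize {
        self.waiting.len()
    }
}

/// Exponential backoff: `base * 2^attempt`, never exceeding `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(cap, |d| d.min(cap))
}
```

Handing the slot directly to the next waiter in `release` keeps admission order FIFO, so a late arrival cannot jump ahead of a queued session.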

### Long-term: Session scheduling with resource awareness

The session scheduler should track:
- Available sandbox capacity (E2B account limits)
- LLM API rate limits (Anthropic tier)
- Active session count per tenant
- And gate session `Start` / `ProcessToolCalls` based on available capacity
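As a rough illustration of the gating decision, the check could look something like the sketch below. Every name here (`CapacitySnapshot`, `GateDecision`, `gate_session_start`, and the fields) is hypothetical, and how the snapshot values would be obtained from E2B or the LLM provider is not addressed.

```rust
/// Point-in-time view of the resources listed above (hypothetical fields).
#[derive(Debug, Clone, Copy)]
pub struct CapacitySnapshot {
    pub sandboxes_free: u32,
    pub llm_requests_free: u32,
    pub tenant_active_sessions: u32,
    pub tenant_session_cap: u32,
}

#[derive(Debug, PartialEq)]
pub enum GateDecision {
    Admit,
    /// Deferred, with a short reason suitable for logging.
    Defer(&'static str),
}

/// Decide whether a session `Start` should proceed right now.
/// The tenant cap is checked first, then sandbox and LLM capacity.
pub fn gate_session_start(c: CapacitySnapshot) -> GateDecision {
    if c.tenant_active_sessions >= c.tenant_session_cap {
        GateDecision::Defer("tenant session cap reached")
    } else if c.sandboxes_free == 0 {
        GateDecision::Defer("no sandbox capacity")
    } else if c.llm_requests_free == 0 {
        GateDecision::Defer("LLM rate limit exhausted")
    } else {
        GateDecision::Admit
    }
}
```

Returning a reason with each deferral would let a queued session report why it is waiting, rather than appearing stuck.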

## Workaround

Submit jobs in batches of 3-4, wait for completion, then submit the next batch. In the previous session, jobs were submitted manually one at a time and all 11 completed.
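If the workaround is scripted, the batching itself is only a chunking step. A minimal sketch, assuming a hypothetical helper `plan_batches`; submitting each batch and waiting for it to finish is left to the caller:

```rust
/// Split job names into consecutive batches of at most `batch_size`.
/// Hypothetical helper: it only plans the batches and submits nothing.
pub fn plan_batches<'a>(jobs: &[&'a str], batch_size: usize) -> Vec<Vec<&'a str>> {
    assert!(batch_size > 0, "batch_size must be at least 1");
    jobs.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}
```

For the 10 jobs in the table above and a batch size of 4, this yields batches of 4, 4, and 2.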
