Problem
Under an 11-entity concurrent-create burst against the File entity type (OpenPaw paw-fs), 2 of 11 actor asks exceeded a 15-second total dispatch budget (P1 retry: 3 attempts × 5s). Observed failure surface:
workspace_provisioner: initial $value write failed (HTTP 409):
{"error":{"code":"ActionRejected",
"message":"actor dispatch exhausted after 3 attempt(s): ask timeout after 5s"}}
This is not an OpenPaw/File-consumer problem. A File $value write is trivial work: spawn the actor → append one event → reply. It should sustain hundreds of concurrent creates in well under a second each. The fact that 11-way concurrency is enough to push some asks past 15s means there is real contention on the shared dispatch path.
This was originally surfaced as "add admission control on File" during the 2026-04-18 incident analysis, but admission control on File would only mask platform slowness; it is the wrong layer for a primitive this cheap. The right answer is to find the actual contention and remove it.
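The 15-second figure is plain retry arithmetic: three attempts, each capped by a 5s ask timeout. A minimal sketch of that loop follows, assuming no backoff between attempts. The names `dispatch_with_retry`, `DispatchExhausted`, and `worst_case_budget` are hypothetical and only illustrate the shape of the failure; this is not the ADR-0048 implementation.

```python
import asyncio

ATTEMPTS = 3             # from the issue: 3 attempts
ASK_TIMEOUT_SECS = 5.0   # from the issue: 5s per ask


class DispatchExhausted(Exception):
    """Raised when every attempt timed out (hypothetical error type)."""


async def dispatch_with_retry(ask, attempts=ATTEMPTS, timeout=ASK_TIMEOUT_SECS):
    """Call ask() up to `attempts` times, bounding each call by `timeout` seconds."""
    for _ in range(attempts):
        try:
            return await asyncio.wait_for(ask(), timeout)
        except asyncio.TimeoutError:
            continue  # assumption: retry immediately, no backoff
    raise DispatchExhausted(
        f"actor dispatch exhausted after {attempts} attempt(s): "
        f"ask timeout after {timeout:g}s"
    )


def worst_case_budget(attempts=ATTEMPTS, timeout=ASK_TIMEOUT_SECS):
    """Longest time a caller can wait before seeing the exhaustion error."""
    return attempts * timeout
```

With the defaults, `worst_case_budget()` is 15.0 seconds, which matches the budget the two failed asks exceeded.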
What we know
- 2026-04-18 Railway prod, single-server deploy, ~11 parallel File creates during workspace_provisioner.
- Retry primitive (ADR-0048) is doing its job — the underlying bottleneck was hidden before retries were wired; now it is visible as the exhaustion message above.
- Cold-start (actor not yet spawned) appears to be the heavy case; subsequent asks to the same actor are fast.
Suspected bottlenecks (unverified — need profiling)
- Actor-registry contention — mutex on the per-tenant registry when 11 cold actors spawn simultaneously.
- Event-store write amplification — shared writer lock or fsync boundary serializes concurrent first-appends.
- OpenTelemetry span export on the hot path — synchronous exporter can add tens of ms under load; plausible but not proven.
- Idempotency cache warm-up — first-touch per actor rebuilds state.
- Cedar policy compilation — if compiled lazily per action, first ask pays the compile cost.
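A back-of-envelope model can help rank these suspects before profiling. The sketch below assumes the worst case: callers queue on a shared resource and each holds it for a constant time. Both functions are hypothetical helpers, and the even spread across shards is an assumption, not a measured property of the registry.

```python
import math


def last_caller_latency(concurrency, hold_secs, shards=1):
    """Time until the last caller finishes when callers queue on `shards`
    independent locks (evenly spread), each holding its lock for hold_secs."""
    per_shard_queue = math.ceil(concurrency / shards)
    return per_shard_queue * hold_secs


def min_hold_to_reach_timeout(concurrency, timeout_secs):
    """Smallest per-caller hold time at which the last caller in a single
    fully serialized queue reaches `timeout_secs`."""
    return timeout_secs / concurrency
```

Under this model, 11 cold spawns on one lock need about 0.45s of serialized work each before the last caller reaches a single 5s timeout. A cost in the tens of milliseconds would account for well under a second on its own, so it would have to compound with another serialized phase to explain the failure. Profiling still has to confirm which phase that is.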
Acceptance criteria
The same 11-entity concurrent burst must complete without any retry exhaustion, and p99 ask latency for File.Create must be sub-second under bursts up to at least 100-way concurrency. Admission control on File is a sanity ceiling, not the mitigation: caps should be hundreds-to-thousands, not tens.
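These criteria can be checked mechanically against a burst run. The following is a sketch only: it assumes latencies are collected in seconds, uses the nearest-rank percentile definition, and `burst_passes` is a hypothetical helper name.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def burst_passes(latencies_secs, exhausted_count, p99_limit_secs=1.0):
    """True when no ask exhausted its retries and p99 latency is under the limit."""
    return exhausted_count == 0 and percentile(latencies_secs, 99) < p99_limit_secs
```

Note that with only 11 samples the nearest-rank p99 is simply the slowest ask, so the 100-way burst is the more meaningful test of the latency criterion.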
Investigation plan
- Enable Datadog Continuous Profiler on `openpaw-server` (CPU + wall-clock). Tag profiler samples with `tenant`, `entity_type` so we can slice by File bursts.
- Add fine-grained spans inside `dispatch_tenant_action_core` — one per phase: `actor_spawn`, `cedar_eval`, `idempotency_check`, `event_append`, `reply_send`. Today the entire dispatch is one span.
- Add contention histograms: `temper_actor_registry_lock_wait_ms`, `temper_event_store_append_wait_ms`, `temper_cedar_eval_ms`.
- Reproduce locally with a 100-way burst harness against a throwaway File entity with no LLM in the loop, so the run has no LLM-path variance and exercises only the dispatch hot path.
- Profile, identify the phase that dominates p99, and fix it at the source (shard the registry, batch event appends, move Cedar compilation to spec-load, make OTel export async, etc.).
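The harness and per-phase timing steps above can be prototyped before touching the server. The sketch below is a stand-alone simulation, not the real dispatch path: `fake_dispatch` stands in for `dispatch_tenant_action_core`, the single `asyncio.Lock` is an assumed model of registry contention, and `PhaseTimings` is a hypothetical substitute for real tracing spans. Only the phase names come from the plan.

```python
import asyncio
import time
from collections import defaultdict
from contextlib import contextmanager

# Phase names taken from the investigation plan.
PHASES = ["actor_spawn", "cedar_eval", "idempotency_check", "event_append", "reply_send"]


class PhaseTimings:
    """Collects wall-clock durations per phase (stand-in for tracing spans)."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Duration includes any time spent waiting on locks inside the phase.
            self.samples[name].append(time.perf_counter() - start)

    def slowest_phase(self):
        """Phase with the largest single observed duration."""
        return max(self.samples, key=lambda name: max(self.samples[name]))


async def fake_dispatch(timings, registry_lock, spawn_hold_secs):
    # Assumed contention model: cold spawns serialize on one shared lock.
    with timings.phase("actor_spawn"):
        async with registry_lock:
            await asyncio.sleep(spawn_hold_secs)
    # Remaining phases are placeholders that only yield to the event loop.
    for name in PHASES[1:]:
        with timings.phase(name):
            await asyncio.sleep(0)


async def run_burst(concurrency, spawn_hold_secs=0.002):
    """Fire `concurrency` simulated creates at once and return their timings."""
    timings = PhaseTimings()
    registry_lock = asyncio.Lock()
    await asyncio.gather(
        *(fake_dispatch(timings, registry_lock, spawn_hold_secs) for _ in range(concurrency))
    )
    return timings
```

Running `asyncio.run(run_burst(100))` and calling `slowest_phase()` on the result shows how a single serialized phase surfaces once the dispatch is split into per-phase measurements. In the real harness the same breakdown would come from the new spans and contention histograms.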
Related
- Incident 2026-04-18: Katagami regeneration burst, 4/21 sessions failed (2 from this issue, 2 from #(handler deadline issue — link on create)).
- ADR-0048 (retry + error taxonomy) — masks this symptom but cannot fix it.
- ADR-0051 (admission control) — intended for expensive operations (sandboxes, provider calls), not cheap primitives like File. Using admission here would paper over the platform bug.
Out of scope
- Raising `TEMPER_ACTION_TIMEOUT_SECS` past 5s.
- Increasing retry attempts past 3.
- Declaring `[admission]` on File as a "fix".
These are explicitly band-aids.