diff --git a/CLAUDE.md b/CLAUDE.md index e951bc45..d707c55f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -394,6 +394,10 @@ These are prescriptive rules not derivable from reading the code: - **`sam-local-codex` is the first production ADR-005 wrapper agent** (live 2026-04-27). Runs on user laptop via `commonly agent run sam-local-codex` (nohup'd), polls `https://api-dev.commonly.me`, spawns local codex CLI 0.125.0. Boot pod: `Codex Hub` `69ef02b036b742e2e2c0c4af`. To revive if dead: `nohup commonly agent run sam-local-codex > ~/.commonly/logs/sam-local-codex.log 2>&1 & disown`. To re-attach from scratch: `commonly agent attach codex --pod 69ef02b036b742e2e2c0c4af --name sam-local-codex --instance dev`. +- **`cloud-codex` runtime — cluster-side variant of sam-local-codex** (live 2026-05-15, PRs #362–#369). `k8s/helm/commonly/templates/agents/cloud-codex-deployment.yaml` provisions one Deployment + PVC per agent under `agents.cloudCodex.agents.` in values. Pod runs `commonly agent run ` + codex CLI inside the cluster. Codex CLI is configured (via `~/.codex/config.toml`) to call **LiteLLM**, not chatgpt.com directly — model_provider=litellm, base_url=`http://litellm:4000/v1`, wire_api=`responses`, env_key=`LITELLM_API_KEY`. Same auth surface as every openclaw moltbot agent (single rotator, single quota pool, single observability). Use `agentName=codex` (in AGENT_TYPES) — `cloud-codex` agentName is NOT in AGENT_TYPES so the cleanup sweep marks it stale. First production agent: Cody (`agentName=codex`, `instanceId=cody`), live 2026-05-15. + +- **ChatGPT OAuth is cluster-IP-bound — never device-auth elsewhere.** ChatGPT/Codex's server-side session table binds OAuth sessions to the IP/device that completed device-auth. A token device-auth'd on a laptop and uploaded to the cluster gets `401 token_invalidated` on first cluster call, regardless of JWT exp (confirmed empirically 2026-05-14). The fix is to device-auth from INSIDE the cluster: the LiteLLM pod has a `codex-cli` sidecar (PR #365) — operator runs `kubectl exec -n commonly-dev -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh ` for each account; resulting `auth.json` lands on the `litellm-chatgpt-auth` PVC. Rotator prefers those pod-side `/chatgpt-auth/auth-{1,2,3}.json` files over env-var-fed legacy tokens (`OPENAI_CODEX_ACCESS_TOKEN`*), which are now considered dead. Never `codex login --device-auth` an account on your laptop if that account is in cluster rotation — invalidates the cluster session immediately. Currently account-1 + account-2 in rotation; account-3 reserved as operator's laptop-personal. + - **openclaw v2026.3.7+ gateway ships `/app/dist/` only**, not `/app/src/`. Imports from `../../../src/...` crash. Use `openclaw/plugin-sdk` instead. - **ESO owns `api-keys` secret.** Direct `kubectl patch` is overwritten on next 1h ESO sync. Always update GCP SM first, then force-sync: `kubectl annotate externalsecret api-keys force-sync=$(date +%s) -n commonly-dev --overwrite`. diff --git a/docs/adr/ADR-014-cloud-codex-runtime-and-shared-auth-surface.md b/docs/adr/ADR-014-cloud-codex-runtime-and-shared-auth-surface.md new file mode 100644 index 00000000..45c5f8c1 --- /dev/null +++ b/docs/adr/ADR-014-cloud-codex-runtime-and-shared-auth-surface.md @@ -0,0 +1,91 @@ +# ADR-014: Cloud-Codex Runtime and Shared LiteLLM Auth Surface + +**Status:** Accepted +**Date:** 2026-05-15 +**Supersedes:** none +**Relates to:** [ADR-004 CAP](ADR-004-commonly-agent-protocol.md), [ADR-005 Local CLI Wrapper Driver](ADR-005-local-cli-wrapper-driver.md), [ADR-008 Agent Environment Primitive](ADR-008-agent-environment-primitive.md) + +## Context + +ADR-005 introduced the local-CLI wrapper driver: `commonly agent attach codex --pod ... --instance dev` on an operator laptop polls CAP and shells out to the local `codex` binary. The first production wrapper agent — `sam-local-codex` — proved the pattern but exposed a structural limit: it required an operator's laptop to be online. Anyone wanting a "real" cloud agent on the codex runtime had no path. + +Three forces converged in May 2026: + +1. **Demand for a cluster-resident codex agent.** Cody was meant to be a permanent fixture in the Codex Hub pod, not tethered to a laptop. +2. **ChatGPT OAuth is cluster-IP-bound.** Empirically confirmed 2026-05-14: ChatGPT/Codex binds OAuth sessions server-side to the device that completed `codex login --device-auth`. Tokens device-auth'd on a laptop and uploaded to the cluster (via GCP SM → ExternalSecret → `OPENAI_CODEX_ACCESS_TOKEN`*) returned `401 token_invalidated` on first cluster call regardless of JWT exp. Structural, not transient. +3. **Multi-runtime coexistence is a load-bearing product invariant.** "Commonly doesn't run your agent — your agent connects to Commonly" (CLAUDE.md product vision). Collapsing Cody onto openclaw moltbot to "share auth" would have violated the core positioning. + +The naive options each failed: + +- **Per-agent cloud codex pod, per-agent `codex login`**: every new pod would need its own device-auth ceremony. Operator-toil scales linearly with agent count. +- **Centralize on a single runtime (openclaw moltbot)**: collapses the multi-runtime invariant. We explicitly want codex CLI's sandbox / tool-use / session semantics alongside moltbot. +- **Keep doing laptop-device-auth + upload**: dead-on-arrival under cluster-IP binding. + +## Decision + +**Separate the runtime from the auth surface.** Runtime is *what code executes the agent loop* (codex CLI, openclaw moltbot, future). Auth surface is *what makes the outbound HTTPS call to ChatGPT*. The two are orthogonal. + +### Concretely + +1. **New runtime adapter: `cloud-codex`.** `k8s/helm/commonly/templates/agents/cloud-codex-deployment.yaml` provisions a per-agent Deployment + PVC under `.Values.agents.cloudCodex.agents.`. Each pod runs the same `commonly agent attach codex` flow a laptop user runs — inside the cluster. PVC mounts at `/state` and holds CAP token + `~/.codex/config.toml`. Initialized with the CLI + `@openai/codex` via an init container. + +2. **Codex CLI does NOT call chatgpt.com directly.** Each cloud-codex pod's `~/.codex/config.toml` declares LiteLLM as the model provider: + + ```toml + model = "gpt-5.4" + model_provider = "litellm" + [model_providers.litellm] + name = "LiteLLM" + base_url = "http://litellm:4000/v1" + wire_api = "responses" + env_key = "LITELLM_API_KEY" + ``` + + `LITELLM_API_KEY` is a per-agent LiteLLM virtual key injected from a k8s Secret. The codex CLI's sandbox, tool-use, session, and prompt semantics are preserved — only the HTTPS layer is redirected. + +3. **LiteLLM is the single ChatGPT-OAuth holder for the cluster.** A new `codex-cli` sidecar on the LiteLLM Deployment ships `@openai/codex` for *operator* use. The operator runs: + + ```bash + kubectl exec -n commonly-dev -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh + ``` + + …for each ChatGPT account to be in cluster rotation. Device-auth originates from inside the cluster pod, so the server-side IP binding works *for* us instead of against us. The resulting `auth.json` lands on a new persistent volume — `litellm-chatgpt-auth` (RWO 1Gi PVC) — as `/chatgpt-auth/auth-.json`. + +4. **The codex-auth-rotator prefers pod-side files.** `get_candidates()` first reads `/chatgpt-auth/auth-N.json` files; only falls back to env-var-fed legacy tokens (`OPENAI_CODEX_ACCESS_TOKEN`*) if no pod-side files exist. The legacy env-var path is retained for backward-compat but is dead-on-arrival from the cluster's POV. + +5. **All runtimes share this one auth surface.** OpenClaw moltbot agents (Nova, Pixel, Liz, …) and cloud-codex agents (Cody, …) both route through the same LiteLLM. One device-auth chain serves the whole cluster. + +### Identity rule + +Cloud-codex agents register as `agentName: 'codex'` (in `agentIdentityService.AGENT_TYPES` → `runtime: 'codex'`) with `instanceId` varying per agent. **`agentName: 'cloud-codex'` is NOT in AGENT_TYPES** — the cleanup sweep would mark it stale. The Helm value `registryAgentName` should always be `codex` for cluster-side codex agents. From V2 inspector's POV they read as `runtimeType: 'codex'` + `host: 'cloud'`, identical to a future cloud-managed codex offering. + +## Consequences + +### Positive + +- **One device-auth ceremony covers the whole cluster.** Adding a new cloud-codex agent requires zero auth work — just helm values + a token+key secret pair. +- **Multi-runtime invariant preserved.** Cody stays a codex-runtime agent. Future runtimes (gemini, claude-code, custom) can follow the same pattern: keep your runtime, share LiteLLM. +- **Operator runbook is short.** `kubectl exec ... auth-login.sh N` is the entire ceremony per account. No GCP SM patching, no ExternalSecret force-syncs, no helm upgrades. +- **PVC survives helm upgrades.** Pod-side `auth-N.json` files are not wiped on every deploy. + +### Negative + +- **PVC is RWO single-writer.** LiteLLM Deployment must use `strategy.type: Recreate` (not RollingUpdate). Brief downtime on every deploy. +- **Account 3 is reserved as operator-personal.** ChatGPT's IP binding means the operator cannot use account-3 from a laptop AND have it in cluster rotation. We give up one rotation slot for operator dev ergonomics. Acceptable while team is small; revisit at higher scale. +- **The legacy env-var-fed path is dead but still wired.** `OPENAI_CODEX_ACCESS_TOKEN[_N]` env vars still exist in deployment YAML and GCP SM. They're a no-op now but add noise. Cleanup is a follow-up — not load-bearing. +- **Codex CLI's reasoning/responses semantics depend on LiteLLM's `chatgpt/` provider.** If LiteLLM drops or breaks `wire_api=responses`, every cloud-codex agent breaks. Mitigation: LiteLLM is already a load-bearing dep for moltbot agents; same blast radius. + +### Neutral + +- **Cloud-codex pods do NOT need their own device-auth.** This is correct and intentional — auth lives at the LiteLLM layer. +- **The pattern generalizes.** A `cloud-claude-code` or `cloud-gemini` agent would follow the same template: per-agent Deployment + PVC, config the CLI to call LiteLLM, share the cluster auth surface. + +## Operator Runbook + +See `.claude/skills/llm-routing/SKILL.md` "Codex Multi-Account Rotation" and `.claude/skills/prod-agent-ops/SKILL.md` section O for the live commands. Skill files are kept up-to-date; this ADR captures the *why*. + +## Open Follow-ups + +- Retire the env-var-fed legacy path entirely (`codex-auth-seed` init container + `OPENAI_CODEX_ACCESS_TOKEN[_N]` secrets) once pod-side files have been stable for one cycle. +- If LiteLLM ever needs to scale horizontally, the RWO PVC becomes the binding constraint — would need to move `auth-N.json` to a ReadWriteMany backing store or a shared secret manager call path. +- ADR-005 should be amended to note this cluster-side variant of the wrapper pattern. diff --git a/k8s/helm/commonly/templates/agents/cloud-codex-deployment.yaml b/k8s/helm/commonly/templates/agents/cloud-codex-deployment.yaml index 369b00db..0a58edc9 100644 --- a/k8s/helm/commonly/templates/agents/cloud-codex-deployment.yaml +++ b/k8s/helm/commonly/templates/agents/cloud-codex-deployment.yaml @@ -147,25 +147,35 @@ spec: EOF chmod 600 /state/.commonly/tokens/${COMMONLY_AGENT_NAME}.json - # Wait for codex auth.json. ChatGPT binds OAuth to the IP that - # ran device-auth; running `codex login --device-auth` INSIDE - # this pod is the whole point. If auth.json is missing, sit - # idle and log clear instructions so the operator's first - # `kubectl exec` shows them exactly what to do. - if [ ! -s /state/.codex/auth.json ]; then - echo "[cloud-codex] no codex auth.json on PVC — waiting for device-auth" - echo "[cloud-codex] run this once to bind the cluster session:" - echo "[cloud-codex] kubectl exec -n {{ include "commonly.namespace" $ }} -it deploy/cloud-codex-{{ $name }} -- codex login --device-auth" - echo "[cloud-codex] (after completing in browser, the pod will resume on next reboot)" - # Sleep loop so operator can exec in. Restart-on-success is the - # cleanest UX — when auth.json appears, we want to re-enter the - # main path, and the simplest way to do that is a fresh boot. - while [ ! -s /state/.codex/auth.json ]; do sleep 10; done - echo "[cloud-codex] auth.json present — restarting to enter run loop" - exit 0 + # Seed ~/.codex/config.toml so codex CLI routes its model calls + # through LiteLLM instead of straight to chatgpt.com. The LiteLLM + # pod already holds cluster-IP-bound auth.json (rotator-managed, + # operator-device-auth'd), so this agent shares the same auth + # surface as every other openclaw moltbot agent — single quota + # pool, single rotation, single observability. + # + # Runtime stays codex: codex CLI still spawns, still sandboxes, + # still owns tool use and sessions. Only the HTTPS layer is proxied. + cat > /state/.codex/config.toml <.litellmBaseUrl. + litellmBaseUrl: http://litellm:4000/v1 # Per-agent map. Each key is the agent name that maps to an # AgentInstallation already created via /api/registry/install. The # token secret should be pre-populated with the cm_agent_* runtime