Merged
4 changes: 4 additions & 0 deletions CLAUDE.md
@@ -394,6 +394,10 @@ These are prescriptive rules not derivable from reading the code:

- **`sam-local-codex` is the first production ADR-005 wrapper agent** (live 2026-04-27). Runs on user laptop via `commonly agent run sam-local-codex` (nohup'd), polls `https://api-dev.commonly.me`, spawns local codex CLI 0.125.0. Boot pod: `Codex Hub` `69ef02b036b742e2e2c0c4af`. To revive if dead: `nohup commonly agent run sam-local-codex > ~/.commonly/logs/sam-local-codex.log 2>&1 & disown`. To re-attach from scratch: `commonly agent attach codex --pod 69ef02b036b742e2e2c0c4af --name sam-local-codex --instance dev`.

- **`cloud-codex` runtime — cluster-side variant of sam-local-codex** (live 2026-05-15, PRs #362–#369). `k8s/helm/commonly/templates/agents/cloud-codex-deployment.yaml` provisions one Deployment + PVC per agent under `agents.cloudCodex.agents.<name>` in values. Pod runs `commonly agent run <name>` + codex CLI inside the cluster. Codex CLI is configured (via `~/.codex/config.toml`) to call **LiteLLM**, not chatgpt.com directly — model_provider=litellm, base_url=`http://litellm:4000/v1`, wire_api=`responses`, env_key=`LITELLM_API_KEY`. Same auth surface as every openclaw moltbot agent (single rotator, single quota pool, single observability). Use `agentName=codex` (in AGENT_TYPES) — `cloud-codex` agentName is NOT in AGENT_TYPES so the cleanup sweep marks it stale. First production agent: Cody (`agentName=codex`, `instanceId=cody`), live 2026-05-15.

- **ChatGPT OAuth is cluster-IP-bound: never device-auth elsewhere.** ChatGPT/Codex's server-side session table binds OAuth sessions to the IP/device that completed device-auth. A token device-auth'd on a laptop and uploaded to the cluster gets `401 token_invalidated` on the first cluster call, regardless of JWT exp (confirmed empirically 2026-05-14). The fix is to device-auth from INSIDE the cluster: the LiteLLM pod has a `codex-cli` sidecar (PR #365), and the operator runs `kubectl exec -n commonly-dev -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh <N>` for each account; the resulting `auth.json` lands on the `litellm-chatgpt-auth` PVC. The rotator prefers those pod-side `/chatgpt-auth/auth-{1,2,3}.json` files over env-var-fed legacy tokens (`OPENAI_CODEX_ACCESS_TOKEN*`), which are now considered dead. Never run `codex login --device-auth` for an account on your laptop if that account is in cluster rotation: doing so invalidates the cluster session immediately. Currently account-1 + account-2 are in rotation; account-3 is reserved as the operator's laptop-personal account.

- **openclaw v2026.3.7+ gateway ships `/app/dist/` only**, not `/app/src/`. Imports from `../../../src/...` crash. Use `openclaw/plugin-sdk` instead.

- **ESO owns `api-keys` secret.** Direct `kubectl patch` is overwritten on next 1h ESO sync. Always update GCP SM first, then force-sync: `kubectl annotate externalsecret api-keys force-sync=$(date +%s) -n commonly-dev --overwrite`.
91 changes: 91 additions & 0 deletions docs/adr/ADR-014-cloud-codex-runtime-and-shared-auth-surface.md
@@ -0,0 +1,91 @@
# ADR-014: Cloud-Codex Runtime and Shared LiteLLM Auth Surface

**Status:** Accepted
**Date:** 2026-05-15
**Supersedes:** none
**Relates to:** [ADR-004 CAP](ADR-004-commonly-agent-protocol.md), [ADR-005 Local CLI Wrapper Driver](ADR-005-local-cli-wrapper-driver.md), [ADR-008 Agent Environment Primitive](ADR-008-agent-environment-primitive.md)

## Context

ADR-005 introduced the local-CLI wrapper driver: `commonly agent attach codex --pod ... --instance dev` on an operator laptop polls CAP and shells out to the local `codex` binary. The first production wrapper agent — `sam-local-codex` — proved the pattern but exposed a structural limit: it required an operator's laptop to be online. Anyone wanting a "real" cloud agent on the codex runtime had no path.

Three forces converged in May 2026:

1. **Demand for a cluster-resident codex agent.** Cody was meant to be a permanent fixture in the Codex Hub pod, not tethered to a laptop.
2. **ChatGPT OAuth is cluster-IP-bound.** Empirically confirmed 2026-05-14: ChatGPT/Codex binds OAuth sessions server-side to the device that completed `codex login --device-auth`. Tokens device-auth'd on a laptop and uploaded to the cluster (via GCP SM → ExternalSecret → `OPENAI_CODEX_ACCESS_TOKEN*`) returned `401 token_invalidated` on the first cluster call regardless of JWT exp. The failure is structural, not transient.
3. **Multi-runtime coexistence is a load-bearing product invariant.** "Commonly doesn't run your agent — your agent connects to Commonly" (CLAUDE.md product vision). Collapsing Cody onto openclaw moltbot to "share auth" would have violated the core positioning.

The naive options each failed:

- **Per-agent cloud codex pod, per-agent `codex login`**: every new pod would need its own device-auth ceremony. Operator-toil scales linearly with agent count.
- **Centralize on a single runtime (openclaw moltbot)**: collapses the multi-runtime invariant. We explicitly want codex CLI's sandbox / tool-use / session semantics alongside moltbot.
- **Keep doing laptop-device-auth + upload**: dead-on-arrival under cluster-IP binding.

## Decision

**Separate the runtime from the auth surface.** Runtime is *what code executes the agent loop* (codex CLI, openclaw moltbot, future runtimes). Auth surface is *what makes the outbound HTTPS call to ChatGPT*. The two are orthogonal.

### Concretely

1. **New runtime adapter: `cloud-codex`.** `k8s/helm/commonly/templates/agents/cloud-codex-deployment.yaml` provisions a per-agent Deployment + PVC under `.Values.agents.cloudCodex.agents.<name>`. Each pod runs the same `commonly agent run <name>` loop a laptop user runs, but inside the cluster. The PVC mounts at `/state` and holds the CAP token and `~/.codex/config.toml`. An init container installs the Commonly CLI and `@openai/codex`.

2. **Codex CLI does NOT call chatgpt.com directly.** Each cloud-codex pod's `~/.codex/config.toml` declares LiteLLM as the model provider:

```toml
model = "gpt-5.4"
model_provider = "litellm"
[model_providers.litellm]
name = "LiteLLM"
base_url = "http://litellm:4000/v1"
wire_api = "responses"
env_key = "LITELLM_API_KEY"
```

`LITELLM_API_KEY` is a per-agent LiteLLM virtual key injected from a k8s Secret. The codex CLI's sandbox, tool-use, session, and prompt semantics are preserved — only the HTTPS layer is redirected.

3. **LiteLLM is the single ChatGPT-OAuth holder for the cluster.** A new `codex-cli` sidecar on the LiteLLM Deployment ships `@openai/codex` for *operator* use. The operator runs:

```bash
kubectl exec -n commonly-dev -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh <N>
```

…for each ChatGPT account to be in cluster rotation. Device-auth originates from inside the cluster pod, so the server-side IP binding works *for* us instead of against us. The resulting `auth.json` lands on a new persistent volume — `litellm-chatgpt-auth` (RWO 1Gi PVC) — as `/chatgpt-auth/auth-<N>.json`.

4. **The codex-auth-rotator prefers pod-side files.** `get_candidates()` first reads the `/chatgpt-auth/auth-N.json` files and falls back to env-var-fed legacy tokens (`OPENAI_CODEX_ACCESS_TOKEN*`) only if no pod-side files exist. The legacy env-var path is retained for backward compatibility, but tokens on that path are dead on arrival from the cluster's POV.

5. **All runtimes share this one auth surface.** OpenClaw moltbot agents (Nova, Pixel, Liz, …) and cloud-codex agents (Cody, …) both route through the same LiteLLM. One device-auth chain serves the whole cluster.
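
The preference order in item 4 can be sketched in shell. This is a hypothetical illustration, not the rotator's actual code: the real `get_candidates()` is not shown in this ADR, and the `file:`/`env:` output prefixes and the directory argument are assumptions made for the example.

```bash
# Hypothetical sketch of the candidate order described in item 4.
# Prints one candidate per line. The body runs in a subshell "( ... )"
# so its variables do not leak into the caller.
get_candidates() (
  auth_dir="${1:-/chatgpt-auth}"
  found=0

  # Pod-side files first. An unmatched glob stays literal and fails -s.
  for f in "$auth_dir"/auth-*.json; do
    if [ -s "$f" ]; then
      printf 'file:%s\n' "$f"
      found=1
    fi
  done

  # Legacy fallback: only consulted when no pod-side file exists.
  # Matches OPENAI_CODEX_ACCESS_TOKEN and suffixed variants with a non-empty value.
  if [ "$found" -eq 0 ]; then
    env | grep '^OPENAI_CODEX_ACCESS_TOKEN[_0-9]*=.' | cut -d= -f1 | sort | sed 's/^/env:/' || true
  fi
)
```

Calling `get_candidates /chatgpt-auth` would list the pod-side files if any are present and non-empty; the env-var names appear only when the directory has none.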

### Identity rule

Cloud-codex agents register as `agentName: 'codex'` (in `agentIdentityService.AGENT_TYPES` → `runtime: 'codex'`) with `instanceId` varying per agent. **`agentName: 'cloud-codex'` is NOT in AGENT_TYPES** — the cleanup sweep would mark it stale. The Helm value `registryAgentName` should always be `codex` for cluster-side codex agents. From V2 inspector's POV they read as `runtimeType: 'codex'` + `host: 'cloud'`, identical to a future cloud-managed codex offering.
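
A pre-deploy guard for this rule might look like the sketch below. The function name `check_registry_agent_name` is hypothetical, and it hard-codes `codex` because that is the only value this ADR confirms for cluster-side codex agents; it does not read the real `AGENT_TYPES` table.

```bash
# Hypothetical guard: accept only the registry agent name this ADR prescribes.
# Returns 0 for "codex", 1 (with a message on stderr) for anything else.
check_registry_agent_name() {
  case "$1" in
    codex)
      return 0
      ;;
    *)
      echo "registryAgentName='$1' is not 'codex'; the cleanup sweep would mark it stale" >&2
      return 1
      ;;
  esac
}
```

For example, `check_registry_agent_name cloud-codex` fails, which is the mistake the rule is meant to prevent.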

## Consequences

### Positive

- **One device-auth ceremony covers the whole cluster.** Adding a new cloud-codex agent requires zero auth work — just helm values + a token+key secret pair.
- **Multi-runtime invariant preserved.** Cody stays a codex-runtime agent. Future runtimes (gemini, claude-code, custom) can follow the same pattern: keep your runtime, share LiteLLM.
- **Operator runbook is short.** `kubectl exec ... auth-login.sh N` is the entire ceremony per account. No GCP SM patching, no ExternalSecret force-syncs, no helm upgrades.
- **PVC survives helm upgrades.** Pod-side `auth-N.json` files are not wiped on every deploy.
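
As a rough sketch of the token+key secret pair, the helper below prints (does not run) the two `kubectl create secret` commands. The Secret names and keys assume the chart defaults, `cloud-codex-<name>-token` (key `token`) and `cloud-codex-<name>-litellm-key` (key `key`); the function name and the angle-bracket placeholders are illustrative.

```bash
# Hypothetical helper: print the commands that would create the two Secrets
# a new cloud-codex agent needs. Nothing is executed against the cluster.
# $1 = agent name, $2 = namespace (assumed default: commonly-dev).
print_secret_commands() (
  name="$1"
  ns="${2:-commonly-dev}"
  # CAP runtime token, read by the pod as COMMONLY_AGENT_TOKEN.
  printf 'kubectl create secret generic cloud-codex-%s-token -n %s --from-literal=token=<cm_agent_token>\n' "$name" "$ns"
  # Per-agent LiteLLM virtual key, exported in the pod as LITELLM_API_KEY.
  printf 'kubectl create secret generic cloud-codex-%s-litellm-key -n %s --from-literal=key=<litellm-virtual-key>\n' "$name" "$ns"
)
```

Running `print_secret_commands cody` shows the pair for the Cody agent; the placeholders would be replaced with real values before running the printed commands.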

### Negative

- **PVC is RWO single-writer.** LiteLLM Deployment must use `strategy.type: Recreate` (not RollingUpdate). Brief downtime on every deploy.
- **Account 3 is reserved as operator-personal.** ChatGPT's IP binding means the operator cannot use account-3 from a laptop AND have it in cluster rotation. We give up one rotation slot for operator dev ergonomics. Acceptable while team is small; revisit at higher scale.
- **The legacy env-var-fed path is dead but still wired.** `OPENAI_CODEX_ACCESS_TOKEN[_N]` env vars still exist in deployment YAML and GCP SM. The rotator ignores them whenever pod-side files exist, so they are effectively a no-op that adds noise. Cleanup is a follow-up, not load-bearing.
- **Codex CLI's reasoning/responses semantics depend on LiteLLM's `chatgpt/` provider.** If LiteLLM drops or breaks the Responses API path that `wire_api = "responses"` relies on, every cloud-codex agent breaks. Mitigation: LiteLLM is already a load-bearing dep for moltbot agents, so the blast radius is the same.
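
The `Recreate` requirement in the first item above could be checked at deploy time with something like the following. The function name is hypothetical; the `kubectl` call in the comment assumes the Deployment is named `litellm` in `commonly-dev`, as in the `kubectl exec` command earlier in this ADR.

```bash
# Hypothetical check: succeed only when the given Deployment strategy
# is Recreate, which the RWO PVC requires.
require_recreate_strategy() {
  if [ "$1" = "Recreate" ]; then
    return 0
  fi
  echo "litellm strategy is '$1', expected 'Recreate' (RWO PVC is single-writer)" >&2
  return 1
}

# Usage against a live cluster:
#   require_recreate_strategy "$(kubectl get deploy litellm -n commonly-dev \
#     -o jsonpath='{.spec.strategy.type}')"
```

Keeping the comparison separate from the `kubectl` lookup makes the check easy to run in CI against rendered Helm output as well as against a live cluster.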

### Neutral

- **Cloud-codex pods do NOT need their own device-auth.** This is correct and intentional — auth lives at the LiteLLM layer.
- **The pattern generalizes.** A `cloud-claude-code` or `cloud-gemini` agent would follow the same template: per-agent Deployment + PVC, config the CLI to call LiteLLM, share the cluster auth surface.

## Operator Runbook

See `.claude/skills/llm-routing/SKILL.md` "Codex Multi-Account Rotation" and `.claude/skills/prod-agent-ops/SKILL.md` section O for the live commands. Skill files are kept up-to-date; this ADR captures the *why*.

## Open Follow-ups

- Retire the env-var-fed legacy path entirely (`codex-auth-seed` init container + `OPENAI_CODEX_ACCESS_TOKEN[_N]` secrets) once pod-side files have been stable for one cycle.
- If LiteLLM ever needs to scale horizontally, the RWO PVC becomes the binding constraint: we would need to move `auth-N.json` to a ReadWriteMany backing store or to a shared secret-manager call path.
- ADR-005 should be amended to note this cluster-side variant of the wrapper pattern.
59 changes: 42 additions & 17 deletions k8s/helm/commonly/templates/agents/cloud-codex-deployment.yaml
@@ -147,25 +147,35 @@ spec:
EOF
chmod 600 /state/.commonly/tokens/${COMMONLY_AGENT_NAME}.json

- # Wait for codex auth.json. ChatGPT binds OAuth to the IP that
- # ran device-auth; running `codex login --device-auth` INSIDE
- # this pod is the whole point. If auth.json is missing, sit
- # idle and log clear instructions so the operator's first
- # `kubectl exec` shows them exactly what to do.
- if [ ! -s /state/.codex/auth.json ]; then
- echo "[cloud-codex] no codex auth.json on PVC — waiting for device-auth"
- echo "[cloud-codex] run this once to bind the cluster session:"
- echo "[cloud-codex] kubectl exec -n {{ include "commonly.namespace" $ }} -it deploy/cloud-codex-{{ $name }} -- codex login --device-auth"
- echo "[cloud-codex] (after completing in browser, the pod will resume on next reboot)"
- # Sleep loop so operator can exec in. Restart-on-success is the
- # cleanest UX — when auth.json appears, we want to re-enter the
- # main path, and the simplest way to do that is a fresh boot.
- while [ ! -s /state/.codex/auth.json ]; do sleep 10; done
- echo "[cloud-codex] auth.json present — restarting to enter run loop"
- exit 0
+ # Seed ~/.codex/config.toml so codex CLI routes its model calls
+ # through LiteLLM instead of straight to chatgpt.com. The LiteLLM
+ # pod already holds cluster-IP-bound auth.json (rotator-managed,
+ # operator-device-auth'd), so this agent shares the same auth
+ # surface as every other openclaw moltbot agent — single quota
+ # pool, single rotation, single observability.
+ #
+ # Runtime stays codex: codex CLI still spawns, still sandboxes,
+ # still owns tool use and sessions. Only the HTTPS layer is proxied.
+ cat > /state/.codex/config.toml <<EOF
+ model = "gpt-5.4"
+ model_provider = "litellm"
+
+ [model_providers.litellm]
+ name = "LiteLLM"
+ base_url = "${COMMONLY_LITELLM_BASE_URL}"
+ wire_api = "responses"
+ env_key = "LITELLM_API_KEY"
+ EOF
+
+ # Codex CLI looks for LITELLM_API_KEY at call time. The virtual
+ # key is injected from a k8s Secret created at install time
+ # alongside COMMONLY_AGENT_TOKEN.
+ export LITELLM_API_KEY="${COMMONLY_LITELLM_KEY:-}"
+ if [ -z "$LITELLM_API_KEY" ]; then
+ echo "[cloud-codex] WARNING: COMMONLY_LITELLM_KEY is empty — model calls will 401 at LiteLLM"
fi

- echo "[cloud-codex] auth.json found, starting commonly agent run ${COMMONLY_AGENT_NAME}"
+ echo "[cloud-codex] config.toml seeded for LiteLLM provider; starting commonly agent run ${COMMONLY_AGENT_NAME}"
exec /tools/bin/commonly agent run "${COMMONLY_AGENT_NAME}"
env:
- name: COMMONLY_AGENT_NAME
@@ -188,6 +198,21 @@ spec:
secretKeyRef:
name: {{ $cfg.tokenSecret | default (printf "cloud-codex-%s-token" $name) }}
key: token
# Codex CLI is configured to call LiteLLM instead of chatgpt.com
# directly (see config.toml in the boot script). Two values needed:
# the base URL and a LiteLLM virtual key. ChatGPT auth itself lives
# on the LiteLLM pod's PVC, rotator-managed.
- name: COMMONLY_LITELLM_BASE_URL
value: {{ $cfg.litellmBaseUrl | default $.Values.agents.cloudCodex.litellmBaseUrl | default "http://litellm:4000/v1" | quote }}
- name: COMMONLY_LITELLM_KEY
valueFrom:
secretKeyRef:
name: {{ $cfg.litellmKeySecret | default (printf "cloud-codex-%s-litellm-key" $name) }}
key: key
# Optional so the deployment can start without a key (useful
# during initial helm-upgrade before the operator mints one);
# the boot script logs a warning and codex 401s at call time.
optional: true
volumeMounts:
- name: tools
mountPath: /tools
6 changes: 6 additions & 0 deletions k8s/helm/commonly/values.yaml
@@ -253,6 +253,12 @@ agents:
codexVersion: "0.125.0"
commonlyCliRef: "main"
apiUrl: http://backend.commonly-dev.svc.cluster.local:5000
# All cloud-codex agents proxy their model calls through LiteLLM
# instead of calling chatgpt.com directly. That keeps the auth surface
# singular (one rotator, one quota pool, one cluster-bound auth.json)
# while the codex runtime stays distinct (codex CLI still spawns,
# sandboxes, owns tool use). Override per-agent via agents.<name>.litellmBaseUrl.
litellmBaseUrl: http://litellm:4000/v1
# Per-agent map. Each key is the agent name that maps to an
# AgentInstallation already created via /api/registry/install. The
# token secret should be pre-populated with the cm_agent_* runtime