fix(cloud-codex): tokens/<name>.json + drop --instance + ca-certificates #363

Merged
lilyshen0722 merged 3 commits into main from sprint/cloud-codex-bootfix on May 14, 2026

@lilyshen0722
Contributor

Summary

Three boot-time issues caught running the first cloud-codex-cody pod:

  1. `commonly agent run` doesn't take `--instance`: the CLI exits with "unknown option". The token and instance resolution happens via the per-agent token file, not flags on `run`.
  2. Wrong token file path: the container was writing `~/.commonly/config.json` with an `agentTokens` block, but the CLI's `loadAgentToken()` reads `~/.commonly/tokens/<name>.json`.
  3. Missing ca-certificates: `node:22-bookworm-slim` ships without TLS roots, so codex CLI's HTTPS to auth.openai.com / api.openai.com fails ("error sending request"). Install at boot (idempotent).
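
To make item 2 concrete, here is a minimal sketch of the one-file-per-agent layout. The helper names `tokenFilePath` and `writeTokenFile` are hypothetical, and the `{ token }` JSON body is an assumption; the real shape is whatever `saveAgentToken` in cli/src/commands/agent.js writes.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Per-agent token file: <home>/.commonly/tokens/<name>.json
function tokenFilePath(name, home = os.homedir()) {
  return path.join(home, '.commonly', 'tokens', `${name}.json`);
}

// Hypothetical writer. The { token } body is an assumed shape, not the
// CLI's confirmed format; check saveAgentToken before relying on it.
function writeTokenFile(name, token, home = os.homedir()) {
  const file = tokenFilePath(name, home);
  // Create ~/.commonly/tokens if it is missing, readable by the owner only.
  fs.mkdirSync(path.dirname(file), { recursive: true, mode: 0o700 });
  fs.writeFileSync(file, JSON.stringify({ token }, null, 2), { mode: 0o600 });
  return file;
}
```

With this layout, the agent named `cody` resolves to `~/.commonly/tokens/cody.json`, which is why nothing written to `config.json` was ever found.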

Also splits `COMMONLY_REGISTRY_AGENT_NAME` from `COMMONLY_AGENT_NAME` so the local file alias and the server-side registry agentName can diverge (Cody's local alias is "cody"; her registry install is "cloud-codex").
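
A rough sketch of how the two names might be read from the environment. The function `resolveAgentNames` is hypothetical, and the fallback to the local alias when `COMMONLY_REGISTRY_AGENT_NAME` is unset is an assumption, not something this PR states.

```javascript
// Resolve the local file alias and the registry agentName from an env-like object.
function resolveAgentNames(env) {
  const localAlias = env.COMMONLY_AGENT_NAME;
  if (!localAlias) {
    throw new Error('COMMONLY_AGENT_NAME is required');
  }
  // Assumption: when no registry name is given, reuse the local alias.
  const registryName = env.COMMONLY_REGISTRY_AGENT_NAME || localAlias;
  return { localAlias, registryName };
}

// Cody's case from the description: alias "cody", registry install "cloud-codex".
const names = resolveAgentNames({
  COMMONLY_AGENT_NAME: 'cody',
  COMMONLY_REGISTRY_AGENT_NAME: 'cloud-codex',
});
console.log(names.localAlias, names.registryName);
```

Keeping the two apart means the token file is keyed by the alias (`cody`) while server-side lookups use the registry name (`cloud-codex`).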

Test plan

  • After deploy: `kubectl logs deploy/cloud-codex-cody` shows ca-certificates installed → token file written → "auth.json found, starting commonly agent run cody" → poller running
  • @cody mention in the demo pod gets a reply

🤖 Generated with Claude Code

lilyshen0722 and others added 3 commits May 13, 2026 09:28
…uster

ADR-005 variant of the sam-local-codex laptop wrapper: same `commonly agent
run <name>` poll loop + codex CLI, but running in a cluster-side pod
instead of an operator's machine.

Motivation: ChatGPT binds OAuth sessions to the IP/device that completed
device-auth. A session device-auth'd on a laptop and then used by LiteLLM
from the cluster IP gets `token_invalidated` immediately — confirmed
empirically on dev today (probe of fresh tokens against /backend-api/codex/
responses from the cluster returned 401 INVALIDATED within seconds of
upload). When `codex login --device-auth` runs INSIDE this pod, the
cluster IP signs the device-auth AND signs subsequent CLI calls — no
mismatch, no anti-abuse revoke.

What this PR adds:
- `templates/agents/cloud-codex-deployment.yaml`: ranged Deployment + PVC
  per `.Values.agents.cloudCodex.agents.<name>`. Mirrors the
  clawdbot-deployment codex-tools-installer init container for binary
  setup, then main container runs `commonly agent run <name>`.
- `values.yaml`: top-level `agents.cloudCodex` block (disabled by default).
- `values-dev.yaml`: enables one agent (`cody`) bound to the demo pod.

Operator flow (one-time per agent install):
  1. POST /api/registry/install with agentName=cloud-codex + instanceId=<name>
  2. Mint a runtime token; put in k8s Secret cloud-codex-<name>-token
  3. `helm upgrade` — pod boots; init installs CLIs
  4. `kubectl exec -it deploy/cloud-codex-<name> -- codex login --device-auth`
     (completes in operator's browser; auth.json lands on the PVC at
     /state/.codex/auth.json — bound to cluster IP)
  5. Restart pod; main container picks up auth.json and starts the run
     loop. Replies use codex CLI in this pod, so OpenAI sees one stable
     client = one stable session = no invalidation.
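
Step 5 amounts to a gate on auth.json before the run loop starts. The following is only a sketch of that gate: `planStart` is a hypothetical helper, and the default `/state` directory is taken from the auth.json path in step 4. The argv deliberately carries only the agent name, since `commonly agent run` takes no `--instance` flag.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Decide whether the main container can start the run loop yet.
function planStart(name, stateDir = '/state') {
  const authFile = path.join(stateDir, '.codex', 'auth.json');
  if (!fs.existsSync(authFile)) {
    // Device-auth has not been completed inside this pod yet.
    return { ready: false, authFile, argv: null };
  }
  // Only the agent name is passed; instance and token come from the token file.
  return { ready: true, authFile, argv: ['commonly', 'agent', 'run', name] };
}
```

A wrapper could poll `planStart('cody')` until `ready` is true and then spawn `argv`, which would match the "auth.json found, starting commonly agent run cody" log line.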

Identity continuity (ADR-001 §3): AgentInstallation + User row predate
this pod and survive its restart/redeploy. The pod is just runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues caught when the first cloud-codex-cody pod tried to boot
into the agent run loop:

1. `commonly agent run <name>` doesn't accept a `--instance` flag —
   the template was passing it and the CLI exited "unknown option".
   The instance/token resolution happens via the per-agent token file,
   not via the run subcommand.

2. The container was writing `~/.commonly/config.json` with an
   `agentTokens` block — that's not the shape `loadAgentToken()` reads.
   The CLI looks at `~/.commonly/tokens/<name>.json` (one file per
   agent, per saveAgentToken in cli/src/commands/agent.js). Switch to
   that file shape so the run subcommand actually finds the token.

3. node:22-bookworm-slim doesn't ship ca-certificates, so codex CLI's
   outbound TLS to auth.openai.com / api.openai.com fails ("error
   sending request"). Install ca-certificates at boot (idempotent —
   apt skips what's there) so the device-auth + run loop work.
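
As an illustration of item 3, a boot-time check could look roughly like the sketch below. This is not the template's actual script: the PR relies on apt skipping packages that are already installed, whereas this sketch adds an explicit check for the Debian CA bundle path first. `ensureCaCertificates` and its injectable `exists` / `run` parameters are hypothetical.

```javascript
const fs = require('node:fs');
const { spawnSync } = require('node:child_process');

// Bundle file that Debian's ca-certificates package generates.
const CA_BUNDLE = '/etc/ssl/certs/ca-certificates.crt';

// Install ca-certificates only when the bundle is missing.
// `exists` and `run` are injectable so the logic can be exercised without apt.
function ensureCaCertificates({ exists = fs.existsSync, run = spawnSync } = {}) {
  if (exists(CA_BUNDLE)) return 'present';
  const steps = [
    ['update'],
    ['install', '-y', '--no-install-recommends', 'ca-certificates'],
  ];
  for (const args of steps) {
    const res = run('apt-get', args, { stdio: 'inherit' });
    if (res.status !== 0) {
      throw new Error(`apt-get ${args[0]} failed`);
    }
  }
  return 'installed';
}
```

Either approach is safe to repeat on every boot; the explicit check just avoids an `apt-get update` round trip when the roots are already in place.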

Also split COMMONLY_REGISTRY_AGENT_NAME from COMMONLY_AGENT_NAME so
the local file alias and the server-side registry agentName can
diverge (Cody's local alias is "cody"; her registry install is
"cloud-codex").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lilyshen0722 merged commit e8a8724 into main on May 14, 2026
8 checks passed