
feat(litellm): in-pod codex device-auth for cluster-bound sessions#365

Merged
samxu01 merged 1 commit into main from sprint/litellm-pod-codex-auth on May 15, 2026

Conversation

@samxu01
Contributor

@samxu01 samxu01 commented May 15, 2026

Summary

Moves the cluster-bound device-auth pattern from per-agent pods (cloud-codex) up to the LiteLLM pod so Nova/Pixel and any future codex-CLI agent share one auth surface.

Root cause we keep hitting

ChatGPT binds OAuth sessions to the IP/device that ran device-auth. Tokens device-auth'd on a laptop and uploaded to the cluster get `token_invalidated` on first cluster use. A direct probe today of freshly uploaded account-1/2 tokens returned 401 INVALIDATED within seconds for both.

Fix

Run device-auth from INSIDE the LiteLLM pod. The resulting auth.json is natively cluster-IP-bound. Three changes:

  1. codex-cli sidecar on the LiteLLM pod. Installs codex CLI, idles. Operator runs:
    ```
    kubectl exec -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh 1
    ```
    Completes device-auth in browser; auth.json lands on shared chatgpt-auth volume as `/chatgpt-auth/auth-1.json`. Repeat for accounts 2 and 3.

  2. Rotator prefers pod-side files `/chatgpt-auth/auth-{1,2,3}.json` when present; falls back to env-var-fed tokens otherwise. Existing rotation cadence + 429 signal handling unchanged.

  3. chatgpt-auth can be a PVC via `litellm.chatgptAuth.persistence.enabled`. Required for the cluster-bound flow (emptyDir wipes on every helm-upgrade). Dev enables it.

Plus `strategy.type: Recreate` when the PVC is enabled (RWO single-writer constraint).
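The auth-login flow in step 1 can be sketched as a function. This is a hypothetical sketch, not the script shipped in this PR: the `codex login --device-auth` invocation and the CODEX_HOME layout are assumptions here.

```shell
# Hypothetical sketch of what /scripts/auth-login.sh does for account N:
# run device-auth in an isolated CODEX_HOME, then park the resulting
# auth.json on the shared volume as /chatgpt-auth/auth-N.json.
auth_login() {
  n="${1:?usage: auth_login <account-number>}"
  AUTH_DIR="${AUTH_DIR:-/chatgpt-auth}"
  CODEX_HOME="${CODEX_HOME:-/tmp/codex-home-$n}"
  export CODEX_HOME
  mkdir -p "$CODEX_HOME" "$AUTH_DIR"
  codex login --device-auth            # operator completes this in a browser
  cp "$CODEX_HOME/auth.json" "$AUTH_DIR/auth-$n.json"
  echo "wrote $AUTH_DIR/auth-$n.json"
}
```

Because the login runs from the pod, the token ChatGPT issues is bound to the cluster's egress, which is the whole point of the change.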

Test plan

  • After deploy: `kubectl get pods -n commonly-dev -l app=litellm` shows pod with codex-cli sidecar
  • Operator runs `kubectl exec -it litellm-xxx -c codex-cli -- /scripts/auth-login.sh 1` and completes device-auth — script writes `/chatgpt-auth/auth-1.json`
  • Repeat for accounts 2, 3
  • Rotator log shows `active account-1` etc. cycling between pod-side files
  • @nova in demo pod gets a real reply (not "Agent failed before reply")
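The rotation preference checked in the log line above can be sketched as a source-selection step. This is a hypothetical sketch of the decision, not the actual rotator code; the env-var name is an assumption.

```shell
# Hypothetical sketch of the rotator's source selection: prefer the
# pod-side (cluster-IP-bound) file, fall back to the env-fed token
# (laptop-bound), else report nothing usable for that account.
AUTH_DIR="${AUTH_DIR:-/chatgpt-auth}"

pick_auth() {
  n="$1"
  f="$AUTH_DIR/auth-$n.json"
  if [ -s "$f" ]; then
    echo "file:$f"                      # pod-side, preferred
    return
  fi
  tok=$(eval "echo \"\${CHATGPT_TOKEN_$n:-}\"")
  if [ -n "$tok" ]; then
    echo "env:CHATGPT_TOKEN_$n"         # laptop-bound fallback
  else
    echo "none"
  fi
}
```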

Follow-up (not in this PR)

  • Switch `cloud-codex-cody` pod to point codex CLI at LiteLLM via `-c model_provider=litellm` + virtual key, so Cody routes through the same auth surface instead of needing her own `/state/.codex/auth.json`.

🤖 Generated with Claude Code

ChatGPT binds OAuth sessions to the IP/device that completed device-auth.
Laptop-device-auth'd tokens uploaded to the cluster get token_invalidated
on first use (confirmed via direct probe today). The cloud-codex-cody pod
already proved the fix: device-auth FROM inside the cluster produces
sessions ChatGPT keeps alive across cluster usage.

This brings that fix one layer up so Nova/Pixel and any future codex
agent share the same auth surface (LiteLLM), rather than each agent
needing its own pod with its own codex login.

What changes:

1. New `codex-cli` sidecar on the LiteLLM pod. Installs codex CLI on
   first boot, idles. Operator runs:
     kubectl exec -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh 1
   Completes device-auth in browser; resulting auth.json lands on the
   shared chatgpt-auth volume as /chatgpt-auth/auth-1.json. Repeat for
   accounts 2 and 3.

2. codex-auth-rotator now PREFERS pod-side /chatgpt-auth/auth-N.json
   files when present, and only falls back to env-var-fed tokens
   (laptop-bound, dead) when no pod-side files exist. Keeps the existing
   rotation cadence + 429 signal handling unchanged.

3. chatgpt-auth volume can be a PVC (values: litellm.chatgptAuth.
   persistence.enabled). Required for the cluster-bound flow — emptyDir
   loses tokens on every pod restart. Dev opts in; defaults stay off
   so OSS deployments aren't surprised.

4. Adds `strategy.type: Recreate` to the LiteLLM Deployment when the
   PVC is enabled — RWO single-writer can't hand off cleanly with
   RollingUpdate.
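The opt-in from points 3 and 4 might look like this in a values override. Only the `litellm.chatgptAuth.persistence.enabled` key path comes from this PR; the surrounding layout is illustrative.

```yaml
# values-dev.yaml (fragment, assumed layout): persist /chatgpt-auth across
# upgrades and restarts. When enabled, the chart also switches the LiteLLM
# Deployment to strategy.type: Recreate (RWO volume, single writer).
litellm:
  chatgptAuth:
    persistence:
      enabled: true   # default is false, so OSS deployments are unaffected
```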

After this lands + operator does device-auth × N from inside the
codex-cli sidecar, all dev LLM traffic (openclaw moltbot via LiteLLM
chatgpt/ bridge, and any future codex CLI agents pointed at LiteLLM)
uses cluster-bound sessions. Nova/Pixel come back to life without
another laptop device-auth round.

Follow-up: switch cloud-codex-cody to point codex CLI at LiteLLM
(model_provider override + virtual key) so Cody routes through the
same auth surface instead of needing her own /state/.codex/auth.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samxu01 samxu01 merged commit 7dea862 into main May 15, 2026
10 checks passed
samxu01 added a commit that referenced this pull request May 15, 2026
…iner (#366)

* feat(litellm): in-pod codex device-auth for cluster-IP-bound sessions


* fix(litellm): codex-cli is a sidecar (containers:), not an init container

In PR #365 the codex-cli block landed in the initContainers list by
mistake, which made the pod stuck Init:1/2 — codex-cli's sleep loop
never exits, so the pod never progressed to Running, and helm-upgrade
hit the 10m timeout.

Move codex-cli into containers: (sidecar position, after the
codex-auth-rotator). LiteLLM main container can now reach Ready
while codex-cli idles in parallel waiting for operator exec.
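The corrected placement can be sketched as a pod-spec fragment. The container image and command below are assumptions; the point is only which list `codex-cli` lives in.

```yaml
# deployment.yaml (fragment): codex-cli belongs under containers:, so its
# idle loop runs alongside LiteLLM instead of blocking pod init. An init
# container that never exits leaves the pod stuck at Init:1/2.
spec:
  template:
    spec:
      containers:
        - name: litellm
          # ...
        - name: codex-auth-rotator
          # ...
        - name: codex-cli              # sidecar: idles until operator exec
          command: ["sh", "-c", "sleep infinity"]   # assumed idle loop
          volumeMounts:
            - name: chatgpt-auth
              mountPath: /chatgpt-auth
      # initContainers: codex-cli must NOT be listed here
```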

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
samxu01 added a commit that referenced this pull request May 15, 2026
…pt.com (#369)

Multi-runtime ≠ multi-auth-surface. Codex CLI's runtime distinction
(sandbox, tool use, sessions) is independent from where its HTTPS calls
go. Point codex CLI at LiteLLM instead of chatgpt.com so:

- single auth surface across openclaw and codex runtimes
- one rotator, one cluster-bound auth.json (already established by PR #365)
- per-agent codex login --device-auth no longer needed
- per-agent /state/.codex/auth.json no longer needed
- shared quota pool across all agents
- LiteLLM observability captures all model traffic regardless of runtime

What changes:
- Boot script seeds ~/.codex/config.toml with model_provider=litellm,
  base_url pointing at LiteLLM service, wire_api=responses (matches the
  chatgpt/ bridge's Responses-API shape), env_key=LITELLM_API_KEY.
- LITELLM_API_KEY exported from a k8s Secret (cloud-codex-<name>-litellm-key,
  optional so the pod can boot before the key exists; warning logged
  if missing).
- Drops the "wait for /state/.codex/auth.json" gate — no longer needed
  since codex CLI no longer holds its own auth.
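The seeding step above can be sketched as a boot-script fragment. The key names (`model_provider`, `base_url`, `wire_api`, `env_key`) come from this commit's description; the in-cluster service URL and the exact TOML layout are assumptions.

```shell
# Hypothetical sketch of the boot-script step that seeds codex CLI's
# provider config so its HTTPS calls go to LiteLLM instead of chatgpt.com.
CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
mkdir -p "$CODEX_HOME"
cat > "$CODEX_HOME/config.toml" <<'EOF'
model_provider = "litellm"

[model_providers.litellm]
name = "LiteLLM"
base_url = "http://litellm.commonly-dev.svc.cluster.local:4000/v1"
wire_api = "responses"
env_key = "LITELLM_API_KEY"
EOF

# Warn rather than fail if the virtual key isn't mounted yet: the pod is
# allowed to boot before the secret exists.
[ -n "${LITELLM_API_KEY:-}" ] || echo "WARN: LITELLM_API_KEY not set" >&2
```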

Operator setup (per agent):
  1. POST /api/registry/install (cloud-codex/<name>)
  2. Mint AgentInstallation runtime token → secret cloud-codex-<name>-token
  3. Mint LiteLLM virtual key → secret cloud-codex-<name>-litellm-key
  4. helm upgrade — pod boots, no device-auth needed

The cloud-codex pod's PVC still holds /state/.commonly/tokens/<name>.json
(commonly agent run loop's CAP token); only the codex auth.json went away.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
