
feat(litellm): in-pod codex device-auth for cluster-bound sessions#365

Merged
samxu01 merged 1 commit into main from sprint/litellm-pod-codex-auth on May 15, 2026

Conversation

@samxu01
Contributor

@samxu01 samxu01 commented May 15, 2026

Summary

Moves the cluster-bound device-auth pattern from per-agent pods (cloud-codex) up to the LiteLLM pod so Nova/Pixel and any future codex-CLI agent share one auth surface.

Root cause we keep hitting

ChatGPT binds OAuth sessions to the IP/device that ran device-auth. Tokens device-auth'd on a laptop and uploaded to the cluster get `token_invalidated` on first cluster use. A direct probe today of freshly uploaded account-1/2 tokens returned 401 INVALIDATED within seconds for both.

Fix

Run device-auth from INSIDE the LiteLLM pod. The resulting auth.json is natively cluster-IP-bound. Three changes:

  1. codex-cli sidecar on the LiteLLM pod. Installs codex CLI, idles. Operator runs:
    ```
    kubectl exec -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh 1
    ```
    Completes device-auth in browser; auth.json lands on shared chatgpt-auth volume as `/chatgpt-auth/auth-1.json`. Repeat for accounts 2 and 3.

  2. Rotator prefers pod-side files `/chatgpt-auth/auth-{1,2,3}.json` when present; falls back to env-var-fed tokens otherwise. Existing rotation cadence + 429 signal handling unchanged.

  3. chatgpt-auth can be a PVC via `litellm.chatgptAuth.persistence.enabled`. Required for the cluster-bound flow (emptyDir wipes on every helm-upgrade). Dev enables it.

Plus `strategy.type: Recreate` when the PVC is enabled (RWO single-writer constraint).
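The auth-login flow in step 1 can be sketched as a function. This is a hypothetical sketch, not the script shipped in this PR: the `codex login --device-auth` invocation and the CODEX_HOME layout are assumptions here.

```shell
# Hypothetical sketch of what /scripts/auth-login.sh does for account N:
# run device-auth in an isolated CODEX_HOME, then park the resulting
# auth.json on the shared volume as /chatgpt-auth/auth-N.json.
auth_login() {
  n="${1:?usage: auth_login <account-number>}"
  AUTH_DIR="${AUTH_DIR:-/chatgpt-auth}"
  CODEX_HOME="${CODEX_HOME:-/tmp/codex-home-$n}"
  export CODEX_HOME
  mkdir -p "$CODEX_HOME" "$AUTH_DIR"
  codex login --device-auth            # operator completes this in a browser
  cp "$CODEX_HOME/auth.json" "$AUTH_DIR/auth-$n.json"
  echo "wrote $AUTH_DIR/auth-$n.json"
}
```

Because the login runs from the pod, the token ChatGPT issues is bound to the cluster's egress, which is the whole point of the change.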

Test plan

  • After deploy: `kubectl get pods -n commonly-dev -l app=litellm` shows pod with codex-cli sidecar
  • Operator runs `kubectl exec -it litellm-xxx -c codex-cli -- /scripts/auth-login.sh 1` and completes device-auth — script writes `/chatgpt-auth/auth-1.json`
  • Repeat for accounts 2, 3
  • Rotator log shows `active account-1` etc. cycling between pod-side files
  • @nova in demo pod gets a real reply (not "Agent failed before reply")
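The rotation preference checked in the log line above can be sketched as a source-selection step. This is a hypothetical sketch of the decision, not the actual rotator code; the env-var name is an assumption.

```shell
# Hypothetical sketch of the rotator's source selection: prefer the
# pod-side (cluster-IP-bound) file, fall back to the env-fed token
# (laptop-bound), else report nothing usable for that account.
AUTH_DIR="${AUTH_DIR:-/chatgpt-auth}"

pick_auth() {
  n="$1"
  f="$AUTH_DIR/auth-$n.json"
  if [ -s "$f" ]; then
    echo "file:$f"                      # pod-side, preferred
    return
  fi
  tok=$(eval "echo \"\${CHATGPT_TOKEN_$n:-}\"")
  if [ -n "$tok" ]; then
    echo "env:CHATGPT_TOKEN_$n"         # laptop-bound fallback
  else
    echo "none"
  fi
}
```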

Follow-up (not in this PR)

  • Switch `cloud-codex-cody` pod to point codex CLI at LiteLLM via `-c model_provider=litellm` + virtual key, so Cody routes through the same auth surface instead of needing her own `/state/.codex/auth.json`.

🤖 Generated with Claude Code

ChatGPT binds OAuth sessions to the IP/device that completed device-auth.
Laptop-device-auth'd tokens uploaded to the cluster get token_invalidated
on first use (confirmed via direct probe today). The cloud-codex-cody pod
already proved the fix: device-auth FROM inside the cluster produces
sessions ChatGPT keeps alive across cluster usage.

This brings that fix one layer up so Nova/Pixel and any future codex
agent share the same auth surface (LiteLLM), rather than each agent
needing its own pod with its own codex login.

What changes:

1. New `codex-cli` sidecar on the LiteLLM pod. Installs codex CLI on
   first boot, idles. Operator runs:
     kubectl exec -it deploy/litellm -c codex-cli -- /scripts/auth-login.sh 1
   Completes device-auth in browser; resulting auth.json lands on the
   shared chatgpt-auth volume as /chatgpt-auth/auth-1.json. Repeat for
   accounts 2 and 3.

2. codex-auth-rotator now PREFERS pod-side /chatgpt-auth/auth-N.json
   files when present, and only falls back to env-var-fed tokens
   (laptop-bound, dead) when no pod-side files exist. Keeps the existing
   rotation cadence + 429 signal handling unchanged.

3. chatgpt-auth volume can be a PVC (values: litellm.chatgptAuth.
   persistence.enabled). Required for the cluster-bound flow — emptyDir
   loses tokens on every pod restart. Dev opts in; defaults stay off
   so OSS deployments aren't surprised.

4. Adds `strategy.type: Recreate` to the LiteLLM Deployment when the
   PVC is enabled — RWO single-writer can't hand off cleanly with
   RollingUpdate.
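The opt-in from points 3 and 4 might look like this in a values override. Only the `litellm.chatgptAuth.persistence.enabled` key path comes from this PR; the surrounding layout is illustrative.

```yaml
# values-dev.yaml (fragment, assumed layout): persist /chatgpt-auth across
# upgrades and restarts. When enabled, the chart also switches the LiteLLM
# Deployment to strategy.type: Recreate (RWO volume, single writer).
litellm:
  chatgptAuth:
    persistence:
      enabled: true   # default is false, so OSS deployments are unaffected
```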

After this lands + operator does device-auth × N from inside the
codex-cli sidecar, all dev LLM traffic (openclaw moltbot via LiteLLM
chatgpt/ bridge, and any future codex CLI agents pointed at LiteLLM)
uses cluster-bound sessions. Nova/Pixel come back to life without
another laptop device-auth round.

Follow-up: switch cloud-codex-cody to point codex CLI at LiteLLM
(model_provider override + virtual key) so Cody routes through the
same auth surface instead of needing her own /state/.codex/auth.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samxu01 samxu01 merged commit 7dea862 into main May 15, 2026
10 checks passed
samxu01 added a commit that referenced this pull request May 15, 2026
…iner (#366)

* feat(litellm): in-pod codex device-auth for cluster-IP-bound sessions


* fix(litellm): codex-cli is a sidecar (containers:), not an init container

In PR #365 the codex-cli block landed in the initContainers list by
mistake, which made the pod stuck Init:1/2 — codex-cli's sleep loop
never exits, so the pod never progressed to Running, and helm-upgrade
hit the 10m timeout.

Move codex-cli into containers: (sidecar position, after the
codex-auth-rotator). LiteLLM main container can now reach Ready
while codex-cli idles in parallel waiting for operator exec.
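The corrected placement can be sketched as a pod-spec fragment. The container image and command below are assumptions; the point is only which list `codex-cli` lives in.

```yaml
# deployment.yaml (fragment): codex-cli belongs under containers:, so its
# idle loop runs alongside LiteLLM instead of blocking pod init. An init
# container that never exits leaves the pod stuck at Init:1/2.
spec:
  template:
    spec:
      containers:
        - name: litellm
          # ...
        - name: codex-auth-rotator
          # ...
        - name: codex-cli              # sidecar: idles until operator exec
          command: ["sh", "-c", "sleep infinity"]   # assumed idle loop
          volumeMounts:
            - name: chatgpt-auth
              mountPath: /chatgpt-auth
      # initContainers: codex-cli must NOT be listed here
```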

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
samxu01 added a commit that referenced this pull request May 15, 2026
…pt.com (#369)

Multi-runtime ≠ multi-auth-surface. Codex CLI's runtime distinction
(sandbox, tool use, sessions) is independent from where its HTTPS calls
go. Point codex CLI at LiteLLM instead of chatgpt.com so:

- single auth surface across openclaw and codex runtimes
- one rotator, one cluster-bound auth.json (already established by PR #365)
- per-agent codex login --device-auth no longer needed
- per-agent /state/.codex/auth.json no longer needed
- shared quota pool across all agents
- LiteLLM observability captures all model traffic regardless of runtime

What changes:
- Boot script seeds ~/.codex/config.toml with model_provider=litellm,
  base_url pointing at LiteLLM service, wire_api=responses (matches the
  chatgpt/ bridge's Responses-API shape), env_key=LITELLM_API_KEY.
- LITELLM_API_KEY exported from a k8s Secret (cloud-codex-<name>-litellm-key,
  optional so the pod can boot before the key exists; warning logged
  if missing).
- Drops the "wait for /state/.codex/auth.json" gate — no longer needed
  since codex CLI no longer holds its own auth.
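The seeding step above can be sketched as a boot-script fragment. The key names (`model_provider`, `base_url`, `wire_api`, `env_key`) come from this commit's description; the in-cluster service URL and the exact TOML layout are assumptions.

```shell
# Hypothetical sketch of the boot-script step that seeds codex CLI's
# provider config so its HTTPS calls go to LiteLLM instead of chatgpt.com.
CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
mkdir -p "$CODEX_HOME"
cat > "$CODEX_HOME/config.toml" <<'EOF'
model_provider = "litellm"

[model_providers.litellm]
name = "LiteLLM"
base_url = "http://litellm.commonly-dev.svc.cluster.local:4000/v1"
wire_api = "responses"
env_key = "LITELLM_API_KEY"
EOF

# Warn rather than fail if the virtual key isn't mounted yet: the pod is
# allowed to boot before the secret exists.
[ -n "${LITELLM_API_KEY:-}" ] || echo "WARN: LITELLM_API_KEY not set" >&2
```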

Operator setup (per agent):
  1. POST /api/registry/install (cloud-codex/<name>)
  2. Mint AgentInstallation runtime token → secret cloud-codex-<name>-token
  3. Mint LiteLLM virtual key → secret cloud-codex-<name>-litellm-key
  4. helm upgrade — pod boots, no device-auth needed

The cloud-codex pod's PVC still holds /state/.commonly/tokens/<name>.json
(commonly agent run loop's CAP token); only the codex auth.json went away.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
