This document describes the preview-era Kubernetes deployment baseline plus the
current productized control-plane path for issues #35, #36, #286, #287,
#288, #289, #290, #291, #292, and #293.
The Kubernetes manifests now provide:
- separate images for Gateway, Dashboard, Agent runtime, and Sandbox runtime
- three namespaces: openhive, hive-agents, hive-sandbox
- default-deny-style NetworkPolicy boundaries for agent and sandbox traffic
- an init-container bootstrap flow that copies default agent config files into a PVC without overwriting local edits
- a full Gateway runtime overlay with strict startup and explicit migration checks
- a full platform runtime overlay that adds the Next.js dashboard and a combined Ingress resource
The Kubernetes runtime backend for ContainerAgentPool now exists and creates
one Pod plus one Service per managed agent identity, carrying the exact
OpenHive agent_id and controller ownership metadata in annotations so orphan
reconciliation can remain safe. The Agent and Sandbox images are therefore no
longer only probeable placeholders: the Agent image is the real per-agent
runtime target when HIVE_AGENT_CONFIG_JSON, HIVE_GATEWAY_URL, and
HIVE_INTERNAL_SECRET are provided.
The base manifests still do not wire the main application into
ContainerAgentPool; that composition-root selection lands separately. Until
that wiring is enabled, the base agent Deployment remains the preview-era
health-only baseline while the runtime backend is available for the dynamic
per-agent lifecycle path. The sandbox deployment also carries the experimental
dev-task API used by the governed workspace-task lane, but that lane is not
yet a fully supported preview_local operator workflow.
This Kubernetes baseline therefore complements the source-based
preview_local guide rather than replacing it. In local preview docs, the
matching sandbox operator surface is still the optional make run-sandbox
workflow on port 8091 with HIVE_SANDBOX_URL=http://127.0.0.1:8091.
For the role-by-role runtime contract behind these manifests, see
docs/container-runtime-contracts.md.
Published images live in GitHub Container Registry:
- ghcr.io/terrywangcode/openhive-gateway
- ghcr.io/terrywangcode/openhive-agent
- ghcr.io/terrywangcode/openhive-sandbox
- ghcr.io/terrywangcode/openhive-dashboard
Tag strategy for the supported preview path:
- latest tracks the newest successful push from main
- sha-<short> is published for every push to main
- v* release tags publish the matching semver image tag
The Kubernetes manifests default to the GHCR latest tags. For repeatable
preview installs, pin the preview installer env file to either a sha-<short>
or release tag before applying a real cluster update.
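As a rough pre-flight check, a short script can flag floating image references before an install. The sketch below is illustrative only: the helper names image_tag and is_pinned are hypothetical, the sha-<short> pattern assumes a 7 to 40 character lowercase hex suffix, and the release pattern assumes tags shaped like v1.2.3. Neither pattern is confirmed by the publishing workflow, and digest references (@sha256:...) are not handled.

```python
import re

# Assumed tag shapes; the real publishing workflow may use different ones.
SHA_TAG = re.compile(r"^sha-[0-9a-f]{7,40}$")
RELEASE_TAG = re.compile(r"^v\d+\.\d+\.\d+([-.+][0-9A-Za-z.-]+)?$")


def image_tag(image_ref: str) -> str:
    """Return the tag of an image reference, or 'latest' when no tag is given."""
    # Only look at the last path segment so a registry port is not read as a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return "latest"
    return last_segment.rsplit(":", 1)[1]


def is_pinned(image_ref: str) -> bool:
    """True for sha-<short> or release tags, False for floating tags such as latest."""
    tag = image_tag(image_ref)
    return bool(SHA_TAG.match(tag) or RELEASE_TAG.match(tag))
```

Running is_pinned over the four image values in the env file before make k8s-preview-install gives a quick signal that an install is repeatable.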
You can still build the same images locally for debugging:
docker build -f Dockerfile.gateway -t openhive-gateway:dev .
docker build -f Dockerfile.web -t openhive-dashboard:dev .
docker build -f Dockerfile.agent -t openhive-agent:dev .
docker build -f Dockerfile.sandbox -t openhive-sandbox:dev .

The dashboard image uses Next.js standalone output and starts the generated
server.js runtime on port 3000.
The supported operator-facing packaging surface for the current preview slice is the narrow env-file driven installer:
cp deploy/k8s/preview-installer/values.env.example /tmp/openhive-preview.env
# edit /tmp/openhive-preview.env with real images, secrets, DB URL, and ingress host
make k8s-preview-plan env_file=/tmp/openhive-preview.env
make k8s-preview-install env_file=/tmp/openhive-preview.env

The installer intentionally keeps scope narrow:
- fixed namespaces: openhive, hive-agents, hive-sandbox
- fixed secret name: openhive-platform-secrets
- supported DB mode: external PostgreSQL only
- supported runtime packaging: one migration Job plus one full platform overlay
The values file is the only required operator input surface for this preview path. It captures:
- DATABASE_URL
- DASHBOARD_SESSION_SECRET
- HIVE_INTERNAL_SECRET
- HIVE_ADMIN_USERNAME
- HIVE_ADMIN_PASSWORD
- OPENHIVE_GATEWAY_IMAGE
- OPENHIVE_DASHBOARD_IMAGE
- OPENHIVE_AGENT_IMAGE
- OPENHIVE_SANDBOX_IMAGE
- OPENHIVE_PIPELINE_IMAGE (optional; defaults to sandbox image)
- OPENHIVE_INGRESS_HOST
- OPENHIVE_INGRESS_TLS_SECRET
- OPENHIVE_INGRESS_CLASS
- OPENHIVE_CERT_MANAGER_CLUSTER_ISSUER
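Because the values file is the single input surface, it can help to lint it before running the installer. The following is a minimal sketch, not the installer's own validation: parse_env_file and check_values are hypothetical helpers, and treating every key except OPENHIVE_PIPELINE_IMAGE as required is an assumption; values.env.example remains the authority on which keys may be left empty.

```python
from typing import Dict, List, Tuple

# Assumption: everything except OPENHIVE_PIPELINE_IMAGE is required.
REQUIRED_KEYS = [
    "DATABASE_URL", "DASHBOARD_SESSION_SECRET", "HIVE_INTERNAL_SECRET",
    "HIVE_ADMIN_USERNAME", "HIVE_ADMIN_PASSWORD",
    "OPENHIVE_GATEWAY_IMAGE", "OPENHIVE_DASHBOARD_IMAGE",
    "OPENHIVE_AGENT_IMAGE", "OPENHIVE_SANDBOX_IMAGE",
    "OPENHIVE_INGRESS_HOST", "OPENHIVE_INGRESS_TLS_SECRET",
    "OPENHIVE_INGRESS_CLASS", "OPENHIVE_CERT_MANAGER_CLUSTER_ISSUER",
]


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_values(values: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Return (missing required keys, values with the pipeline image default applied)."""
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    resolved = dict(values)
    if not resolved.get("OPENHIVE_PIPELINE_IMAGE"):
        # The pipeline image is optional and defaults to the sandbox image.
        resolved["OPENHIVE_PIPELINE_IMAGE"] = resolved.get("OPENHIVE_SANDBOX_IMAGE", "")
    return missing, resolved
```

A non-empty missing list is a reason to stop before make k8s-preview-plan.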
The installer renders a temporary values-specific Kustomize tree, then runs the repeatable sequence:
- apply namespaces
- apply openhive-platform-secrets
- rerun the migration Job and wait for success
- apply the full platform runtime
- wait for Gateway, Dashboard, Agent, and Sandbox rollouts

To render or apply the health-only base manifests directly:

kubectl kustomize deploy/k8s/base
kubectl apply -k deploy/k8s/base

CI and local validation:
make k8s-validate
./scripts/k8s/verify-bootstrap.sh

The base deployment intentionally keeps Gateway on the health-only
hive.container.gateway_entrypoint:app entrypoint so low-level image and probe
checks stay lightweight.
To run the real OpenHive control plane in-cluster, use:
kubectl kustomize deploy/k8s/overlays/full-gateway-runtime
kubectl apply -k deploy/k8s/overlays/full-gateway-runtime

This overlay keeps the same Service and image, but patches the Gateway deployment to:
- start hive.main:app
- run under the dedicated hive-gateway-runtime ServiceAccount, with namespace-scoped RBAC in hive-agents and hive-sandbox
- mount a writable workspace at /data/hive
- select HIVE_POOL_BACKEND=container with HIVE_CONTAINER_RUNTIME_BACKEND=kubernetes so Gateway creates isolated agent runtimes through the Kubernetes backend instead of the in-process local pool
- point those isolated runtimes back at Gateway through HIVE_AGENT_RUNTIME_GATEWAY_URL=http://hive-gateway.openhive.svc.cluster.local:8080
- declare HIVE_AGENT_RUNTIME_IMAGE=ghcr.io/terrywangcode/openhive-agent:latest and HIVE_K8S_AGENT_NAMESPACE=hive-agents for the first supported lifecycle path
- declare HIVE_SANDBOX_URL=http://hive-sandbox.hive-sandbox.svc.cluster.local:8091 so the full runtime path reaches the in-cluster sandbox service explicitly
- enable HIVE_STRICT_STARTUP=true so missing secrets or DB startup failures crash the pod instead of serving a degraded control plane
- switch startup into HIVE_STARTUP_MIGRATION_MODE=check, which requires the DB schema to already be at the current Alembic head
- disable HIVE_METADATA_CREATE_ON_STARTUP so the K8s path does not mutate the schema outside the explicit migration flow
- add a startupProbe on /healthz so migrations and startup wiring can complete before liveness checks take over
- widen Gateway ingress to the openhive namespace so an in-cluster dashboard can reach the API
- allow Gateway egress to TCP 8090 so it can reach per-agent runtime Services
- allow Gateway egress to TCP 8091 so it can reach the shared sandbox Service
- allow Gateway egress to TCP 5432 for the first external PostgreSQL target
- allow sandbox egress to TCP 5432 for the shared PostgreSQL-backed dev-task state
- allow sandbox egress to TCP 443 so the direct codex_cli path can reach HTTPS model/provider endpoints when operators enable real provider-backed execution
- allow Gateway egress to TCP 443 so container-pool mode can reach the Kubernetes API and the normal HTTPS upstreams used by the full control plane
Create the required secrets before applying the overlay:
kubectl create secret generic openhive-platform-secrets -n openhive \
--from-literal=database-url='postgresql+asyncpg://hive:password@postgres.example:5432/hive' \
--from-literal=dashboard-session-secret='replace-with-random-secret' \
--from-literal=gateway-internal-secret='replace-with-internal-relay-secret' \
--from-literal=admin-username='admin' \
--from-literal=admin-password='change-me'
kubectl create secret generic openhive-platform-secrets -n hive-sandbox \
--from-literal=database-url='postgresql+asyncpg://hive:password@postgres.example:5432/hive' \
--from-literal=codex-model='gpt-5.4' \
--from-literal=codex-provider-id='club' \
--from-literal=codex-provider-name='ai code club' \
--from-literal=codex-base-url='https://claude-code.club/openai' \
--from-literal=codex-wire-api='responses' \
--from-literal=codex-env-key='OPENAI_API_KEY' \
--from-literal=codex-requires-openai-auth='true' \
--from-literal=openai-api-key='replace-if-using-real-codex-cli' \
--from-literal=anthropic-api-key='replace-if-using-real-codex-cli' \
--from-literal=deepseek-api-key='replace-if-using-real-codex-cli' \
--from-literal=qwen-api-key='replace-if-using-real-codex-cli'

Minimum env/secret surface for the full runtime overlay:
- DATABASE_URL
- DASHBOARD_SESSION_SECRET
- HIVE_INTERNAL_SECRET
- HIVE_ADMIN_USERNAME
- HIVE_ADMIN_PASSWORD
- HIVE_WORKSPACE=/data/hive
- HIVE_POOL_BACKEND=container
- HIVE_CONTAINER_RUNTIME_BACKEND=kubernetes
- HIVE_AGENT_RUNTIME_GATEWAY_URL=http://hive-gateway.openhive.svc.cluster.local:8080
- HIVE_AGENT_RUNTIME_IMAGE=ghcr.io/terrywangcode/openhive-agent:latest
- HIVE_K8S_AGENT_NAMESPACE=hive-agents
- HIVE_DEPLOYMENT_BACKEND=kubernetes
- HIVE_PIPELINE_IMAGE=ghcr.io/terrywangcode/openhive-sandbox:latest
- HIVE_K8S_PIPELINE_NAMESPACE=hive-sandbox
- HIVE_SANDBOX_URL=http://hive-sandbox.hive-sandbox.svc.cluster.local:8091
- sandbox codex_cli auth:
  - HIVE_SANDBOX_CODEX_AUTH_MODE=env
  - HIVE_SANDBOX_CODEX_SANDBOX_MODE=danger-full-access in the Kubernetes sandbox pod, because the pod itself is the outer isolation boundary and the default inner bubblewrap sandbox is not available in this runtime
  - HIVE_SANDBOX_CODEX_ENV_ALLOWLIST=OPENAI_API_KEY,ANTHROPIC_API_KEY,DEEPSEEK_API_KEY,QWEN_API_KEY
  - optional provider bootstrap envs HIVE_SANDBOX_CODEX_MODEL, HIVE_SANDBOX_CODEX_PROVIDER_ID, HIVE_SANDBOX_CODEX_PROVIDER_NAME, HIVE_SANDBOX_CODEX_BASE_URL, HIVE_SANDBOX_CODEX_WIRE_API, HIVE_SANDBOX_CODEX_ENV_KEY, and HIVE_SANDBOX_CODEX_REQUIRES_OPENAI_AUTH let the sandbox materialize ~/.codex/config.toml at runtime when a real OpenAI-compatible Codex provider needs more than a bare API key
  - optional sandbox-namespace secret keys openai-api-key, anthropic-api-key, deepseek-api-key, qwen-api-key, plus optional codex-model, codex-provider-id, codex-provider-name, codex-base-url, codex-wire-api, codex-env-key, and codex-requires-openai-auth
- optional sandbox workspace apply relay:
  - HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_URL points to an operator-controlled HTTP endpoint that can apply approved patches to host-local workspaces when the sandbox was seeded from workspace_archive_b64
  - HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_TOKEN is sent as a bearer token to that endpoint
  - this relay is only for approved patch apply-back; it is distinct from the model-token relay and does not change provider-secret residency claims
  - if the relay endpoint is outside the cluster, operators must add a matching sandbox egress rule for that host and port
- HIVE_STRICT_STARTUP=true
- HIVE_STARTUP_MIGRATION_MODE=check
- HIVE_METADATA_CREATE_ON_STARTUP=false
The base sandbox deployment now opts the in-cluster codex_cli path into this
explicit provider-env mode by wiring those optional secret refs into the sandbox
runtime env. If the secret keys are absent, the pod still starts and the
governed codex child receives no provider keys. This direct-env mode is
operationally useful for real codex_cli execution, but it is still distinct
from the opt-in relay_helper proof path and any future token-based relay flow.
When provider bootstrap envs are present, the sandbox entrypoint writes a
minimal runtime ~/.codex/config.toml once at startup and leaves that runtime
copy in place, so the default codex_cli path can target an OpenAI-compatible
provider without baking provider config into the image. The base manifest now
mounts a dedicated writable home at /home/codex; using /tmp as HOME
causes real non-ephemeral Codex runs to stall in containerized proof/runtime
paths.
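To make the bootstrap step concrete, here is a hedged Python sketch of how provider bootstrap envs could be turned into a minimal config.toml. It is not the sandbox entrypoint itself. The helper names render_codex_config and write_once are hypothetical, the TOML key names (model, model_provider, model_providers.<id>, base_url, wire_api, env_key, requires_openai_auth) are assumed from the matching secret key names, and skipping the write when a copy already exists is one reading of "once at startup".

```python
import json
from pathlib import Path
from typing import Mapping, Optional

PREFIX = "HIVE_SANDBOX_CODEX_"


def render_codex_config(env: Mapping[str, str]) -> Optional[str]:
    """Render a minimal config.toml, or None when no provider bootstrap is requested."""
    provider_id = env.get(PREFIX + "PROVIDER_ID")
    base_url = env.get(PREFIX + "BASE_URL")
    if not provider_id or not base_url:
        return None
    lines = []
    model = env.get(PREFIX + "MODEL")
    if model:
        # json.dumps yields a double-quoted string that is also a valid TOML basic string.
        lines.append(f"model = {json.dumps(model)}")
    lines.append(f"model_provider = {json.dumps(provider_id)}")
    lines.append("")
    lines.append(f"[model_providers.{json.dumps(provider_id)}]")
    lines.append(f"name = {json.dumps(env.get(PREFIX + 'PROVIDER_NAME', provider_id))}")
    lines.append(f"base_url = {json.dumps(base_url)}")
    for env_name, toml_key in (("WIRE_API", "wire_api"), ("ENV_KEY", "env_key")):
        value = env.get(PREFIX + env_name)
        if value:
            lines.append(f"{toml_key} = {json.dumps(value)}")
    if env.get(PREFIX + "REQUIRES_OPENAI_AUTH", "").lower() == "true":
        lines.append("requires_openai_auth = true")
    return "\n".join(lines) + "\n"


def write_once(home: Path, content: str) -> bool:
    """Write <home>/.codex/config.toml unless a runtime copy already exists."""
    target = home / ".codex" / "config.toml"
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return True
```

In this sketch, home would be the dedicated writable /home/codex mount rather than /tmp.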
The first supported Kubernetes DB mode is intentionally narrow:
- operator-managed external PostgreSQL
- no bundled in-cluster PostgreSQL deployment yet
- no automatic DB backup, restore, or lifecycle ownership by OpenHive
That means:
- OpenHive expects a reachable PostgreSQL URL in DATABASE_URL
- operators own PostgreSQL provisioning, TLS, backup, restore, and retention
- OpenHive only owns the migration execution contract described next
Run Alembic migrations in-cluster with the dedicated Job manifests before starting or upgrading the full Gateway runtime:
kubectl delete job openhive-db-migrate -n openhive --ignore-not-found=true
kubectl apply -k deploy/k8s/jobs/external-postgres-migration
kubectl wait -n openhive --for=condition=complete job/openhive-db-migrate --timeout=180s
kubectl logs -n openhive job/openhive-db-migrate

The migration Job:
- reuses the ghcr.io/terrywangcode/openhive-gateway:latest image
- runs alembic upgrade head
- reads only DATABASE_URL from the same openhive-platform-secrets secret
- is safe to rerun after deleting the previous Job object
The sandbox deployment also reads DATABASE_URL from an
openhive-platform-secrets Secret in the hive-sandbox namespace so dev-task
creation and review state persist through the shared PostgreSQL database.
Use the full platform overlay when you want the operator-facing dashboard, combined ingress, and full Gateway runtime together:
kubectl kustomize deploy/k8s/overlays/full-platform-runtime
kubectl apply -k deploy/k8s/overlays/full-platform-runtime

This overlay builds on top of full-gateway-runtime and adds:
- a hive-dashboard Deployment that runs the standalone Next.js server on port 3000
- a hive-dashboard Service for in-cluster and port-forward access
- HIVE_GATEWAY_INTERNAL_URL=http://hive-gateway.openhive.svc.cluster.local:8080 so dashboard-originated /api/* and /healthz requests stay same-origin to the browser while proxying in-cluster to Gateway
- a dedicated dashboard probe endpoint at GET /dashboard-healthz
- a combined Ingress that routes /api and /healthz to Gateway and / to the dashboard
The included Ingress is intentionally an example-ready default:
- ingressClassName: nginx
- cert-manager.io/cluster-issuer: letsencrypt-prod
- placeholder host openhive.example.com
- placeholder TLS secret openhive-example-tls
Before applying in a real cluster, patch deploy/k8s/overlays/full-platform-runtime/platform-ingress.yaml
for your hostnames, TLS secret, and ingress-controller annotations.
Without an ingress controller, you can still verify the operator-facing path by port-forwarding the dashboard service:
kubectl port-forward -n openhive svc/hive-dashboard 3000:3000

Then open http://127.0.0.1:3000. Dashboard login and authenticated API calls
still work because the dashboard container proxies /api/* and /healthz
through the in-cluster Gateway service.
Recommended order for the current productized preview slice:
- copy deploy/k8s/preview-installer/values.env.example to a private env file
- fill in real secret values, image references, and ingress settings
- run make k8s-preview-plan env_file=/path/to/env
- run make k8s-preview-install env_file=/path/to/env
- verify kubectl logs -n openhive deploy/hive-gateway
- verify kubectl logs -n openhive deploy/hive-dashboard
- verify dashboard reachability via ingress host or kubectl port-forward
Upgrade and rollback expectations for the supported preview scope:
- upgrade by updating image references or secrets in the env file, then rerun make k8s-preview-install
- the installer always reruns the migration Job before applying the runtime
- rollback is limited to reapplying a previously known-good env file; OpenHive does not provide automatic PostgreSQL rollback or backup orchestration
- do not roll back across schema-incompatible releases unless the database has been restored through the operator-owned PostgreSQL process
Failure behavior is intentionally explicit:
- if the migration Job fails, do not start the full platform runtime yet
- if Gateway starts before migrations are current, HIVE_STARTUP_MIGRATION_MODE=check makes the pod fail with a clear startup error instructing the operator to run the migration Job
- if PostgreSQL is unreachable, both the Job and the strict Gateway startup fail fast instead of serving a degraded control plane
- if the dashboard cannot reach Gateway, /dashboard-healthz still proves the dashboard container is alive while login and /api/* traffic will surface the upstream connectivity problem
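The fail-fast contract above can be pictured as a small gate that runs before the application starts serving. This is a sketch under stated assumptions, not Gateway's startup code: StartupError, startup_problems, and enforce_startup are hypothetical names, the list of required settings is illustrative, and the current and head revisions are passed in rather than read from Alembic.

```python
from typing import List, Mapping, Optional

# Illustrative subset; the real required set is defined by the application.
REQUIRED_SETTINGS = ("DATABASE_URL", "DASHBOARD_SESSION_SECRET", "HIVE_INTERNAL_SECRET")


class StartupError(RuntimeError):
    """Raised when strict startup refuses to serve a degraded control plane."""


def startup_problems(env: Mapping[str, str],
                     current_revision: Optional[str],
                     head_revision: str) -> List[str]:
    """Collect human-readable startup problems without raising."""
    problems = [f"missing required setting {name}"
                for name in REQUIRED_SETTINGS if not env.get(name)]
    in_check_mode = env.get("HIVE_STARTUP_MIGRATION_MODE") == "check"
    if in_check_mode and current_revision != head_revision:
        problems.append(
            f"database schema is at {current_revision!r}, expected Alembic head "
            f"{head_revision!r}; run the migration Job first"
        )
    return problems


def enforce_startup(env: Mapping[str, str],
                    current_revision: Optional[str],
                    head_revision: str) -> List[str]:
    """Raise in strict mode when anything is wrong; otherwise return the problems."""
    problems = startup_problems(env, current_revision, head_revision)
    if problems and env.get("HIVE_STRICT_STARTUP", "").lower() == "true":
        raise StartupError("; ".join(problems))
    return problems
```

With HIVE_STRICT_STARTUP=true the process exits on the raised error, which is what lets the startupProbe and pod restarts surface the failure to the operator.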
With the full Gateway runtime overlay, the application now selects
ContainerAgentPool automatically and the Kubernetes runtime backend manages
one Pod and one Service per OpenHive agent identity. The first supported
resource model is intentionally small and direct:
- deterministic runtime name derived from agent_id and safe for Kubernetes DNS
- Pod annotations for openhive.io/agent-id and openhive.io/controller-id
- Service DNS base URL of the form http://<runtime-name>.hive-agents.svc.cluster.local:8090
- per-agent config bootstrap through the existing agent image and bootstrap-config.sh init container
- ephemeral emptyDir workspace for the first lifecycle slice
That means operators can map an OpenHive agent runtime row back to a concrete
Kubernetes Pod/Service pair without relying on lossy label-safe rewrites of the
original agent_id.
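One way to get a deterministic, DNS-safe name while keeping the exact agent_id recoverable is to combine a readable slug with a short hash and store the original value in annotations. The sketch below illustrates that idea only. The naming scheme, the agent prefix, the 10-character SHA-256 suffix, and the helper names are assumptions; the backend's actual derivation may differ.

```python
import hashlib
import re

MAX_LABEL_LENGTH = 63  # DNS label limit that applies to Service names


def runtime_name(agent_id: str, prefix: str = "agent") -> str:
    """Derive a stable DNS-label-safe name from an arbitrary agent_id."""
    slug = re.sub(r"[^a-z0-9]+", "-", agent_id.lower()).strip("-")
    # The hash is taken over the exact agent_id, so ids that slug identically stay distinct.
    digest = hashlib.sha256(agent_id.encode("utf-8")).hexdigest()[:10]
    room = MAX_LABEL_LENGTH - len(prefix) - len(digest) - 2
    slug = slug[:room].strip("-")
    parts = [prefix, slug, digest] if slug else [prefix, digest]
    return "-".join(parts)


def runtime_metadata(agent_id: str, controller_id: str) -> dict:
    """Object metadata carrying the exact ids in annotations, not in lossy labels."""
    return {
        "name": runtime_name(agent_id),
        "annotations": {
            "openhive.io/agent-id": agent_id,
            "openhive.io/controller-id": controller_id,
        },
    }


def service_url(agent_id: str, namespace: str = "hive-agents", port: int = 8090) -> str:
    """Base URL of the per-agent Service in the form described above."""
    return f"http://{runtime_name(agent_id)}.{namespace}.svc.cluster.local:{port}"
```

Annotations accept arbitrary strings, which is why the unmodified agent_id goes there while the object name only needs to be stable and valid.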
With HIVE_DEPLOYMENT_BACKEND=kubernetes, Keeper's deploy_pipeline and
get_pipeline_status tools stop falling back to Docker-only behavior and use a
Kubernetes-native workload contract instead.
The first supported mapping is intentionally narrow:
- recurring schedules such as 0 * * * * -> CronJob
- one-off markers such as an empty schedule or @once -> Job
- long-running worker-style Deployment workloads remain out of the first supported slice
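The mapping above is small enough to express as a single function. This is an illustrative sketch rather than the plugin's code: workload_kind is a hypothetical name, and rejecting anything that is neither a one-off marker nor a five-field cron expression is an assumption about how unsupported schedules are handled.

```python
from typing import Optional


def workload_kind(schedule: Optional[str]) -> str:
    """Map a pipeline schedule to the Kubernetes workload kind used to run it."""
    normalized = (schedule or "").strip()
    if normalized in ("", "@once"):
        return "Job"  # one-off marker
    if len(normalized.split()) == 5:
        return "CronJob"  # standard five-field cron expression
    raise ValueError(f"unsupported schedule for the first supported slice: {schedule!r}")
```

For example, workload_kind("0 * * * *") selects a CronJob, while an empty schedule or @once selects a Job.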
The plugin writes the same minimum pipeline manifest that local preview flows already use:
project_id: proj_a
pipeline_name: hourly-classify
schedule: 0 * * * *
steps:
- fetch
- classify

In the Kubernetes path, that manifest is rendered into a managed ConfigMap
named from the project and pipeline identity, then mounted into the Pipeline
workload at:
- HIVE_WORKSPACE=/data/pipelines
- /data/pipelines/<project_id>/pipeline/pipeline.yaml
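A rough picture of that rendering step, written as plain dictionaries, is shown here. It is a sketch only: the ConfigMap name format, the pipeline.yaml data key, and placing the openhive.io annotations on the ConfigMap are all assumptions made for illustration, and both helper names are hypothetical.

```python
def manifest_path(project_id: str, workspace: str = "/data/pipelines") -> str:
    """Path where the workload expects to find the mounted pipeline manifest."""
    return f"{workspace}/{project_id}/pipeline/pipeline.yaml"


def pipeline_configmap(project_id: str, pipeline_name: str, manifest_yaml: str,
                       namespace: str = "hive-sandbox") -> dict:
    """Build a ConfigMap object that carries the pipeline manifest text."""
    # Hypothetical naming scheme; the real name derivation is not documented here.
    name = f"pipeline-{project_id}-{pipeline_name}".lower().replace("_", "-")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                "openhive.io/project-id": project_id,
                "openhive.io/pipeline-name": pipeline_name,
            },
        },
        "data": {"pipeline.yaml": manifest_yaml},
    }
```

The workload would then mount the pipeline.yaml key at manifest_path(project_id) with HIVE_WORKSPACE set to the same workspace root.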
The first supported runtime target is the sandbox image:
- namespace: hive-sandbox
- image: ghcr.io/terrywangcode/openhive-sandbox:latest
- command: python -m hive.pipelines.job_entrypoint
Status reporting is also Kubernetes-native:
- get_pipeline_status lists managed CronJob and Job workloads
- recurring workloads surface the latest child Job state when one failed
- operators can confirm the same workload with kubectl get cronjobs,jobs -n hive-sandbox -l openhive.io/pipeline-managed=kubernetes
- workload annotations include openhive.io/project-id, openhive.io/pipeline-name, and openhive.io/resource-kind so a failed Job can be mapped back to the OpenHive pipeline without opening the container
In this first Kubernetes DB mode, PostgreSQL backups and restores remain an operator responsibility. Before applying migrations in production-like environments, take an external backup such as:
PGPASSWORD="$DB_PASSWORD" pg_dump \
--format=custom \
--host="$DB_HOST" \
--port="${DB_PORT:-5432}" \
--username="$DB_USER" \
--dbname="$DB_NAME" \
> openhive-pre-migration.dump

OpenHive's migration Job does not create backups, does not drop databases, and does not own restore orchestration for external PostgreSQL.
Operator notes:
- the overlay assumes PostgreSQL is reachable on TCP 5432; patch the NetworkPolicy if your managed DB uses a different port or a narrower egress CIDR
- the Gateway workspace mount is currently emptyDir, which is enough for startup and API reachability checks but not yet a durable multi-pod storage story
- the full platform overlay includes a combined ingress example, but operators still own ingress-controller installation and host-specific TLS configuration
The agent deployment mounts agent-config-pvc at /data/config and runs the
config-bootstrap init container before the main container starts.
Bootstrap behavior:
- source defaults: /app/defaults/agent
- target volume: /data/config
- copy mode: only copy files that do not already exist
- version marker: write /data/config/.version
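The no-clobber rule can be restated in a few lines of Python. The real bootstrap is the bootstrap-config.sh init container, so treat this as an equivalent sketch of the behavior listed above, not a copy of that script; bootstrap_config is a hypothetical name and the version string format is an assumption.

```python
import shutil
from pathlib import Path
from typing import List


def bootstrap_config(defaults: Path, target: Path, version: str) -> List[str]:
    """Copy default files that are missing from target and record the version marker."""
    target.mkdir(parents=True, exist_ok=True)
    copied: List[str] = []
    for source in sorted(defaults.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(defaults)
        destination = target / relative
        if destination.exists():
            continue  # never overwrite a file that is already on the volume
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, destination)
        copied.append(relative.as_posix())
    (target / ".version").write_text(version + "\n")
    return copied
```

Called as bootstrap_config(Path("/app/defaults/agent"), Path("/data/config"), version), a second run copies only files added to the defaults since the first run and leaves edited files alone.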
To verify no-clobber behavior:
- Write a local edit into the config volume:
  kubectl exec -n hive-agents deploy/hive-agent -- sh -c "echo custom > /data/config/HEARTBEAT.md"
- Restart the pod:
  kubectl rollout restart -n hive-agents deploy/hive-agent
- Confirm the edit survived:
  kubectl exec -n hive-agents deploy/hive-agent -- cat /data/config/HEARTBEAT.md
- Update the image tag or bootstrap version, then confirm newly added default files appear without replacing edited files
These are the minimum checks before closing #35:
- Agent pod can reach Gateway on port 8080
- Agent pod can reach Sandbox on port 8091
- Agent pod cannot reach an arbitrary external address
- Sandbox pod can reach Gateway on port 8080
- Sandbox pod cannot reach Agent on port 8090
Example manual checks:
kubectl exec -n hive-agents deploy/hive-agent -- wget -qO- http://hive-gateway.openhive.svc.cluster.local:8080/healthz
kubectl exec -n hive-agents deploy/hive-agent -- wget -qO- http://hive-sandbox.hive-sandbox.svc.cluster.local:8091/healthz
kubectl exec -n hive-sandbox deploy/hive-sandbox -- wget -qO- http://hive-gateway.openhive.svc.cluster.local:8080/healthz

Expected blocked checks:
kubectl exec -n hive-agents deploy/hive-agent -- wget -T 3 -qO- https://example.com
kubectl exec -n hive-sandbox deploy/hive-sandbox -- wget -T 3 -qO- http://hive-agent.hive-agents.svc.cluster.local:8090/healthz

The base sandbox policy denies arbitrary external egress. Add per-environment
whitelists through an overlay that appends explicit ipBlock egress rules for
the sandbox namespace.
Example:
kubectl kustomize deploy/k8s/overlays/sandbox-whitelist-example

The Kubernetes baseline assumes these writable paths:
- Gateway: none required in the health-only baseline; /data/hive in the full runtime overlay
- Dashboard: none required in the standalone runtime image
- Agent: /data/config
- Sandbox: /sandbox/commands, /sandbox/tasks, and /tmp
The sandbox deployment mounts explicit writable volumes because the runtime contract keeps the root filesystem read-only while still allowing task-local logs, scratch files, and artifacts.
Runtime pods now expose enough probe metadata to separate "still starting" from "bootstrapped incorrectly":
- status=starting means the runtime has not finished initialization
- status=error means bootstrap failed and readiness_reason carries the actionable failure text
- agent_id, project_id, controller_id, and deployment_backend map the pod back to OpenHive ownership without relying on log guessing
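When triaging many runtime pods, the probe fields above can be folded into a one-line summary. The helper below is a hypothetical sketch: describe_probe is not part of OpenHive, it assumes the /healthz body is a flat JSON object with the field names listed above, and any status other than starting or error is reported verbatim because no other values are documented here.

```python
from typing import Any, Mapping

OWNERSHIP_FIELDS = ("agent_id", "project_id", "controller_id", "deployment_backend")


def describe_probe(payload: Mapping[str, Any]) -> str:
    """Summarize a runtime /healthz payload for an operator."""
    status = payload.get("status")
    if status == "starting":
        verdict = "still starting"
    elif status == "error":
        reason = payload.get("readiness_reason", "no readiness_reason reported")
        verdict = f"bootstrap failed: {reason}"
    else:
        verdict = f"status={status}"
    # Append whichever ownership fields the pod reported.
    owner = ", ".join(f"{key}={payload[key]}" for key in OWNERSHIP_FIELDS if payload.get(key))
    return f"{verdict} ({owner})" if owner else verdict
```

Feeding it the parsed JSON from the healthz endpoint distinguishes a pod that needs more time from one that needs a configuration fix.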
Useful checks:
kubectl get pod -n hive-agents <runtime-pod> -o jsonpath='{.metadata.annotations}'
kubectl exec -n hive-agents <runtime-pod> -- wget -qO- http://localhost:8090/healthz | jq .
kubectl get cronjobs,jobs -n hive-sandbox -o json | jq '.items[] | {name: .metadata.name, annotations: .metadata.annotations}'

Installer validation steps after make k8s-preview-install:
kubectl get pods -n openhive
kubectl get pods -n hive-agents
kubectl get pods -n hive-sandbox
kubectl logs -n openhive deploy/hive-gateway --tail=100
kubectl logs -n openhive deploy/hive-dashboard --tail=100
kubectl get ingress -n openhive openhive-platform

Keeper dev-task responses also expose sandbox runtime mapping through
GET /dev-tasks/{task_id}:
- runtime.backend_run_id links the OpenHive task to the backend execution row
- runtime.execution_class shows whether the task ran as local, readonly, or networked sandbox work
- runtime.artifact_root and runtime.log_root tell operators where to look next