OpenHive Kubernetes Baseline

This document describes the preview-era Kubernetes deployment baseline plus the current productized control-plane path for issues #35, #36, #286, #287, #288, #289, #290, #291, #292, and #293.

Scope

The Kubernetes manifests now provide:

separate images for Gateway, Dashboard, Agent runtime, and Sandbox runtime
three namespaces: openhive, hive-agents, hive-sandbox
default-deny-style NetworkPolicy boundaries for agent and sandbox traffic
an init-container bootstrap flow that copies default agent config files into a PVC without overwriting local edits
a full Gateway runtime overlay with strict startup and explicit migration checks
a full platform runtime overlay that adds the Next.js dashboard and a combined Ingress resource

The Kubernetes runtime backend for ContainerAgentPool now exists. It creates one Pod plus one Service per managed agent identity and carries the exact OpenHive agent_id and controller ownership metadata in annotations, so orphan reconciliation remains safe. The Agent and Sandbox images are therefore no longer just probeable placeholders: the Agent image is the real per-agent runtime target when HIVE_AGENT_CONFIG_JSON, HIVE_GATEWAY_URL, and HIVE_INTERNAL_SECRET are provided.

The base manifests still do not wire the main application into ContainerAgentPool; that selection is made by the full Gateway runtime overlay described later in this document. Without that overlay, the base agent Deployment remains the preview-era health-only baseline, while the runtime backend is available for the dynamic per-agent lifecycle path. The sandbox deployment also carries the experimental dev-task API used by the governed workspace-task lane, but that lane is not yet a fully supported preview_local operator workflow.

This Kubernetes baseline therefore complements the source-based preview_local guide rather than replacing it. In local preview docs, the matching sandbox operator surface is still the optional make run-sandbox workflow on port 8091 with HIVE_SANDBOX_URL=http://127.0.0.1:8091.

For the role-by-role runtime contract behind these manifests, see docs/container-runtime-contracts.md.

Build Images

Published images live in GitHub Container Registry:

ghcr.io/terrywangcode/openhive-gateway
ghcr.io/terrywangcode/openhive-agent
ghcr.io/terrywangcode/openhive-sandbox
ghcr.io/terrywangcode/openhive-dashboard

Tag strategy for the supported preview path:

latest tracks the newest successful push from main
sha-<short> is published for every push to main
v* release tags publish the matching semver image tag

The Kubernetes manifests default to the GHCR latest tags. For repeatable preview installs, pin the preview installer env file to either a sha-<short> or release tag before applying a real cluster update.
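As a minimal sketch of pinning, assuming the env file stores each image as a NAME=registry/repo:tag line (the real values.env.example layout may differ), a small helper can rewrite every OPENHIVE_*_IMAGE tag in one pass. pin_images is a hypothetical name and is not part of the installer.

```shell
# Hypothetical helper: pin every OPENHIVE_*_IMAGE line in an env file to one tag.
# Assumes lines of the form NAME=registry/repo:tag; other lines are left untouched.
pin_images() {
  env_file="$1"
  tag="$2"
  tmp_file="$(mktemp)"
  # Replace the text after the last ':' (the tag) on matching lines only.
  sed -E "s|^(OPENHIVE_[A-Z]+_IMAGE=.*):[^:/]*\$|\1:${tag}|" "$env_file" > "$tmp_file" &&
    mv "$tmp_file" "$env_file"
}
```

Usage would look like pin_images /tmp/openhive-preview.env sha-<short>, followed by make k8s-preview-plan to review the rendered result before installing.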

You can still build the same images locally for debugging:

docker build -f Dockerfile.gateway -t openhive-gateway:dev .
docker build -f Dockerfile.web -t openhive-dashboard:dev .
docker build -f Dockerfile.agent -t openhive-agent:dev .
docker build -f Dockerfile.sandbox -t openhive-sandbox:dev .

The dashboard image uses Next.js standalone output and starts the generated server.js runtime on port 3000.

Preview Install Surface

The supported operator-facing packaging surface for the current preview slice is a narrow, env-file-driven installer:

cp deploy/k8s/preview-installer/values.env.example /tmp/openhive-preview.env
# edit /tmp/openhive-preview.env with real images, secrets, DB URL, and ingress host

make k8s-preview-plan env_file=/tmp/openhive-preview.env
make k8s-preview-install env_file=/tmp/openhive-preview.env

The installer intentionally keeps scope narrow:

fixed namespaces: openhive, hive-agents, hive-sandbox
fixed secret name: openhive-platform-secrets
supported DB mode: external PostgreSQL only
supported runtime packaging: one migration Job plus one full platform overlay

The values file is the only required operator input surface for this preview path. It captures:

DATABASE_URL
DASHBOARD_SESSION_SECRET
HIVE_INTERNAL_SECRET
HIVE_ADMIN_USERNAME
HIVE_ADMIN_PASSWORD
OPENHIVE_GATEWAY_IMAGE
OPENHIVE_DASHBOARD_IMAGE
OPENHIVE_AGENT_IMAGE
OPENHIVE_SANDBOX_IMAGE
OPENHIVE_PIPELINE_IMAGE (optional; defaults to sandbox image)
OPENHIVE_INGRESS_HOST
OPENHIVE_INGRESS_TLS_SECRET
OPENHIVE_INGRESS_CLASS
OPENHIVE_CERT_MANAGER_CLUSTER_ISSUER

The installer renders a temporary values-specific Kustomize tree, then runs the repeatable sequence:

apply namespaces
apply openhive-platform-secrets
rerun the migration Job and wait for success
apply the full platform runtime
wait for Gateway, Dashboard, Agent, and Sandbox rollouts
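The final wait step can also be reproduced by hand. The sketch below is not the installer's own code: the Deployment names and namespaces are taken from commands elsewhere in this document, while the function name wait_for_rollouts and the default timeout are assumptions.

```shell
# Sketch: wait for the four runtime Deployments to finish rolling out.
# Each entry is <namespace>/<deployment>; the 180s default is an assumed value.
wait_for_rollouts() {
  timeout="${1:-180s}"
  for target in openhive/hive-gateway openhive/hive-dashboard \
                hive-agents/hive-agent hive-sandbox/hive-sandbox; do
    ns="${target%%/*}"
    name="${target##*/}"
    # Stop at the first Deployment that does not become ready in time.
    kubectl rollout status "deployment/${name}" -n "$ns" --timeout="$timeout" || return 1
  done
}
```

Calling wait_for_rollouts 300s after an install returns non-zero as soon as one rollout times out, which makes it easy to chain into a script.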

Render and Apply

kubectl kustomize deploy/k8s/base
kubectl apply -k deploy/k8s/base

CI and local validation:

make k8s-validate
./scripts/k8s/verify-bootstrap.sh

Full Gateway Runtime Overlay

The base deployment intentionally keeps Gateway on the health-only hive.container.gateway_entrypoint:app entrypoint so low-level image and probe checks stay lightweight.

To run the real OpenHive control plane in-cluster, use:

kubectl kustomize deploy/k8s/overlays/full-gateway-runtime
kubectl apply -k deploy/k8s/overlays/full-gateway-runtime

This overlay keeps the same Service and image, but patches the Gateway deployment to:

start hive.main:app
run under the dedicated hive-gateway-runtime ServiceAccount, with namespace-scoped RBAC in hive-agents and hive-sandbox
mount a writable workspace at /data/hive
select HIVE_POOL_BACKEND=container with HIVE_CONTAINER_RUNTIME_BACKEND=kubernetes so Gateway creates isolated agent runtimes through the Kubernetes backend instead of the in-process local pool
point those isolated runtimes back at Gateway through HIVE_AGENT_RUNTIME_GATEWAY_URL=http://hive-gateway.openhive.svc.cluster.local:8080
declare HIVE_AGENT_RUNTIME_IMAGE=ghcr.io/terrywangcode/openhive-agent:latest and HIVE_K8S_AGENT_NAMESPACE=hive-agents for the first supported lifecycle path
declare HIVE_SANDBOX_URL=http://hive-sandbox.hive-sandbox.svc.cluster.local:8091 so the full runtime path reaches the in-cluster sandbox service explicitly
enable HIVE_STRICT_STARTUP=true so missing secrets or DB startup failures crash the pod instead of serving a degraded control plane
switch startup into HIVE_STARTUP_MIGRATION_MODE=check, which requires the DB schema to already be at the current Alembic head
disable HIVE_METADATA_CREATE_ON_STARTUP so the K8s path does not mutate the schema outside the explicit migration flow
add a startupProbe on /healthz so migrations and startup wiring can complete before liveness checks take over
widen Gateway ingress to the openhive namespace so an in-cluster dashboard can reach the API
allow Gateway egress to TCP 8090 so it can reach per-agent runtime Services
allow Gateway egress to TCP 8091 so it can reach the shared sandbox Service
allow Gateway egress to TCP 5432 for the first external PostgreSQL target
allow sandbox egress to TCP 5432 for the shared PostgreSQL-backed dev-task state
allow sandbox egress to TCP 443 so the direct codex_cli path can reach HTTPS model/provider endpoints when operators enable real provider-backed execution
allow Gateway egress to TCP 443 so container-pool mode can reach the Kubernetes API and the normal HTTPS upstreams used by the full control plane

Create the required secrets before applying the overlay:

kubectl create secret generic openhive-platform-secrets -n openhive \
  --from-literal=database-url='postgresql+asyncpg://hive:password@postgres.example:5432/hive' \
  --from-literal=dashboard-session-secret='replace-with-random-secret' \
  --from-literal=gateway-internal-secret='replace-with-internal-relay-secret' \
  --from-literal=admin-username='admin' \
  --from-literal=admin-password='change-me'

kubectl create secret generic openhive-platform-secrets -n hive-sandbox \
  --from-literal=database-url='postgresql+asyncpg://hive:password@postgres.example:5432/hive' \
  --from-literal=codex-model='gpt-5.4' \
  --from-literal=codex-provider-id='club' \
  --from-literal=codex-provider-name='ai code club' \
  --from-literal=codex-base-url='https://claude-code.club/openai' \
  --from-literal=codex-wire-api='responses' \
  --from-literal=codex-env-key='OPENAI_API_KEY' \
  --from-literal=codex-requires-openai-auth='true' \
  --from-literal=openai-api-key='replace-if-using-real-codex-cli' \
  --from-literal=anthropic-api-key='replace-if-using-real-codex-cli' \
  --from-literal=deepseek-api-key='replace-if-using-real-codex-cli' \
  --from-literal=qwen-api-key='replace-if-using-real-codex-cli'
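The placeholder secret values above should be replaced with real random material. One way to generate it with standard tools is sketched below; random_secret is a hypothetical helper name, and the 32-byte default is an assumption rather than an OpenHive requirement.

```shell
# Hypothetical helper: print N random bytes from /dev/urandom as lowercase hex.
# 32 bytes (the assumed default) yields a 64-character string.
random_secret() {
  bytes="${1:-32}"
  head -c "$bytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}
```

For example, pass --from-literal=dashboard-session-secret="$(random_secret)" instead of the literal placeholder when creating the Secret.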

Minimum env/secret surface for the full runtime overlay:

DATABASE_URL
DASHBOARD_SESSION_SECRET
HIVE_INTERNAL_SECRET
HIVE_ADMIN_USERNAME
HIVE_ADMIN_PASSWORD
HIVE_WORKSPACE=/data/hive
HIVE_POOL_BACKEND=container
HIVE_CONTAINER_RUNTIME_BACKEND=kubernetes
HIVE_AGENT_RUNTIME_GATEWAY_URL=http://hive-gateway.openhive.svc.cluster.local:8080
HIVE_AGENT_RUNTIME_IMAGE=ghcr.io/terrywangcode/openhive-agent:latest
HIVE_K8S_AGENT_NAMESPACE=hive-agents
HIVE_DEPLOYMENT_BACKEND=kubernetes
HIVE_PIPELINE_IMAGE=ghcr.io/terrywangcode/openhive-sandbox:latest
HIVE_K8S_PIPELINE_NAMESPACE=hive-sandbox
HIVE_SANDBOX_URL=http://hive-sandbox.hive-sandbox.svc.cluster.local:8091
sandbox codex_cli auth:
- HIVE_SANDBOX_CODEX_AUTH_MODE=env
- HIVE_SANDBOX_CODEX_SANDBOX_MODE=danger-full-access in the Kubernetes sandbox pod, because the pod itself is the outer isolation boundary and the default inner bubblewrap sandbox is not available in this runtime
- HIVE_SANDBOX_CODEX_ENV_ALLOWLIST=OPENAI_API_KEY,ANTHROPIC_API_KEY,DEEPSEEK_API_KEY,QWEN_API_KEY
- optional provider bootstrap envs HIVE_SANDBOX_CODEX_MODEL, HIVE_SANDBOX_CODEX_PROVIDER_ID, HIVE_SANDBOX_CODEX_PROVIDER_NAME, HIVE_SANDBOX_CODEX_BASE_URL, HIVE_SANDBOX_CODEX_WIRE_API, HIVE_SANDBOX_CODEX_ENV_KEY, and HIVE_SANDBOX_CODEX_REQUIRES_OPENAI_AUTH let the sandbox materialize ~/.codex/config.toml at runtime when a real OpenAI-compatible Codex provider needs more than a bare API key
- optional sandbox-namespace secret keys openai-api-key, anthropic-api-key, deepseek-api-key, qwen-api-key, plus optional codex-model, codex-provider-id, codex-provider-name, codex-base-url, codex-wire-api, codex-env-key, and codex-requires-openai-auth
optional sandbox workspace apply relay:
- HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_URL points to an operator-controlled HTTP endpoint that can apply approved patches to host-local workspaces when the sandbox was seeded from workspace_archive_b64
- HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_TOKEN is sent as a bearer token to that endpoint
- this relay is only for approved patch apply-back; it is distinct from the model-token relay and does not change provider-secret residency claims
- if the relay endpoint is outside the cluster, operators must add a matching sandbox egress rule for that host and port
HIVE_STRICT_STARTUP=true
HIVE_STARTUP_MIGRATION_MODE=check
HIVE_METADATA_CREATE_ON_STARTUP=false

The base sandbox deployment now opts the in-cluster codex_cli path into this explicit provider-env mode by wiring those optional secret refs into the sandbox runtime env. If the secret keys are absent, the pod still starts and the governed codex child receives no provider keys. This direct-env mode is operationally useful for real codex_cli execution, but it is still distinct from the opt-in relay_helper proof path and any future token-based relay flow.

When provider bootstrap envs are present, the sandbox entrypoint writes a minimal runtime ~/.codex/config.toml once at startup and leaves that runtime copy in place, so the default codex_cli path can target an OpenAI-compatible provider without baking provider config into the image. The base manifest now mounts a dedicated writable home at /home/codex; using /tmp as HOME causes real non-ephemeral Codex runs to stall in containerized proof/runtime paths.

Supported Kubernetes DB Mode

The first supported Kubernetes DB mode is intentionally narrow:

operator-managed external PostgreSQL
no bundled in-cluster PostgreSQL deployment yet
no automatic DB backup, restore, or lifecycle ownership by OpenHive

That means:

OpenHive expects a reachable PostgreSQL URL in DATABASE_URL
operators own PostgreSQL provisioning, TLS, backup, restore, and retention
OpenHive only owns the migration execution contract described next

Repeatable Migration Job

Run Alembic migrations in-cluster with the dedicated Job manifests before starting or upgrading the full Gateway runtime:

kubectl delete job openhive-db-migrate -n openhive --ignore-not-found=true
kubectl apply -k deploy/k8s/jobs/external-postgres-migration
kubectl wait -n openhive --for=condition=complete job/openhive-db-migrate --timeout=180s
kubectl logs -n openhive job/openhive-db-migrate

The migration Job:

reuses the ghcr.io/terrywangcode/openhive-gateway:latest image
runs alembic upgrade head
reads only DATABASE_URL from the same openhive-platform-secrets secret
is safe to rerun after deleting the previous Job object
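The rerun sequence can be wrapped so that a failed migration stops a script before any runtime overlay is applied. This is a sketch built only from the commands shown above; the function name run_migration and the choice to print Job logs on failure are assumptions.

```shell
# Sketch: rerun the migration Job and return non-zero if it does not complete.
run_migration() {
  kubectl delete job openhive-db-migrate -n openhive --ignore-not-found=true || return 1
  kubectl apply -k deploy/k8s/jobs/external-postgres-migration || return 1
  if ! kubectl wait -n openhive --for=condition=complete \
      job/openhive-db-migrate --timeout=180s; then
    # Surface the Alembic output so the operator can see why the Job failed.
    kubectl logs -n openhive job/openhive-db-migrate >&2
    echo "migration did not complete; do not apply the runtime overlay" >&2
    return 1
  fi
}
```

A caller can then write run_migration && kubectl apply -k deploy/k8s/overlays/full-gateway-runtime. Note that kubectl wait with condition=complete only returns early on success, so a failed Job is reported once the timeout expires.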

The sandbox deployment also reads DATABASE_URL from an openhive-platform-secrets Secret in the hive-sandbox namespace so dev-task creation and review state persist through the shared PostgreSQL database.

Full Platform Runtime Overlay

Use the full platform overlay when you want the operator-facing dashboard, combined ingress, and full Gateway runtime together:

kubectl kustomize deploy/k8s/overlays/full-platform-runtime
kubectl apply -k deploy/k8s/overlays/full-platform-runtime

This overlay builds on top of full-gateway-runtime and adds:

a hive-dashboard Deployment that runs the standalone Next.js server on port 3000
a hive-dashboard Service for in-cluster and port-forward access
HIVE_GATEWAY_INTERNAL_URL=http://hive-gateway.openhive.svc.cluster.local:8080 so dashboard-originated /api/* and /healthz requests stay same-origin to the browser while proxying in-cluster to Gateway
a dedicated dashboard probe endpoint at GET /dashboard-healthz
a combined Ingress that routes /api and /healthz to Gateway and / to the dashboard

The included Ingress is intentionally an example default built from placeholder values:

ingressClassName: nginx
cert-manager.io/cluster-issuer: letsencrypt-prod
placeholder host openhive.example.com
placeholder TLS secret openhive-example-tls

Before applying in a real cluster, patch deploy/k8s/overlays/full-platform-runtime/platform-ingress.yaml for your hostnames, TLS secret, and ingress-controller annotations.

Without an ingress controller, you can still verify the operator-facing path by port-forwarding the dashboard service:

kubectl port-forward -n openhive svc/hive-dashboard 3000:3000

Then open http://127.0.0.1:3000. Dashboard login and authenticated API calls still work because the dashboard container proxies /api/* and /healthz through the in-cluster Gateway service.

Recommended order for the current productized preview slice:

copy deploy/k8s/preview-installer/values.env.example to a private env file
fill in real secret values, image references, and ingress settings
run make k8s-preview-plan env_file=/path/to/env
run make k8s-preview-install env_file=/path/to/env
verify kubectl logs -n openhive deploy/hive-gateway
verify kubectl logs -n openhive deploy/hive-dashboard
verify dashboard reachability via ingress host or kubectl port-forward

Upgrade and rollback expectations for the supported preview scope:

upgrade by updating image references or secrets in the env file, then rerun make k8s-preview-install
the installer always reruns the migration Job before applying the runtime
rollback is limited to reapplying a previously known-good env file; OpenHive does not provide automatic PostgreSQL rollback or backup orchestration
do not roll back across schema-incompatible releases unless the database has been restored through the operator-owned PostgreSQL process

Failure behavior is intentionally explicit:

if the migration Job fails, do not start the full platform runtime yet
if Gateway starts before migrations are current, HIVE_STARTUP_MIGRATION_MODE=check makes the pod fail with a clear startup error instructing the operator to run the migration Job
if PostgreSQL is unreachable, both the Job and the strict Gateway startup fail fast instead of serving a degraded control plane
if the dashboard cannot reach Gateway, /dashboard-healthz still proves the dashboard container is alive while login and /api/* traffic will surface the upstream connectivity problem

Kubernetes Agent Runtime Backend

With the full Gateway runtime overlay, the application now selects ContainerAgentPool automatically and the Kubernetes runtime backend manages one Pod and one Service per OpenHive agent identity. The first supported resource model is intentionally small and direct:

deterministic runtime name derived from agent_id and safe for Kubernetes DNS
Pod annotations for openhive.io/agent-id and openhive.io/controller-id
Service DNS base URL of the form http://<runtime-name>.hive-agents.svc.cluster.local:8090
per-agent config bootstrap through the existing agent image and bootstrap-config.sh init container
ephemeral emptyDir workspace for the first lifecycle slice

That means operators can map an OpenHive agent runtime row back to a concrete Kubernetes Pod/Service pair without relying on lossy label-safe rewrites of the original agent_id.
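This document does not spell out how the runtime name is derived. Purely as an illustration of what a deterministic, DNS-safe derivation can look like, the sketch below slugs the agent_id and appends a checksum of the original value. The function name runtime_name, the agent- prefix, and the use of cksum are all assumptions and are not OpenHive's actual algorithm.

```shell
# Illustrative only: derive a deterministic, DNS-label-safe name from an agent_id.
runtime_name() {
  agent_id="$1"
  # Lowercase, turn every run of non-alphanumerics into '-', trim edge hyphens.
  slug="$(printf '%s' "$agent_id" | tr '[:upper:]' '[:lower:]' |
    sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')"
  # A checksum of the original id makes collisions between similar ids unlikely.
  sum="$(printf '%s' "$agent_id" | cksum | cut -d ' ' -f 1)"
  # Truncate so that prefix + slug + checksum stays within the 63-character limit.
  slug="$(printf '%s' "$slug" | cut -c 1-40 | sed -E 's/-+$//')"
  printf 'agent-%s-%s\n' "$slug" "$sum"
}
```

Because a slug like this is lossy, the exact agent_id still has to be stored somewhere lossless, which is the role the openhive.io/agent-id annotation plays in the resource model above.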

Kubernetes Pipeline Workloads

With HIVE_DEPLOYMENT_BACKEND=kubernetes, Keeper's deploy_pipeline and get_pipeline_status tools stop falling back to Docker-only behavior and use a Kubernetes-native workload contract instead.

The first supported mapping is intentionally narrow:

recurring schedules such as 0 * * * * -> CronJob
one-off markers such as an empty schedule or @once -> Job
long-running worker-style Deployment workloads remain out of the first supported slice
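The mapping above can be summarized as a small decision function. This is a sketch, not the plugin's code: workload_kind is a hypothetical name, and treating every schedule other than the two one-off markers as recurring is an assumption.

```shell
# Sketch: map a pipeline schedule string to a Kubernetes workload kind.
workload_kind() {
  case "$1" in
    # One-off markers named in this document: empty schedule or @once.
    ''|'@once') echo Job ;;
    # Assumption: anything else is treated as a recurring cron expression.
    *) echo CronJob ;;
  esac
}
```

For example, workload_kind '0 * * * *' selects CronJob, while workload_kind '@once' selects Job.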

The plugin writes the same minimum pipeline manifest that local preview flows already use:

project_id: proj_a
pipeline_name: hourly-classify
schedule: 0 * * * *
steps:
  - fetch
  - classify

In the Kubernetes path, that manifest is rendered into a managed ConfigMap named from the project and pipeline identity, then mounted into the Pipeline workload at:

HIVE_WORKSPACE=/data/pipelines
/data/pipelines/<project_id>/pipeline/pipeline.yaml

The first supported runtime target is the sandbox image:

namespace: hive-sandbox
image: ghcr.io/terrywangcode/openhive-sandbox:latest
command: python -m hive.pipelines.job_entrypoint

Status reporting is also Kubernetes-native:

get_pipeline_status lists managed CronJob and Job workloads
recurring workloads surface the state of the latest child Job when it has failed
operators can confirm the same workload with kubectl get cronjobs,jobs -n hive-sandbox -l openhive.io/pipeline-managed=kubernetes
workload annotations include openhive.io/project-id, openhive.io/pipeline-name, and openhive.io/resource-kind so a failed Job can be mapped back to the OpenHive pipeline without opening the container

Backup And Restore Expectations

In this first Kubernetes DB mode, PostgreSQL backups and restores remain an operator responsibility. Before applying migrations in production-like environments, take an external backup such as:

PGPASSWORD="$DB_PASSWORD" pg_dump \
  --format=custom \
  --host="$DB_HOST" \
  --port="${DB_PORT:-5432}" \
  --username="$DB_USER" \
  --dbname="$DB_NAME" \
  > openhive-pre-migration.dump

OpenHive's migration Job does not create backups, does not drop databases, and does not own restore orchestration for external PostgreSQL.

Operator notes:

the overlay assumes PostgreSQL is reachable on TCP 5432; patch the NetworkPolicy if your managed DB uses a different port or a narrower egress CIDR
the Gateway workspace mount is currently emptyDir, which is enough for startup and API reachability checks but not yet a durable multi-pod storage story
the full platform overlay includes a combined ingress example, but operators still own ingress-controller installation and host-specific TLS configuration

Config Bootstrap Flow

The agent deployment mounts agent-config-pvc at /data/config and runs the config-bootstrap init container before the main container starts.

Bootstrap behavior:

source defaults: /app/defaults/agent
target volume: /data/config
copy mode: only copy files that do not already exist
version marker: write /data/config/.version

To verify no-clobber behavior:

Write a custom edit: kubectl exec -n hive-agents deploy/hive-agent -- sh -c "echo custom > /data/config/HEARTBEAT.md"
Restart the pod: kubectl rollout restart -n hive-agents deploy/hive-agent
Confirm the edit survived: kubectl exec -n hive-agents deploy/hive-agent -- cat /data/config/HEARTBEAT.md
Update the image tag or bootstrap version, then confirm newly added default files appear without replacing edited files

NetworkPolicy Assertions

These are the minimum checks before closing #35:

Agent pod can reach Gateway on port 8080
Agent pod can reach Sandbox on port 8091
Agent pod cannot reach an arbitrary external address
Sandbox pod can reach Gateway on port 8080
Sandbox pod cannot reach Agent on port 8090

Example manual checks:

kubectl exec -n hive-agents deploy/hive-agent -- wget -qO- http://hive-gateway.openhive.svc.cluster.local:8080/healthz
kubectl exec -n hive-agents deploy/hive-agent -- wget -qO- http://hive-sandbox.hive-sandbox.svc.cluster.local:8091/healthz
kubectl exec -n hive-sandbox deploy/hive-sandbox -- wget -qO- http://hive-gateway.openhive.svc.cluster.local:8080/healthz

Expected blocked checks:

kubectl exec -n hive-agents deploy/hive-agent -- wget -T 3 -qO- https://example.com
kubectl exec -n hive-sandbox deploy/hive-sandbox -- wget -T 3 -qO- http://hive-agent.hive-agents.svc.cluster.local:8090/healthz
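For scripted checks, the blocked cases need inverted exit-code handling, because here a failing command is the passing outcome. A minimal sketch, with expect_blocked as a hypothetical helper name:

```shell
# Sketch: succeed only when the wrapped command fails, as blocked traffic should.
expect_blocked() {
  if "$@" >/dev/null 2>&1; then
    echo "FAIL (reachable): $*" >&2
    return 1
  fi
  echo "ok (blocked): $*"
}
```

For example: expect_blocked kubectl exec -n hive-agents deploy/hive-agent -- wget -T 3 -qO- https://example.com. Any failure counts as blocked, including a missing pod or a missing wget binary, so run the allowed checks first to confirm the pods themselves are healthy.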

Per-Skill External Whitelist

The base sandbox policy denies arbitrary external egress. Add per-environment whitelists through an overlay that appends explicit ipBlock egress rules for the sandbox namespace.

Example:

kubectl kustomize deploy/k8s/overlays/sandbox-whitelist-example

Runtime Contract Summary

The Kubernetes baseline assumes these writable paths:

Gateway: none required in the health-only baseline; /data/hive in the full runtime overlay
Dashboard: none required in the standalone runtime image
Agent: /data/config
Sandbox: /sandbox/commands, /sandbox/tasks, /tmp, and the dedicated Codex home at /home/codex

The sandbox deployment mounts explicit writable volumes because the runtime contract keeps the root filesystem read-only while still allowing task-local logs, scratch files, and artifacts.

Operator Diagnostics

Runtime pods now expose enough probe metadata to separate "still starting" from "bootstrapped incorrectly":

status=starting means the runtime has not finished initialization
status=error means bootstrap failed and readiness_reason carries the actionable failure text
agent_id, project_id, controller_id, and deployment_backend map the pod back to OpenHive ownership without relying on log guessing

Useful checks:

kubectl get pod -n hive-agents <runtime-pod> -o jsonpath='{.metadata.annotations}'
kubectl exec -n hive-agents <runtime-pod> -- wget -qO- http://localhost:8090/healthz | jq .
kubectl get cronjobs,jobs -n hive-sandbox -o json | jq '.items[] | {name: .metadata.name, annotations: .metadata.annotations}'

Installer validation steps after make k8s-preview-install:

kubectl get pods -n openhive
kubectl get pods -n hive-agents
kubectl get pods -n hive-sandbox
kubectl logs -n openhive deploy/hive-gateway --tail=100
kubectl logs -n openhive deploy/hive-dashboard --tail=100
kubectl get ingress -n openhive openhive-platform

Keeper dev-task responses also expose sandbox runtime mapping through GET /dev-tasks/{task_id}:

runtime.backend_run_id links the OpenHive task to the backend execution row
runtime.execution_class shows whether the task ran as local, readonly, or networked sandbox work
runtime.artifact_root and runtime.log_root tell operators where to look next
