OpenHive Kubernetes Baseline

This document describes the preview-era Kubernetes deployment baseline plus the current productized control-plane path for issues #35, #36, #286, #287, #288, #289, #290, #291, #292, and #293.

Scope

The Kubernetes manifests now provide:

  • separate images for Gateway, Dashboard, Agent runtime, and Sandbox runtime
  • three namespaces: openhive, hive-agents, hive-sandbox
  • default-deny-style NetworkPolicy boundaries for agent and sandbox traffic
  • an init-container bootstrap flow that copies default agent config files into a PVC without overwriting local edits
  • a full Gateway runtime overlay with strict startup and explicit migration checks
  • a full platform runtime overlay that adds the Next.js dashboard and a combined Ingress resource

The Kubernetes runtime backend for ContainerAgentPool now exists and creates one Pod plus one Service per managed agent identity, carrying the exact OpenHive agent_id and controller ownership metadata in annotations so orphan reconciliation can remain safe. The Agent and Sandbox images are therefore no longer only probeable placeholders: the Agent image is the real per-agent runtime target when HIVE_AGENT_CONFIG_JSON, HIVE_GATEWAY_URL, and HIVE_INTERNAL_SECRET are provided.

The base manifests still do not wire the main application into ContainerAgentPool; that composition-root selection is made by the full Gateway runtime overlay described later in this document. Without that overlay, the base agent Deployment remains the preview-era health-only baseline, while the runtime backend serves the dynamic per-agent lifecycle path. The sandbox Deployment also carries the experimental dev-task API used by the governed workspace-task lane, but that lane is not yet a fully supported preview_local operator workflow.

This Kubernetes baseline therefore complements the source-based preview_local guide rather than replacing it. In local preview docs, the matching sandbox operator surface is still the optional make run-sandbox workflow on port 8091 with HIVE_SANDBOX_URL=http://127.0.0.1:8091.

For the role-by-role runtime contract behind these manifests, see docs/container-runtime-contracts.md.

Build Images

Published images live in GitHub Container Registry:

  • ghcr.io/terrywangcode/openhive-gateway
  • ghcr.io/terrywangcode/openhive-agent
  • ghcr.io/terrywangcode/openhive-sandbox
  • ghcr.io/terrywangcode/openhive-dashboard

Tag strategy for the supported preview path:

  • latest tracks the newest successful push from main
  • sha-<short> is published for every push to main
  • v* release tags publish the matching semver image tag

The Kubernetes manifests default to the GHCR latest tags. For repeatable preview installs, pin the preview installer env file to either a sha-<short> or release tag before applying a real cluster update.
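A pre-flight check along the following lines could catch unpinned image references before a real cluster update. This is a minimal Python sketch, not part of the installer. The helper names `image_tag` and `is_pinned` are hypothetical, and the exact tag shapes (a hex suffix after `sha-`, a three-part version with an optional leading `v`) are assumptions inferred from the tag strategy above.

```python
import re

# Assumed tag shapes, inferred from the tag strategy described above:
#   sha-<short>  -> "sha-" followed by 7 to 40 hex characters
#   release tag  -> three-part version, with or without a leading "v"
PINNED_TAG = re.compile(
    r"^(sha-[0-9a-f]{7,40}|v?\d+\.\d+\.\d+([-+][0-9A-Za-z.-]+)?)$"
)


def image_tag(image):
    """Return the tag of an image reference, or "latest" when none is given."""
    # Only look for ":" after the last "/" so a registry port is not read as a tag.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name:
        return "latest"
    return name.rsplit(":", 1)[1]


def is_pinned(image):
    """True when the reference uses a digest, a sha-<short> tag, or a release tag."""
    if "@sha256:" in image:
        # Digest references are immutable, so treat them as pinned.
        return True
    return bool(PINNED_TAG.match(image_tag(image)))
```

Running `is_pinned` over the four image values in the env file before `make k8s-preview-install` would flag any reference that still floats on `latest`.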

You can still build the same images locally for debugging:

docker build -f Dockerfile.gateway -t openhive-gateway:dev .
docker build -f Dockerfile.web -t openhive-dashboard:dev .
docker build -f Dockerfile.agent -t openhive-agent:dev .
docker build -f Dockerfile.sandbox -t openhive-sandbox:dev .

The dashboard image uses Next.js standalone output and starts the generated server.js runtime on port 3000.

Preview Install Surface

The supported operator-facing packaging surface for the current preview slice is a narrow, env-file-driven installer:

cp deploy/k8s/preview-installer/values.env.example /tmp/openhive-preview.env
# edit /tmp/openhive-preview.env with real images, secrets, DB URL, and ingress host

make k8s-preview-plan env_file=/tmp/openhive-preview.env
make k8s-preview-install env_file=/tmp/openhive-preview.env

The installer intentionally keeps scope narrow:

  • fixed namespaces: openhive, hive-agents, hive-sandbox
  • fixed secret name: openhive-platform-secrets
  • supported DB mode: external PostgreSQL only
  • supported runtime packaging: one migration Job plus one full platform overlay

The values file is the only required operator input surface for this preview path. It captures:

  • DATABASE_URL
  • DASHBOARD_SESSION_SECRET
  • HIVE_INTERNAL_SECRET
  • HIVE_ADMIN_USERNAME
  • HIVE_ADMIN_PASSWORD
  • OPENHIVE_GATEWAY_IMAGE
  • OPENHIVE_DASHBOARD_IMAGE
  • OPENHIVE_AGENT_IMAGE
  • OPENHIVE_SANDBOX_IMAGE
  • OPENHIVE_PIPELINE_IMAGE (optional; defaults to sandbox image)
  • OPENHIVE_INGRESS_HOST
  • OPENHIVE_INGRESS_TLS_SECRET
  • OPENHIVE_INGRESS_CLASS
  • OPENHIVE_CERT_MANAGER_CLUSTER_ISSUER
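Before running the installer, the values file can be sanity-checked with a few lines of Python. The sketch below is illustrative only: it assumes the file uses plain `KEY=VALUE` lines with `#` comments, and it treats every key except `OPENHIVE_PIPELINE_IMAGE` as required, which the installer may or may not enforce. The function names are hypothetical.

```python
# Keys listed above; treating all of them as required is an assumption.
REQUIRED_KEYS = (
    "DATABASE_URL",
    "DASHBOARD_SESSION_SECRET",
    "HIVE_INTERNAL_SECRET",
    "HIVE_ADMIN_USERNAME",
    "HIVE_ADMIN_PASSWORD",
    "OPENHIVE_GATEWAY_IMAGE",
    "OPENHIVE_DASHBOARD_IMAGE",
    "OPENHIVE_AGENT_IMAGE",
    "OPENHIVE_SANDBOX_IMAGE",
    "OPENHIVE_INGRESS_HOST",
    "OPENHIVE_INGRESS_TLS_SECRET",
    "OPENHIVE_INGRESS_CLASS",
    "OPENHIVE_CERT_MANAGER_CLUSTER_ISSUER",
)


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values):
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def pipeline_image(values):
    """OPENHIVE_PIPELINE_IMAGE is optional and defaults to the sandbox image."""
    return values.get("OPENHIVE_PIPELINE_IMAGE") or values["OPENHIVE_SANDBOX_IMAGE"]
```

A non-empty result from `missing_keys` is a reason to stop before `make k8s-preview-plan`.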

The installer renders a temporary values-specific Kustomize tree, then runs the repeatable sequence:

  1. apply namespaces
  2. apply openhive-platform-secrets
  3. rerun the migration Job and wait for success
  4. apply the full platform runtime
  5. wait for Gateway, Dashboard, Agent, and Sandbox rollouts

Render and Apply

kubectl kustomize deploy/k8s/base
kubectl apply -k deploy/k8s/base

CI and local validation:

make k8s-validate
./scripts/k8s/verify-bootstrap.sh

Full Gateway Runtime Overlay

The base deployment intentionally keeps Gateway on the health-only hive.container.gateway_entrypoint:app entrypoint so low-level image and probe checks stay lightweight.

To run the real OpenHive control plane in-cluster, use:

kubectl kustomize deploy/k8s/overlays/full-gateway-runtime
kubectl apply -k deploy/k8s/overlays/full-gateway-runtime

This overlay keeps the same Service and image, but patches the Gateway deployment to:

  • start hive.main:app
  • run under the dedicated hive-gateway-runtime ServiceAccount, with namespace-scoped RBAC in hive-agents and hive-sandbox
  • mount a writable workspace at /data/hive
  • select HIVE_POOL_BACKEND=container with HIVE_CONTAINER_RUNTIME_BACKEND=kubernetes so Gateway creates isolated agent runtimes through the Kubernetes backend instead of the in-process local pool
  • point those isolated runtimes back at Gateway through HIVE_AGENT_RUNTIME_GATEWAY_URL=http://hive-gateway.openhive.svc.cluster.local:8080
  • declare HIVE_AGENT_RUNTIME_IMAGE=ghcr.io/terrywangcode/openhive-agent:latest and HIVE_K8S_AGENT_NAMESPACE=hive-agents for the first supported lifecycle path
  • declare HIVE_SANDBOX_URL=http://hive-sandbox.hive-sandbox.svc.cluster.local:8091 so the full runtime path reaches the in-cluster sandbox service explicitly
  • enable HIVE_STRICT_STARTUP=true so missing secrets or DB startup failures crash the pod instead of serving a degraded control plane
  • switch startup into HIVE_STARTUP_MIGRATION_MODE=check, which requires the DB schema to already be at the current Alembic head
  • disable HIVE_METADATA_CREATE_ON_STARTUP so the K8s path does not mutate the schema outside the explicit migration flow
  • add a startupProbe on /healthz so migrations and startup wiring can complete before liveness checks take over
  • widen Gateway ingress to the openhive namespace so an in-cluster dashboard can reach the API
  • allow Gateway egress to TCP 8090 so it can reach per-agent runtime Services
  • allow Gateway egress to TCP 8091 so it can reach the shared sandbox Service
  • allow Gateway egress to TCP 5432 for the first external PostgreSQL target
  • allow sandbox egress to TCP 5432 for the shared PostgreSQL-backed dev-task state
  • allow sandbox egress to TCP 443 so the direct codex_cli path can reach HTTPS model/provider endpoints when operators enable real provider-backed execution
  • allow Gateway egress to TCP 443 so container-pool mode can reach the Kubernetes API and the normal HTTPS upstreams used by the full control plane

Create the required secrets before applying the overlay:

kubectl create secret generic openhive-platform-secrets -n openhive \
  --from-literal=database-url='postgresql+asyncpg://hive:password@postgres.example:5432/hive' \
  --from-literal=dashboard-session-secret='replace-with-random-secret' \
  --from-literal=gateway-internal-secret='replace-with-internal-relay-secret' \
  --from-literal=admin-username='admin' \
  --from-literal=admin-password='change-me'

kubectl create secret generic openhive-platform-secrets -n hive-sandbox \
  --from-literal=database-url='postgresql+asyncpg://hive:password@postgres.example:5432/hive' \
  --from-literal=codex-model='gpt-5.4' \
  --from-literal=codex-provider-id='club' \
  --from-literal=codex-provider-name='ai code club' \
  --from-literal=codex-base-url='https://claude-code.club/openai' \
  --from-literal=codex-wire-api='responses' \
  --from-literal=codex-env-key='OPENAI_API_KEY' \
  --from-literal=codex-requires-openai-auth='true' \
  --from-literal=openai-api-key='replace-if-using-real-codex-cli' \
  --from-literal=anthropic-api-key='replace-if-using-real-codex-cli' \
  --from-literal=deepseek-api-key='replace-if-using-real-codex-cli' \
  --from-literal=qwen-api-key='replace-if-using-real-codex-cli'

Minimum env/secret surface for the full runtime overlay:

  • DATABASE_URL
  • DASHBOARD_SESSION_SECRET
  • HIVE_INTERNAL_SECRET
  • HIVE_ADMIN_USERNAME
  • HIVE_ADMIN_PASSWORD
  • HIVE_WORKSPACE=/data/hive
  • HIVE_POOL_BACKEND=container
  • HIVE_CONTAINER_RUNTIME_BACKEND=kubernetes
  • HIVE_AGENT_RUNTIME_GATEWAY_URL=http://hive-gateway.openhive.svc.cluster.local:8080
  • HIVE_AGENT_RUNTIME_IMAGE=ghcr.io/terrywangcode/openhive-agent:latest
  • HIVE_K8S_AGENT_NAMESPACE=hive-agents
  • HIVE_DEPLOYMENT_BACKEND=kubernetes
  • HIVE_PIPELINE_IMAGE=ghcr.io/terrywangcode/openhive-sandbox:latest
  • HIVE_K8S_PIPELINE_NAMESPACE=hive-sandbox
  • HIVE_SANDBOX_URL=http://hive-sandbox.hive-sandbox.svc.cluster.local:8091
  • sandbox codex_cli auth:
    • HIVE_SANDBOX_CODEX_AUTH_MODE=env
    • HIVE_SANDBOX_CODEX_SANDBOX_MODE=danger-full-access in the Kubernetes sandbox pod, because the pod itself is the outer isolation boundary and the default inner bubblewrap sandbox is not available in this runtime
    • HIVE_SANDBOX_CODEX_ENV_ALLOWLIST=OPENAI_API_KEY,ANTHROPIC_API_KEY,DEEPSEEK_API_KEY,QWEN_API_KEY
    • optional provider bootstrap envs HIVE_SANDBOX_CODEX_MODEL, HIVE_SANDBOX_CODEX_PROVIDER_ID, HIVE_SANDBOX_CODEX_PROVIDER_NAME, HIVE_SANDBOX_CODEX_BASE_URL, HIVE_SANDBOX_CODEX_WIRE_API, HIVE_SANDBOX_CODEX_ENV_KEY, and HIVE_SANDBOX_CODEX_REQUIRES_OPENAI_AUTH let the sandbox materialize ~/.codex/config.toml at runtime when a real openai-compatible Codex provider needs more than a bare API key
    • optional sandbox-namespace secret keys openai-api-key, anthropic-api-key, deepseek-api-key, qwen-api-key, plus optional codex-model, codex-provider-id, codex-provider-name, codex-base-url, codex-wire-api, codex-env-key, and codex-requires-openai-auth
  • optional sandbox workspace apply relay:
    • HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_URL points to an operator-controlled HTTP endpoint that can apply approved patches to host-local workspaces when the sandbox was seeded from workspace_archive_b64
    • HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_TOKEN is sent as a bearer token to that endpoint
    • this relay is only for approved patch apply-back; it is distinct from the model-token relay and does not change provider-secret residency claims
    • if the relay endpoint is outside the cluster, operators must add a matching sandbox egress rule for that host and port
  • HIVE_STRICT_STARTUP=true
  • HIVE_STARTUP_MIGRATION_MODE=check
  • HIVE_METADATA_CREATE_ON_STARTUP=false

The base sandbox deployment now opts the in-cluster codex_cli path into this explicit provider-env mode by wiring those optional secret refs into the sandbox runtime env. If the secret keys are absent, the pod still starts and the governed codex child receives no provider keys. This direct-env mode is operationally useful for real codex_cli execution, but it remains distinct from the opt-in relay_helper proof path and from any future token-based relay flow.

When provider bootstrap envs are present, the sandbox entrypoint writes a minimal runtime ~/.codex/config.toml once at startup and leaves that runtime copy in place, so the default codex_cli path can target an OpenAI-compatible provider without baking provider config into the image. The base manifest also mounts a dedicated writable home at /home/codex, because using /tmp as HOME causes real non-ephemeral Codex runs to stall in containerized proof/runtime paths.
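To make the provider bootstrap step more concrete, the Python sketch below shows one way the HIVE_SANDBOX_CODEX_* envs could be turned into a minimal config.toml. It is not the sandbox entrypoint. The function names are hypothetical, the mapping from each env to a config key is an assumption, and reading "once at startup" as "do not overwrite an existing file" is also an assumption.

```python
import json
from pathlib import Path

# Env var -> key inside [model_providers.<id>]. This mapping is an assumption.
PROVIDER_FIELDS = (
    ("HIVE_SANDBOX_CODEX_PROVIDER_NAME", "name"),
    ("HIVE_SANDBOX_CODEX_BASE_URL", "base_url"),
    ("HIVE_SANDBOX_CODEX_WIRE_API", "wire_api"),
    ("HIVE_SANDBOX_CODEX_ENV_KEY", "env_key"),
)


def render_codex_config(env):
    """Render a minimal TOML document, or "" when no provider id is set."""
    provider_id = env.get("HIVE_SANDBOX_CODEX_PROVIDER_ID", "")
    if not provider_id:
        return ""
    lines = []
    model = env.get("HIVE_SANDBOX_CODEX_MODEL", "")
    if model:
        # json.dumps yields a double-quoted string that is also a valid TOML string.
        lines.append(f"model = {json.dumps(model)}")
    lines.append(f"model_provider = {json.dumps(provider_id)}")
    lines.append("")
    # Assumes the provider id is safe to use as a bare TOML key.
    lines.append(f"[model_providers.{provider_id}]")
    for env_name, key in PROVIDER_FIELDS:
        value = env.get(env_name, "")
        if value:
            lines.append(f"{key} = {json.dumps(value)}")
    requires_auth = env.get("HIVE_SANDBOX_CODEX_REQUIRES_OPENAI_AUTH", "")
    if requires_auth:
        flag = "true" if requires_auth.strip().lower() == "true" else "false"
        lines.append(f"requires_openai_auth = {flag}")
    return "\n".join(lines) + "\n"


def write_once(home, content):
    """Write <home>/.codex/config.toml unless it already exists or content is empty."""
    target = Path(home) / ".codex" / "config.toml"
    if not content or target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return True
```

With the secret values from the example above, `render_codex_config` would emit a `[model_providers.club]` table, and `write_once("/home/codex", ...)` would leave any existing runtime copy untouched.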

Supported Kubernetes DB Mode

The first supported Kubernetes DB mode is intentionally narrow:

  • operator-managed external PostgreSQL
  • no bundled in-cluster PostgreSQL deployment yet
  • no automatic DB backup, restore, or lifecycle ownership by OpenHive

That means:

  • OpenHive expects a reachable PostgreSQL URL in DATABASE_URL
  • operators own PostgreSQL provisioning, TLS, backup, restore, and retention
  • OpenHive only owns the migration execution contract described next

Repeatable Migration Job

Run Alembic migrations in-cluster with the dedicated Job manifests before starting or upgrading the full Gateway runtime:

kubectl delete job openhive-db-migrate -n openhive --ignore-not-found=true
kubectl apply -k deploy/k8s/jobs/external-postgres-migration
kubectl wait -n openhive --for=condition=complete job/openhive-db-migrate --timeout=180s
kubectl logs -n openhive job/openhive-db-migrate

The migration Job:

  • reuses the ghcr.io/terrywangcode/openhive-gateway:latest image
  • runs alembic upgrade head
  • reads only DATABASE_URL from the same openhive-platform-secrets secret
  • is safe to rerun after deleting the previous Job object

The sandbox deployment also reads DATABASE_URL from an openhive-platform-secrets Secret in the hive-sandbox namespace so dev-task creation and review state persist through the shared PostgreSQL database.

Full Platform Runtime Overlay

Use the full platform overlay when you want the operator-facing dashboard, combined ingress, and full Gateway runtime together:

kubectl kustomize deploy/k8s/overlays/full-platform-runtime
kubectl apply -k deploy/k8s/overlays/full-platform-runtime

This overlay builds on top of full-gateway-runtime and adds:

  • a hive-dashboard Deployment that runs the standalone Next.js server on port 3000
  • a hive-dashboard Service for in-cluster and port-forward access
  • HIVE_GATEWAY_INTERNAL_URL=http://hive-gateway.openhive.svc.cluster.local:8080 so dashboard-originated /api/* and /healthz requests stay same-origin to the browser while proxying in-cluster to Gateway
  • a dedicated dashboard probe endpoint at GET /dashboard-healthz
  • a combined Ingress that routes /api and /healthz to Gateway and / to the dashboard

The included Ingress is intentionally only an example default:

  • ingressClassName: nginx
  • cert-manager.io/cluster-issuer: letsencrypt-prod
  • placeholder host openhive.example.com
  • placeholder TLS secret openhive-example-tls

Before applying in a real cluster, patch deploy/k8s/overlays/full-platform-runtime/platform-ingress.yaml for your hostnames, TLS secret, and ingress-controller annotations.

Without an ingress controller, you can still verify the operator-facing path by port-forwarding the dashboard service:

kubectl port-forward -n openhive svc/hive-dashboard 3000:3000

Then open http://127.0.0.1:3000. Dashboard login and authenticated API calls still work because the dashboard container proxies /api/* and /healthz through the in-cluster Gateway service.

Recommended order for the current productized preview slice:

  1. copy deploy/k8s/preview-installer/values.env.example to a private env file
  2. fill in real secret values, image references, and ingress settings
  3. run make k8s-preview-plan env_file=/path/to/env
  4. run make k8s-preview-install env_file=/path/to/env
  5. verify kubectl logs -n openhive deploy/hive-gateway
  6. verify kubectl logs -n openhive deploy/hive-dashboard
  7. verify dashboard reachability via ingress host or kubectl port-forward

Upgrade and rollback expectations for the supported preview scope:

  • upgrade by updating image references or secrets in the env file, then rerun make k8s-preview-install
  • the installer always reruns the migration Job before applying the runtime
  • rollback is limited to reapplying a previously known-good env file; OpenHive does not provide automatic PostgreSQL rollback or backup orchestration
  • do not roll back across schema-incompatible releases unless the database has been restored through the operator-owned PostgreSQL process

Failure behavior is intentionally explicit:

  • if the migration Job fails, do not start the full platform runtime yet
  • if Gateway starts before migrations are current, HIVE_STARTUP_MIGRATION_MODE=check makes the pod fail with a clear startup error instructing the operator to run the migration Job
  • if PostgreSQL is unreachable, both the Job and the strict Gateway startup fail fast instead of serving a degraded control plane
  • if the dashboard cannot reach Gateway, /dashboard-healthz still proves the dashboard container is alive while login and /api/* traffic will surface the upstream connectivity problem

Kubernetes Agent Runtime Backend

With the full Gateway runtime overlay, the application now selects ContainerAgentPool automatically and the Kubernetes runtime backend manages one Pod and one Service per OpenHive agent identity. The first supported resource model is intentionally small and direct:

  • deterministic runtime name derived from agent_id and safe for Kubernetes DNS
  • Pod annotations for openhive.io/agent-id and openhive.io/controller-id
  • Service DNS base URL of the form http://<runtime-name>.hive-agents.svc.cluster.local:8090
  • per-agent config bootstrap through the existing agent image and bootstrap-config.sh init container
  • ephemeral emptyDir workspace for the first lifecycle slice

That means operators can map an OpenHive agent runtime row back to a concrete Kubernetes Pod/Service pair without relying on lossy label-safe rewrites of the original agent_id.
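The naming scheme itself is not spelled out here, so the following Python sketch shows only one plausible way to derive a deterministic, DNS-safe runtime name while keeping the exact agent_id in annotations. The `hive-agent` prefix, the slug-plus-hash layout, and all function names are assumptions for illustration, not the backend's actual algorithm.

```python
import hashlib
import re

MAX_LABEL = 63  # maximum length of a Kubernetes DNS label


def runtime_name(agent_id, prefix="hive-agent"):
    """Derive a stable DNS-label-safe name from an arbitrary agent_id."""
    # Readable part: lowercase, with unsupported characters collapsed to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", agent_id.lower()).strip("-")
    # Hash suffix keeps distinct ids distinct even when their slugs collide.
    digest = hashlib.sha256(agent_id.encode("utf-8")).hexdigest()[:10]
    room = MAX_LABEL - len(prefix) - len(digest) - 2  # two "-" separators
    slug = slug[:room].strip("-")
    parts = [prefix, slug, digest] if slug else [prefix, digest]
    return "-".join(parts)


def runtime_annotations(agent_id, controller_id):
    """Annotations carry the exact ids, so nothing is lost to name rewriting."""
    return {
        "openhive.io/agent-id": agent_id,
        "openhive.io/controller-id": controller_id,
    }


def runtime_base_url(name, namespace="hive-agents", port=8090):
    """Service DNS base URL in the form described above."""
    return f"http://{name}.{namespace}.svc.cluster.local:{port}"
```

The design point is that the name may be lossy (lowercased, truncated, hashed), while the annotations hold the original agent_id verbatim for orphan reconciliation.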

Kubernetes Pipeline Workloads

With HIVE_DEPLOYMENT_BACKEND=kubernetes, Keeper's deploy_pipeline and get_pipeline_status tools stop falling back to Docker-only behavior and use a Kubernetes-native workload contract instead.

The first supported mapping is intentionally narrow:

  • recurring schedules such as 0 * * * * -> CronJob
  • one-off markers such as an empty schedule or @once -> Job
  • long-running worker-style Deployment workloads remain out of the first supported slice
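The mapping above can be summarized in a few lines of Python. This is a sketch of the rule as documented, with a hypothetical function name; it does not validate cron syntax and is not the plugin's own code.

```python
def workload_kind(schedule):
    """Map a pipeline schedule to the Kubernetes workload kind described above."""
    normalized = (schedule or "").strip()
    # An empty schedule or the @once marker means a one-off run.
    if normalized in ("", "@once"):
        return "Job"
    # Anything else is treated as a recurring schedule.
    return "CronJob"
```

For the sample manifest in this section, `workload_kind("0 * * * *")` selects a CronJob.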

The plugin writes the same minimum pipeline manifest that local preview flows already use:

project_id: proj_a
pipeline_name: hourly-classify
schedule: 0 * * * *
steps:
  - fetch
  - classify

In the Kubernetes path, that manifest is rendered into a managed ConfigMap named from the project and pipeline identity, then mounted into the Pipeline workload at:

  • HIVE_WORKSPACE=/data/pipelines
  • /data/pipelines/<project_id>/pipeline/pipeline.yaml

The first supported runtime target is the sandbox image:

  • namespace: hive-sandbox
  • image: ghcr.io/terrywangcode/openhive-sandbox:latest
  • command: python -m hive.pipelines.job_entrypoint

Status reporting is also Kubernetes-native:

  • get_pipeline_status lists managed CronJob and Job workloads
  • recurring workloads surface the state of the latest child Job when one has failed
  • operators can confirm the same workload with kubectl get cronjobs,jobs -n hive-sandbox -l openhive.io/pipeline-managed=kubernetes
  • workload annotations include openhive.io/project-id, openhive.io/pipeline-name, and openhive.io/resource-kind so a failed Job can be mapped back to the OpenHive pipeline without opening the container

Backup And Restore Expectations

In this first Kubernetes DB mode, PostgreSQL backups and restores remain an operator responsibility. Before applying migrations in production-like environments, take an external backup such as:

PGPASSWORD="$DB_PASSWORD" pg_dump \
  --format=custom \
  --host="$DB_HOST" \
  --port="${DB_PORT:-5432}" \
  --username="$DB_USER" \
  --dbname="$DB_NAME" \
  > openhive-pre-migration.dump

OpenHive's migration Job does not create backups, does not drop databases, and does not own restore orchestration for external PostgreSQL.
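The pg_dump example expects separate DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, and DB_NAME values, while OpenHive itself is configured with a single DATABASE_URL. The Python sketch below, with a hypothetical function name, shows one way to split that URL using only the standard library. It assumes the URL follows the `postgresql+asyncpg://user:password@host:port/dbname` shape used elsewhere in this document.

```python
from urllib.parse import unquote, urlsplit


def pg_dump_env(database_url):
    """Split a SQLAlchemy-style PostgreSQL URL into the variables pg_dump needs."""
    parts = urlsplit(database_url)
    return {
        "DB_HOST": parts.hostname or "",
        # Fall back to the default PostgreSQL port when the URL omits it.
        "DB_PORT": str(parts.port or 5432),
        # Credentials may be percent-encoded inside a URL.
        "DB_USER": unquote(parts.username or ""),
        "DB_PASSWORD": unquote(parts.password or ""),
        "DB_NAME": unquote(parts.path.lstrip("/")),
    }
```

Exporting the resulting values before running the backup command keeps the dump pointed at the same database the migration Job will touch.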

Operator notes:

  • the overlay assumes PostgreSQL is reachable on TCP 5432; patch the NetworkPolicy if your managed DB uses a different port or a narrower egress CIDR
  • the Gateway workspace mount is currently emptyDir, which is enough for startup and API reachability checks but is neither durable nor shared across Gateway pods
  • the full platform overlay includes a combined ingress example, but operators still own ingress-controller installation and host-specific TLS configuration

Config Bootstrap Flow

The agent deployment mounts agent-config-pvc at /data/config and runs the config-bootstrap init container before the main container starts.

Bootstrap behavior:

  • source defaults: /app/defaults/agent
  • target volume: /data/config
  • copy mode: only copy files that do not already exist
  • version marker: write /data/config/.version
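The no-clobber rule is easiest to see in code. The real bootstrap runs as the bootstrap-config.sh init container; the Python sketch below only mirrors the documented behavior (copy missing files, never overwrite, write a version marker) and its function name and the marker's content format are assumptions.

```python
import shutil
from pathlib import Path


def bootstrap_config(defaults, target, version):
    """Copy default files that are missing from target and write a version marker."""
    defaults, target = Path(defaults), Path(target)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(defaults.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(defaults)
        dst = target / rel
        if dst.exists():
            # Local edits win: an existing file is never replaced.
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(rel.as_posix())
    # Assumed marker format: the version string on a single line.
    (target / ".version").write_text(version + "\n")
    return copied
```

In the cluster, the equivalent call would use /app/defaults/agent as the source and /data/config as the target.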

To verify no-clobber behavior:

  1. kubectl exec -n hive-agents deploy/hive-agent -- sh -c "echo custom > /data/config/HEARTBEAT.md"
  2. Restart the pod: kubectl rollout restart -n hive-agents deploy/hive-agent
  3. Confirm the edit survived: kubectl exec -n hive-agents deploy/hive-agent -- cat /data/config/HEARTBEAT.md
  4. Update the image tag or bootstrap version, then confirm newly added default files appear without replacing edited files

NetworkPolicy Assertions

These are the minimum checks before closing #35:

  1. Agent pod can reach Gateway on port 8080
  2. Agent pod can reach Sandbox on port 8091
  3. Agent pod cannot reach an arbitrary external address
  4. Sandbox pod can reach Gateway on port 8080
  5. Sandbox pod cannot reach Agent on port 8090

Example manual checks:

kubectl exec -n hive-agents deploy/hive-agent -- wget -qO- http://hive-gateway.openhive.svc.cluster.local:8080/healthz
kubectl exec -n hive-agents deploy/hive-agent -- wget -qO- http://hive-sandbox.hive-sandbox.svc.cluster.local:8091/healthz
kubectl exec -n hive-sandbox deploy/hive-sandbox -- wget -qO- http://hive-gateway.openhive.svc.cluster.local:8080/healthz

Expected blocked checks:

kubectl exec -n hive-agents deploy/hive-agent -- wget -T 3 -qO- https://example.com
kubectl exec -n hive-sandbox deploy/hive-sandbox -- wget -T 3 -qO- http://hive-agent.hive-agents.svc.cluster.local:8090/healthz
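The five assertions can also be written down as an expectation table and checked with plain TCP connects. The Python sketch below is a hedged alternative to the wget commands, not an existing script in the repository. It assumes a Python interpreter is available in the pod where it runs, and the names `EXPECTED`, `can_connect`, and `failures` are hypothetical.

```python
import socket

# (source role, target host, port, should the connection succeed?)
EXPECTED = [
    ("agent", "hive-gateway.openhive.svc.cluster.local", 8080, True),
    ("agent", "hive-sandbox.hive-sandbox.svc.cluster.local", 8091, True),
    ("agent", "example.com", 443, False),
    ("sandbox", "hive-gateway.openhive.svc.cluster.local", 8080, True),
    ("sandbox", "hive-agent.hive-agents.svc.cluster.local", 8090, False),
]


def can_connect(host, port, timeout=3.0):
    """Return True when a TCP connection opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def failures(source, probe=can_connect):
    """List (host, port) pairs whose reachability differs from the expectation."""
    return [
        (host, port)
        for src, host, port, expected in EXPECTED
        if src == source and probe(host, port) != expected
    ]
```

Running `failures("agent")` inside an agent pod and `failures("sandbox")` inside the sandbox pod should both return an empty list when the policies behave as described. A TCP connect proves less than an HTTP /healthz response, so the wget checks remain the stronger test for the allowed paths.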

Per-Skill External Whitelist

The base sandbox policy denies arbitrary external egress. Add per-environment whitelists through an overlay that appends explicit ipBlock egress rules for the sandbox namespace.

Example:

kubectl kustomize deploy/k8s/overlays/sandbox-whitelist-example

Runtime Contract Summary

The Kubernetes baseline assumes these writable paths:

  • Gateway: none required in the health-only baseline; /data/hive in the full runtime overlay
  • Dashboard: none required in the standalone runtime image
  • Agent: /data/config
  • Sandbox: /sandbox/commands, /sandbox/tasks, /tmp, and the dedicated Codex home at /home/codex

The sandbox deployment mounts explicit writable volumes because the runtime contract keeps the root filesystem read-only while still allowing task-local logs, scratch files, and artifacts.

Operator Diagnostics

Runtime pods now expose enough probe metadata to separate "still starting" from "bootstrapped incorrectly":

  • status=starting means the runtime has not finished initialization
  • status=error means bootstrap failed and readiness_reason carries the actionable failure text
  • agent_id, project_id, controller_id, and deployment_backend map the pod back to OpenHive ownership without relying on log guessing

Useful checks:

kubectl get pod -n hive-agents <runtime-pod> -o jsonpath='{.metadata.annotations}'
kubectl exec -n hive-agents <runtime-pod> -- wget -qO- http://localhost:8090/healthz | jq .
kubectl get cronjobs,jobs -n hive-sandbox -o json | jq '.items[] | {name: .metadata.name, annotations: .metadata.annotations}'
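When triaging many runtime pods, the /healthz JSON can be reduced to a one-line summary. The Python sketch below uses only the fields named in this section; the function name is hypothetical, and because the value reported by a healthy runtime is not documented here, any other status is passed through unchanged.

```python
OWNERSHIP_FIELDS = ("agent_id", "project_id", "controller_id", "deployment_backend")


def describe_probe(payload):
    """Summarize a /healthz payload using the documented probe fields."""
    status = payload.get("status")
    if status == "starting":
        message = "still starting"
    elif status == "error":
        reason = payload.get("readiness_reason") or "no readiness_reason reported"
        message = f"bootstrap failed: {reason}"
    else:
        # Other status values are not documented here, so report them as-is.
        message = f"status={status}"
    # Ownership fields map the pod back to OpenHive without reading logs.
    owner = ", ".join(
        f"{key}={payload[key]}" for key in OWNERSHIP_FIELDS if payload.get(key)
    )
    return f"{message} ({owner})" if owner else message
```

Feeding it the parsed output of the wget check above separates pods that only need more time from pods whose bootstrap needs operator attention.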

Installer validation steps after make k8s-preview-install:

kubectl get pods -n openhive
kubectl get pods -n hive-agents
kubectl get pods -n hive-sandbox
kubectl logs -n openhive deploy/hive-gateway --tail=100
kubectl logs -n openhive deploy/hive-dashboard --tail=100
kubectl get ingress -n openhive openhive-platform

Keeper dev-task responses also expose sandbox runtime mapping through GET /dev-tasks/{task_id}:

  • runtime.backend_run_id links the OpenHive task to the backend execution row
  • runtime.execution_class shows whether the task ran as local, readonly, or networked sandbox work
  • runtime.artifact_root and runtime.log_root tell operators where to look next