feat: inject sandbox UID via compute driver instead of requiring it in container images

## Problem Statement

Every container image used as an OpenShell sandbox must bake in a `sandbox` user and group at a specific UID/GID. The bring-your-own-container docs require UID `1000660000`, the VM driver hardcodes UID `10001`, and the policy engine rejects any `run_as_user` value other than `"sandbox"`. This creates friction for image authors, prevents compatibility with environments that allocate their own UID ranges (e.g., OpenShift SCCs), and ignores the Kubernetes-native `securityContext` mechanism for injecting user identity at runtime.

The goal is to eliminate the requirement that the sandbox user exist in the container image and instead have the compute driver inject the desired UID/GID at sandbox creation time. On OpenShift, the UID should be auto-detected from namespace SCC annotations.

## Technical Context

The sandbox user identity flows through 6 distinct layers today — from image build time, through policy validation, to supervisor privilege dropping. Each layer assumes a user named `"sandbox"` exists in the container's `/etc/passwd`. The supervisor starts as root (UID 0) and drops privileges to the sandbox user via `setuid()`/`setgid()` after setting up network namespaces, Landlock, and seccomp. The Kubernetes driver already constructs `securityContext` on the pod spec (setting `runAsUser: 0` for the supervisor), so extending it to pass the sandbox UID is architecturally straightforward. The proto already uses a `string` field for `run_as_user`, so numeric UIDs require no wire format change.

## Affected Components

| Component | Key Files | Role |
|-----------|-----------|------|
| Policy engine | `crates/openshell-policy/src/lib.rs` | Validates `run_as_user`/`run_as_group`, currently rejects anything other than `"sandbox"` |
| Supervisor (process) | `crates/openshell-supervisor-process/src/process.rs` | Validates sandbox user exists, drops privileges via `setuid()`/`setgid()`, chowns filesystem |
| Supervisor (SSH) | `crates/openshell-supervisor-process/src/ssh.rs` | Derives `USER`/`HOME` env vars from policy user |
| Kubernetes driver | `crates/openshell-driver-kubernetes/src/driver.rs` | Constructs pod spec, sets `securityContext`, manages PVC init containers |
| VM driver | `crates/openshell-driver-vm/src/rootfs.rs` | Writes sandbox user into guest rootfs at image prep time |
| Docker driver | `crates/openshell-driver-docker/src/lib.rs` | Uses `--user` on `docker run` |
| Proto | `proto/sandbox.proto` | Defines `ProcessPolicy.run_as_user` as string (no change needed) |
| BYOC example | `examples/bring-your-own-container/` | Documents and demonstrates sandbox user creation |

## Technical Investigation

### Architecture Overview

The sandbox user identity passes through 6 stages during the sandbox lifecycle: it is established at image build time (stage 1) and consumed at the five stages that follow:

1. **Image build** — `groupadd`/`useradd` creates the `sandbox` user at a fixed UID in the Dockerfile
2. **Policy normalization** — `ensure_sandbox_process_identity()` defaults empty `run_as_user`/`run_as_group` to `"sandbox"`
3. **Policy validation** — `validate_sandbox_policy()` hard-rejects any non-`"sandbox"` value
4. **Supervisor user validation** — `validate_sandbox_user()` calls `User::from_name("sandbox")` against `/etc/passwd`
5. **Privilege dropping** — `drop_privileges()` resolves `"sandbox"` to a UID via `User::from_name()`, then calls `setgid()`/`setuid()` with verification
6. **Filesystem prep** — `prepare_filesystem()` resolves the sandbox user for `chown` of `read_write` directories

The supervisor runs as root (UID 0) to create network namespaces, set up the proxy, and configure Landlock/seccomp. It drops to the sandbox UID only for child processes. The Kubernetes driver forces `securityContext.runAsUser = 0` on the main container for this reason.

### Code References

| Location | Description |
|----------|-------------|
| `openshell-policy/src/lib.rs:660-668` | `ensure_sandbox_process_identity()` — defaults empty user/group to `"sandbox"` |
| `openshell-policy/src/lib.rs:756-772` | `validate_sandbox_policy()` — hard-rejects non-`"sandbox"` values for `run_as_user`/`run_as_group` |
| `openshell-policy/src/lib.rs:680-697` | `PolicyViolation` enum — would need new `UidOutOfRange` variant |
| `openshell-supervisor-process/src/process.rs:758-786` | `validate_sandbox_user()` — calls `User::from_name("sandbox")`, fails if missing from image |
| `openshell-supervisor-process/src/process.rs:892-998` | `drop_privileges()` — resolves name → UID via `User::from_name()`, calls `setgid()`/`setuid()` with verification |
| `openshell-supervisor-process/src/process.rs:788-870` | `prepare_filesystem()` — resolves sandbox user/group for `chown` of `read_write` directories |
| `openshell-supervisor-process/src/ssh.rs:221-225` | SSH session — derives `USER`/`HOME` from policy `run_as_user`, defaults to `"sandbox"`/`"/sandbox"` |
| `openshell-driver-kubernetes/src/driver.rs:970-981` | K8s driver — forces `securityContext.runAsUser = 0` on supervisor container |
| `openshell-driver-kubernetes/src/driver.rs:994+` | PVC workspace init container — seeds PVCs, needs sandbox UID for chown |
| `openshell-driver-vm/src/rootfs.rs:755-772` | VM driver — hardcodes `SANDBOX_UID = 10001` / `SANDBOX_GID = 10001` in rootfs |
| `proto/sandbox.proto:47-52` | `ProcessPolicy` — `run_as_user` and `run_as_group` are `string` fields |
| `examples/bring-your-own-container/Dockerfile:20-21` | BYOC example — `groupadd -g 1000660000 sandbox && useradd -m -u 1000660000 -g sandbox sandbox` |
| `e2e/rust/tests/custom_image.rs:27-28` | E2E test image — same `1000660000` pattern |

### Current Behavior

When a sandbox is created:
1. The policy's `run_as_user` is defaulted to `"sandbox"` if empty, then validated — only `"sandbox"` is accepted.
2. The supervisor calls `User::from_name("sandbox")` against the container's `/etc/passwd`. If the user doesn't exist, startup fails with: _"sandbox user 'sandbox' not found in image; all sandbox images must include a 'sandbox' user and group"_.
3. `drop_privileges()` resolves `"sandbox"` → numeric UID via `User::from_name()`, then calls `setgid()`/`setuid()` with post-drop verification (defense-in-depth: confirms UID changed, confirms root can't be re-acquired).
4. `prepare_filesystem()` resolves the sandbox user for `chown` of `read_write` directories before forking the child process.

### What Would Need to Change

**Policy engine** — Relax the hard `"sandbox"` string check to also accept numeric UID strings within a platform-level range:
```rust
const MIN_SANDBOX_UID: u32 = 1000;
const MAX_SANDBOX_UID: u32 = 2_000_000_000;
```
Accept `"sandbox"` (existing) or any `u32` in `[MIN_SANDBOX_UID, MAX_SANDBOX_UID]`. Reject `"root"`, UID 0, system UIDs below 1000, values above `MAX_SANDBOX_UID`, and any other non-numeric string. Add a `UidOutOfRange` violation variant for clear error messages. The range is a platform safety constant, not a per-policy knob.
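A minimal sketch of what the relaxed check could look like, assuming a trimmed-down stand-in for the real `PolicyViolation` enum. The function name `validate_process_identity` and the `InvalidIdentity` variant are hypothetical and only illustrate the accept/reject rules listed above; the real `validate_sandbox_policy()` may be structured differently.

```rust
use std::fmt;

/// Platform-level bounds from the proposal above.
const MIN_SANDBOX_UID: u32 = 1000;
const MAX_SANDBOX_UID: u32 = 2_000_000_000;

/// Hypothetical, trimmed-down stand-in for the real `PolicyViolation` enum.
#[derive(Debug, PartialEq, Eq)]
pub enum PolicyViolation {
    /// A numeric ID outside `[MIN_SANDBOX_UID, MAX_SANDBOX_UID]`.
    UidOutOfRange { value: String },
    /// A name other than "sandbox" (including "root"), or a malformed number.
    InvalidIdentity { value: String },
}

impl fmt::Display for PolicyViolation {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UidOutOfRange { value } => write!(
                f,
                "UID/GID {} is outside the allowed range [{}, {}]",
                value, MIN_SANDBOX_UID, MAX_SANDBOX_UID
            ),
            Self::InvalidIdentity { value } => {
                write!(f, "identity {:?} must be \"sandbox\" or a numeric ID", value)
            }
        }
    }
}

/// Accepts "sandbox" or a decimal ID within the platform range.
pub fn validate_process_identity(value: &str) -> Result<(), PolicyViolation> {
    if value == "sandbox" {
        return Ok(());
    }
    // Only plain ASCII digits count as numeric, so "+1000" or " 1000" are
    // treated as (invalid) names rather than silently parsed.
    let is_numeric = !value.is_empty() && value.bytes().all(|b| b.is_ascii_digit());
    if !is_numeric {
        return Err(PolicyViolation::InvalidIdentity { value: value.to_string() });
    }
    match value.parse::<u32>() {
        Ok(id) if (MIN_SANDBOX_UID..=MAX_SANDBOX_UID).contains(&id) => Ok(()),
        // Covers 0, system IDs below 1000, and anything too large
        // (including digit strings that overflow `u32`).
        _ => Err(PolicyViolation::UidOutOfRange { value: value.to_string() }),
    }
}
```

Checking for ASCII digits before parsing is a deliberate choice in this sketch: Rust's integer parser accepts a leading `+`, which the policy probably should not.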

**Supervisor** — `validate_sandbox_user()`, `drop_privileges()`, and `prepare_filesystem()` must accept numeric UIDs:
- If the value parses as `u32`, skip `/etc/passwd` lookup and use the UID directly (`setuid()`/`setgid()` do not require a passwd entry).
- If it's a name, keep the existing name-based lookup.
- SSH session should derive `USER=sandbox` and `HOME=/sandbox` as defaults when no passwd entry exists.
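The branching described in these bullets might look roughly like the sketch below. The names `resolve_uid` and `ssh_identity_env` are hypothetical, and the `lookup_by_name` closure stands in for the existing `User::from_name()` lookup so that the example stays free of external crates.

```rust
/// Resolves the policy's `run_as_user` value to a numeric UID.
///
/// `lookup_by_name` is a stand-in for the existing passwd lookup; it returns
/// the UID for a user name, or `None` if the image has no such entry.
pub fn resolve_uid<F>(run_as_user: &str, lookup_by_name: F) -> Result<u32, String>
where
    F: Fn(&str) -> Option<u32>,
{
    let is_numeric =
        !run_as_user.is_empty() && run_as_user.bytes().all(|b| b.is_ascii_digit());
    if is_numeric {
        // Numeric value: use it directly, with no passwd lookup.
        return run_as_user
            .parse::<u32>()
            .map_err(|e| format!("invalid numeric UID {:?}: {}", run_as_user, e));
    }
    // Name: keep the existing name-based path.
    lookup_by_name(run_as_user)
        .ok_or_else(|| format!("sandbox user '{}' not found in image", run_as_user))
}

/// Returns `(USER, HOME)` for an SSH session.
///
/// `passwd_entry` is `(name, home_dir)` when the UID has a passwd entry;
/// otherwise the documented defaults are used.
pub fn ssh_identity_env(passwd_entry: Option<(&str, &str)>) -> (String, String) {
    let (user, home) = passwd_entry.unwrap_or(("sandbox", "/sandbox"));
    (user.to_string(), home.to_string())
}
```

Range enforcement is intentionally left out here on the assumption that the policy engine has already validated the value before the supervisor sees it.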

**Kubernetes driver** — Add `sandbox_uid`/`sandbox_gid` to driver config. Pass the UID to the supervisor through the policy's `run_as_user` field. The supervisor container stays `runAsUser: 0`. The PVC init container uses the injected UID for chown.

**OpenShift SCC-aware UID resolution** — On OpenShift, read namespace annotations to auto-select the sandbox UID:
1. Read namespace metadata via `Api<Namespace>::get()` (driver doesn't currently do this — requires adding the call).
2. Parse `openshift.io/sa.scc.uid-range` annotation (format: `<start>/<size>`, e.g., `1000660000/10000`). Use range start as sandbox UID.
3. Parse `openshift.io/sa.scc.supplemental-groups` for GID. Fall back to UID range start if absent.
4. If neither annotation is present (vanilla Kubernetes), fall back to configured `sandbox_uid`/`sandbox_gid`.
5. Validate resolved UID/GID against `[MIN_SANDBOX_UID, MAX_SANDBOX_UID]`.

This is passive detection (annotation presence) — no explicit "OpenShift mode" config flag needed.
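As a rough illustration of steps 2 to 5, the sketch below resolves the UID/GID from a plain annotation map. It assumes the annotations have already been fetched from the namespace (step 1 is not shown), that the supplemental-groups annotation uses the same `<start>/<size>` format as the UID range, and that a malformed annotation is an error rather than a silent fallback. The function names are hypothetical.

```rust
use std::collections::BTreeMap;

const MIN_SANDBOX_UID: u32 = 1000;
const MAX_SANDBOX_UID: u32 = 2_000_000_000;

const UID_RANGE_ANNOTATION: &str = "openshift.io/sa.scc.uid-range";
const SUPPLEMENTAL_GROUPS_ANNOTATION: &str = "openshift.io/sa.scc.supplemental-groups";

/// Parses `<start>/<size>` and returns the range start.
fn parse_range_start(raw: &str) -> Option<u32> {
    let (start, size) = raw.trim().split_once('/')?;
    let start = start.parse::<u32>().ok()?;
    let size = size.parse::<u32>().ok()?;
    (size > 0).then_some(start)
}

/// Returns `Ok(None)` when the annotation is absent and an error when it is
/// present but malformed (an assumption of this sketch).
fn annotation_start(
    annotations: &BTreeMap<String, String>,
    key: &str,
) -> Result<Option<u32>, String> {
    match annotations.get(key) {
        None => Ok(None),
        Some(raw) => parse_range_start(raw)
            .map(Some)
            .ok_or_else(|| format!("malformed {} annotation: {:?}", key, raw)),
    }
}

/// Picks the sandbox UID/GID: annotations first, configured values otherwise.
pub fn resolve_sandbox_ids(
    annotations: &BTreeMap<String, String>,
    configured_uid: u32,
    configured_gid: u32,
) -> Result<(u32, u32), String> {
    let uid_start = annotation_start(annotations, UID_RANGE_ANNOTATION)?;
    let gid_start = annotation_start(annotations, SUPPLEMENTAL_GROUPS_ANNOTATION)?;

    let uid = uid_start.unwrap_or(configured_uid);
    // GID: supplemental groups, then the UID range start, then config.
    let gid = gid_start.or(uid_start).unwrap_or(configured_gid);

    for (label, id) in [("UID", uid), ("GID", gid)] {
        if !(MIN_SANDBOX_UID..=MAX_SANDBOX_UID).contains(&id) {
            return Err(format!(
                "resolved sandbox {} {} is outside [{}, {}]",
                label, id, MIN_SANDBOX_UID, MAX_SANDBOX_UID
            ));
        }
    }
    Ok((uid, gid))
}
```

On a vanilla Kubernetes namespace the map simply lacks both keys, so the configured `sandbox_uid`/`sandbox_gid` pass through unchanged, subject to the same range check.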

**VM driver** — Use configurable UID instead of hardcoded `10001`. The VM driver controls its own rootfs, so it can continue creating the user at rootfs prep time.

**BYOC / docs** — Remove `groupadd`/`useradd` requirement from examples and documentation.

### Alternative Approaches Considered

1. **NSS module in the supervisor** — Synthesize a `sandbox` passwd entry via custom NSS module. Adds runtime dependency and complexity. Rejected: directly using numeric UIDs is simpler and more portable.

2. **Init container running `useradd`** — Create the user at container start. Requires the image to have user management tools and writable `/etc/passwd`. Rejected: many minimal images lack these tools.

3. **Better documentation only** — Just improve the BYOC docs. Doesn't solve the underlying friction or OpenShift UID range incompatibility.

### Patterns to Follow

- The Kubernetes driver already constructs `securityContext` on pod specs (`driver.rs:970-981`) — the sandbox UID injection follows the same JSON manipulation pattern.
- The policy engine already has a `PolicyViolation` enum with descriptive variants and `Display` impls — the new `UidOutOfRange` variant should follow the same pattern.
- The supervisor's `drop_privileges()` already has defense-in-depth verification (confirms UID changed, confirms root can't be re-acquired) — numeric UID support must maintain these checks.

## Proposed Approach

Implement in three phases. Phase 1 teaches the policy engine and supervisor to accept numeric UIDs within a safe range (`[1000, 2_000_000_000]`), removing the hard dependency on a `/etc/passwd` entry. Phase 2 adds `sandbox_uid`/`sandbox_gid` config to the Kubernetes driver and injects those values via the policy, with passive OpenShift SCC annotation detection for automatic UID selection on OpenShift clusters. Phase 3 removes the image-side user requirement from examples, docs, and e2e tests.

## Scope Assessment

- **Complexity:** Medium
- **Confidence:** High (clear path for Phases 1-2; Phase 3 is straightforward cleanup)
- **Estimated files to change:** ~11
- **Issue type:** `feat`

## Risks & Open Questions

1. **Programs requiring a passwd entry** — Some programs (`sudo`, `ssh`) fail if the running UID has no `/etc/passwd` entry. Should the supervisor write a synthetic passwd entry at startup before dropping privileges?

2. **Home directory creation** — If the image doesn't have `/sandbox` created, who creates it? The supervisor could create and chown it during `prepare_filesystem()`, but this needs to happen before the child process starts.

3. **File ownership in image layers** — Files in the image owned by the old sandbox UID will appear as owned by a different user. Only affects images previously built with the sandbox user.

4. **Security boundary** — The range check (`MIN_SANDBOX_UID = 1000` through `MAX_SANDBOX_UID = 2_000_000_000`) replaces the current `"sandbox"` string check as the non-root invariant. It rejects UID 0, system UIDs below 1000, and unreasonably large values. The range is enforced as platform-level constants, not per-policy configurable.

5. **Docker driver** — The Docker driver uses `--user` on `docker run`. Should it also adopt the configurable UID pattern, or is this Kubernetes-only initially?

6. **Gateway config docs** — If `sandbox_uid`/`sandbox_gid` are added to the Kubernetes driver config, `docs/reference/gateway-config.mdx` and relevant compute-driver setup docs must be updated.
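For open question 1, one possible shape of a synthetic passwd entry is sketched below. This is only an illustration of the idea, not a decision: the function name `ensure_passwd_entry`, the `/bin/sh` shell, and the choice to operate on the file contents as a string (leaving the actual read and write of `/etc/passwd` to the caller) are all assumptions.

```rust
/// Returns passwd contents that are guaranteed to contain an entry for `uid`.
///
/// If some line already maps `uid`, the input is returned unchanged;
/// otherwise a `sandbox` entry with home `/sandbox` is appended.
/// Standard passwd fields: name:password:UID:GID:GECOS:home:shell.
pub fn ensure_passwd_entry(passwd: &str, uid: u32, gid: u32) -> String {
    let uid_field = uid.to_string();
    let already_present = passwd
        .lines()
        .any(|line| line.split(':').nth(2) == Some(uid_field.as_str()));
    if already_present {
        return passwd.to_string();
    }

    let mut updated = passwd.to_string();
    // Make sure the new entry starts on its own line.
    if !updated.is_empty() && !updated.ends_with('\n') {
        updated.push('\n');
    }
    // The shell path is an assumption; minimal images may not ship /bin/sh.
    updated.push_str(&format!("sandbox:x:{}:{}::/sandbox:/bin/sh\n", uid, gid));
    updated
}
```

The function is idempotent, so calling it on every supervisor start would not accumulate duplicate entries. It does not handle an image that already has a `sandbox` name mapped to a different UID, which would need a separate decision.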

## Test Considerations

- **Unit tests** — Policy validation tests must cover: `"sandbox"` (pass), numeric UID in range (pass), numeric UID out of range (fail), `"root"` (fail), `"0"` (fail), non-numeric string (fail). Supervisor tests must cover numeric UID privilege dropping without passwd entry.
- **Integration tests** — Kubernetes driver tests must verify pod spec includes correct `securityContext` when `sandbox_uid` is configured, and that OpenShift annotation parsing works correctly.
- **E2E tests** — Update `e2e/rust/tests/custom_image.rs` to use an image without a baked-in sandbox user. Verify sandbox creation, privilege dropping, and filesystem ownership work with injected UIDs.
- **Existing test patterns** — `openshell-driver-kubernetes/src/driver.rs` has tests like `supervisor_sideload_injects_run_as_user_zero()` that verify `securityContext` — new tests should follow this pattern. Policy validation tests in `openshell-policy/src/lib.rs` use `validate_sandbox_policy()` assertions — extend these.

---
*Created by spike investigation. Use `build-from-issue` to plan and implement.*
