rendez-vous-exec — a tiny remote exec over WebSockets

  1. Goals & non-goals

Goals

  • Run a process on a remote worker by streaming stdin → process and piping stdout/stderr back in near-real-time.
  • Both client and worker connect outbound over HTTPS/WSS to a public Broker.
  • Keep the protocol texty where helpful and binary where needed, debuggable with wscat/curl.
  • Minimal auth: static bearer tokens (separate lists for clients vs workers).

Non-goals (v0)

  • No PTY/interactive terminal, no window resize.
  • No JWT/mTLS, no job caching, no queues beyond a simple rendezvous.
  • No multiplexing multiple jobs per socket (one WS per job).
  • No sandboxing beyond "workers run in Docker/containment the ops team provides".
  • No TLS termination in Broker (handled by Kubernetes ingress).

  2. Components

  • Broker (internet-facing):
    • Receives WebSocket connections (TLS termination handled by Kubernetes ingress).
    • Authenticates connections via static tokens.
    • Pairs one Client and one Worker on the same job_category.
    • Relays frames bidirectionally without inspecting payloads (except control headers for pairing).
  • Client:
    • Connects to Broker (WSS), authenticates as client.
    • Sends a spawn request (command/env/cwd), then streams stdin chunks and closes.
    • Receives stdout/stderr chunks and exit status.
  • Worker:
    • Connects to Broker (WSS), authenticates as worker.
    • Advertises which job_category it handles (can spawn 0 to many jobs).
    • On pairing, receives spawn, starts the process, wires pipes, streams output, reports exit.

Concurrency model (KISS)

  • One job per WebSocket connection.
  • A “beefy” worker runs N instances of the worker agent (or starts N connections), each handling one job.

  3. Transport & deployment

  • Transport: WebSocket (Broker runs plain WS, TLS handled by Kubernetes ingress).
  • Subprotocol (optional): Sec-WebSocket-Protocol: rendezexec.v0
  • Ports: Broker listens on 8080 internally, exposed as 443 via ingress.
  • Timeouts (suggested defaults):
    • Pairing wait: 30s (client or worker waits for its counterpart).
    • Idle no-data: 5 min (tear down if no app traffic and child finished).
    • Ping/heartbeat: every 20s (Broker → both sides), 20s grace.
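The idle-teardown rule above can be captured as a small predicate. This is only an illustrative sketch; the `Timeouts` struct and `should_tear_down` function are our own names, not part of any published API:

```rust
use std::time::Duration;

/// Suggested timeout defaults from this section (names are illustrative).
#[derive(Debug, Clone, Copy)]
struct Timeouts {
    pair_wait: Duration,
    idle_no_data: Duration,
    heartbeat: Duration,
    heartbeat_grace: Duration,
}

impl Default for Timeouts {
    fn default() -> Self {
        Timeouts {
            pair_wait: Duration::from_secs(30),
            idle_no_data: Duration::from_secs(5 * 60),
            heartbeat: Duration::from_secs(20),
            heartbeat_grace: Duration::from_secs(20),
        }
    }
}

/// Tear down only when the child has finished AND there has been no app
/// traffic for the idle window.
fn should_tear_down(t: &Timeouts, child_finished: bool, idle_for: Duration) -> bool {
    child_finished && idle_for >= t.idle_no_data
}

fn main() {
    let t = Timeouts::default();
    // Child finished, quiet for 6 minutes: tear down.
    assert!(should_tear_down(&t, true, Duration::from_secs(360)));
    // Child still running: keep the connection regardless of idleness.
    assert!(!should_tear_down(&t, false, Duration::from_secs(360)));
}
```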

  4. Authentication & authorization (static tokens)

  • Broker config contains two sets:
    • client_tokens: [ "c_abc…", "c_def…" ]
    • worker_tokens: [ "w_abc…", "w_def…" ]
  • Client/Worker send an HTTP Authorization header during the WS handshake:
    • Authorization: Bearer <token>
  • Broker maps token → role. If the token is not found or has the wrong role → 401.
  • Pairing is authorized by role only; any valid client can pair with any valid worker. (Good enough for v0; refine later if you need scoping.)
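The token → role mapping is small enough to sketch in full. A minimal, dependency-free version (the `TokenSets` type and `authorize` method are illustrative names, not the broker's real API):

```rust
use std::collections::HashSet;

/// Role a bearer token maps to.
#[derive(Debug, PartialEq)]
enum Role {
    Client,
    Worker,
}

struct TokenSets {
    client_tokens: HashSet<String>,
    worker_tokens: HashSet<String>,
}

impl TokenSets {
    /// Map an "Authorization" header value to a role.
    /// None means the upgrade should be rejected with HTTP 401.
    fn authorize(&self, header: &str) -> Option<Role> {
        let token = header.strip_prefix("Bearer ")?;
        if self.client_tokens.contains(token) {
            Some(Role::Client)
        } else if self.worker_tokens.contains(token) {
            Some(Role::Worker)
        } else {
            None
        }
    }
}

fn main() {
    let sets = TokenSets {
        client_tokens: ["c_abc".to_string()].into_iter().collect(),
        worker_tokens: ["w_abc".to_string()].into_iter().collect(),
    };
    assert_eq!(sets.authorize("Bearer c_abc"), Some(Role::Client));
    assert_eq!(sets.authorize("Bearer w_abc"), Some(Role::Worker));
    assert_eq!(sets.authorize("Bearer nope"), None); // unknown token → 401
    assert_eq!(sets.authorize("Basic c_abc"), None); // wrong scheme → 401
}
```

"Wrong role" (a worker token presented on the client endpoint) would be checked one step later, against the endpoint the connection arrived on.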

  5. Connection endpoints & query parameters

  • Client connects: wss://broker.example/ws/client?job_category=<category>&version=0
  • Worker connects: wss://broker.example/ws/worker?job_category=<category>&version=0
  • job_category identifies the type of work (e.g., "compute", "batch", "gpu").
  • Multiple workers can advertise the same job_category (load distributed among them).
  • If no workers are available for the requested category, the client waits until one becomes available.
  • Broker pairs clients with available workers in the same job_category.
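The pairing rule itself is simple: first available worker in the same category. A minimal in-memory sketch (a real broker would use a concurrent map such as dashmap plus the pairing timeout; `WaitingRoom` and its methods are hypothetical names):

```rust
use std::collections::{HashMap, VecDeque};

/// Waiting room keyed by job_category, holding connection ids of idle workers.
#[derive(Default)]
struct WaitingRoom {
    workers: HashMap<String, VecDeque<u64>>,
}

impl WaitingRoom {
    /// A worker connected and advertised a category: it waits here.
    fn worker_arrived(&mut self, category: &str, conn_id: u64) {
        self.workers
            .entry(category.to_string())
            .or_default()
            .push_back(conn_id);
    }

    /// A client arrived: pair it with the first waiting worker in the same
    /// category, or None → the client waits (up to the pairing timeout).
    fn client_arrived(&mut self, category: &str) -> Option<u64> {
        self.workers.get_mut(category)?.pop_front()
    }
}

fn main() {
    let mut room = WaitingRoom::default();
    room.worker_arrived("compute", 1);
    room.worker_arrived("compute", 2);
    assert_eq!(room.client_arrived("compute"), Some(1)); // FIFO within a category
    assert_eq!(room.client_arrived("compute"), Some(2));
    assert_eq!(room.client_arrived("gpu"), None); // no gpu workers: client waits
}
```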

  6. Protocol v0 (simple, debuggable)

WebSocket messages are the unit. Libraries reassemble fragmented frames; treat each WS message as one record.

Message classes

  • Control: WS Text frames. First byte is an ASCII type (visible), remainder is UTF-8 JSON payload (or a simple ASCII token with no JSON).
  • Data: WS Binary frames. First byte is an ASCII type, remainder is raw bytes (untouched).

Type byte legend (ASCII)

  • H – Hello/ack (control, text JSON)
  • S – Spawn request (control, text JSON) — Client → Worker
  • T – Terminate request (control, text JSON) — Client → Worker (e.g., SIGTERM/SIGKILL)
  • X – Exit status (control, text JSON) — Worker → Client
  • E – Error (control, text JSON) — either side → peer (broker may synthesize)
  • I – stdin chunk (data, binary) — Client → Worker
  • i – stdin EOF (control, no JSON or empty JSON) — Client → Worker
  • O – stdout chunk (data, binary) — Worker → Client
  • R – stderr chunk (data, binary) — Worker → Client
  • P – Ping/heartbeat (control, tiny JSON { "t": unix_ms }) — either direction (Broker may inject pings as WebSocket pings instead; up to impl)

We intentionally picked visible ASCII for the leading byte so you can read dumps. Control frames are human-readable JSON; data frames put raw bytes after the type byte with no extra framing (the WS message boundary is the framing).
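Because the WS message boundary is the framing, the codec is almost trivial: prepend one byte on send, split one byte on receive. A dependency-free sketch (the `Frame` enum and function names are illustrative):

```rust
/// One decoded v0 message: a leading ASCII type byte, then the payload.
#[derive(Debug, PartialEq)]
enum Frame<'a> {
    Control { kind: u8, json: &'a str }, // arrived as a WS Text frame
    Data { kind: u8, bytes: &'a [u8] },  // arrived as a WS Binary frame
}

/// Build the body of a binary (data) WS message: type byte + raw payload.
fn encode_data(kind: u8, payload: &[u8]) -> Vec<u8> {
    let mut msg = Vec::with_capacity(1 + payload.len());
    msg.push(kind);
    msg.extend_from_slice(payload);
    msg
}

/// Classify a received WS message; `is_text` comes from the WS frame type.
/// Returns None for an empty message or non-UTF-8 control payload.
fn decode(msg: &[u8], is_text: bool) -> Option<Frame<'_>> {
    let (&kind, rest) = msg.split_first()?;
    if is_text {
        Some(Frame::Control { kind, json: std::str::from_utf8(rest).ok()? })
    } else {
        Some(Frame::Data { kind, bytes: rest })
    }
}

fn main() {
    // A stdout chunk: 'O' + raw bytes, no extra length prefix needed.
    let stdout_chunk = encode_data(b'O', b"hello\n");
    assert_eq!(
        decode(&stdout_chunk, false),
        Some(Frame::Data { kind: b'O', bytes: b"hello\n" })
    );

    // An exit status control message, readable straight from a dump.
    let exit = b"X{\"status\":\"exited\",\"code\":0}";
    match decode(exit, true) {
        Some(Frame::Control { kind: b'X', json }) => {
            assert_eq!(json, "{\"status\":\"exited\",\"code\":0}");
        }
        other => panic!("unexpected: {:?}", other),
    }
}
```

Note that an empty I data frame decodes to `Data` with zero payload bytes, which matches the "empty I frame is a no-op" rule in 6.2.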

6.1 Control payloads (JSON)

• H Hello (both sides send immediately after pair established)

{ "role": "client|worker", "proto": 0, "job_category": "compute", "agent": "rendezexec-client/0.1.0 (linux)" }

• S Spawn (Client → Worker, then Worker responds with H if not already, and later X)

{
  "argv": ["/usr/bin/sort", "-r"],
  "env": { "LC_ALL": "C" },
  "cwd": "/work",            // optional
  "stdin": "stream",         // "stream" (default) | "none"
  "stdout_mode": "stream",   // reserved for future (e.g., "discard")
  "stderr_mode": "stream",   // reserved for future
  "time_limit_ms": 0,        // 0 = no limit (v0: hint only)
  "max_output_bytes": 0      // 0 = unbounded (v0: hint only)
}
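Before spawning, a worker would sanity-check the S payload. A minimal sketch of the checks implied by the fields above (`validate_spawn` and its error strings are illustrative, not part of the spec):

```rust
/// Reject obviously unusable spawn requests before touching the OS.
/// In the real worker this would run after serde_json deserialization.
fn validate_spawn(argv: &[&str], stdin: &str) -> Result<(), &'static str> {
    if argv.is_empty() {
        // Nothing to exec: answer with an E { stage:"spawn" } frame.
        return Err("argv must contain at least the program path");
    }
    match stdin {
        "stream" | "none" => Ok(()),
        _ => Err("stdin must be \"stream\" or \"none\""),
    }
}

fn main() {
    assert!(validate_spawn(&["/usr/bin/sort", "-r"], "stream").is_ok());
    assert!(validate_spawn(&[], "stream").is_err());          // nothing to exec
    assert!(validate_spawn(&["/bin/cat"], "pipe").is_err());  // unknown mode
}
```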

• T Terminate (Client → Worker)

{ "signal": "TERM" } // "TERM" | "KILL" | "INT" (v0: map to SIGTERM/SIGKILL/SIGINT)

• X Exit (Worker → Client)

{ "status": "exited", "code": 0 }            // normal exit
// or
{ "status": "signaled", "signal": "KILL" }

• E Error (either side)

{ "where": "broker|client|worker", "stage": "auth|pair|spawn|io|protocol", "message": "text", "retryable": true }

6.2 Data payloads

  • I: first byte 0x49 ('I'), remaining bytes are raw stdin data.
    • An empty I frame is allowed (no-op).
    • EOF is signaled by a separate i control message (text or empty JSON).
  • O: first byte 0x4F ('O'), remaining bytes are raw stdout data.
  • R: first byte 0x52 ('R'), remaining bytes are raw stderr data.

Rationale: Using WS message boundaries avoids length-prefixing. Keeping stdin EOF as a distinct i control avoids ambiguity with zero-length chunks.

6.3 Ordering & flow control

  • Flow control: WebSocket's built-in TCP backpressure handles flow control naturally.
    • If send buffers fill up, the sender automatically slows down.
    • Workers pause reading from child process pipes when the WS send buffer is full.
  • Ordering:
    • Within a given stream (O vs R vs I) the order of frames MUST be preserved.
    • Across streams no ordering is guaranteed.
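The backpressure idea can be shown in miniature with a bounded channel standing in for the WS send buffer: when the consumer (the network) is slow, `send` blocks, which is exactly the point at which the worker should stop reading the child's pipes. This is an analogy, not the real async implementation:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Tiny "send buffer": holds at most 4 chunks in flight.
    let (tx, rx) = sync_channel::<Vec<u8>>(4);

    let producer = thread::spawn(move || {
        for n in 0..100u32 {
            // Blocks whenever the buffer is full — this is where a worker
            // would pause reading from the child's stdout/stderr pipes.
            tx.send(n.to_be_bytes().to_vec()).unwrap();
        }
        // Dropping tx ends the stream (like stdin EOF / socket close).
    });

    // Slow consumer draining the buffer in order.
    let mut received = Vec::new();
    for chunk in rx {
        received.push(u32::from_be_bytes(chunk.try_into().unwrap()));
    }
    producer.join().unwrap();

    // All chunks arrive, in order, with bounded memory throughout.
    assert_eq!(received.len(), 100);
    assert!(received.windows(2).all(|w| w[0] < w[1]));
}
```

The same property is what acceptance criterion 5 ("large stdout does not crash; memory bounded") relies on.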

  7. Lifecycle / sequences

7.1 Normal run

Client                                          Broker                       Worker
  |--(WS connect /ws/client?job_category=compute)->|                           |
  |   Authorization: Bearer c_token                |                           |
  |                                                |<--(WS connect /ws/worker?job_category=compute)--|
  |                                                |   Authorization: Bearer w_token                 |
  |<-- pair ok ------------------------------------|-- pair ok --------------->|
  |-- H {"role":"client",...} -------------------->|-- relay ----------------->|
  |<-- relay --------------------------------------|<-- H {"role":"worker",...}|
  |-- S {argv, env, ...} ------------------------->|-- relay ----------------->|
  |-- I ------------------------------------------>|-- relay ----------------->|
  |-- i (stdin EOF) ------------------------------>|-- relay ----------------->|
  |<-- relay --------------------------------------|<-- O ---------------------|
  |<-- relay --------------------------------------|<-- R ---------------------|
  |<-- relay --------------------------------------|<-- X {status,code} -------|
  |-- close -------------------------------------->|                           |

7.2 Cancel

Client sends T {"signal":"TERM"}; Worker delivers SIGTERM to the child. If the child has not exited within e.g. 5s (implementation detail), the client may send T {"signal":"KILL"}.

7.3 Pairing timeout

If no worker is available for the requested job_category within 30s, the Broker sends E with stage "pair" and message "No workers available for category", then closes.

  8. Failure handling

  • Auth fail: HTTP 401 on upgrade (no WS).
  • Category unavailable: if no workers exist for the requested job_category, the client waits or times out.
  • Spawn failure: Worker sends E { stage:"spawn", message:"ENOENT ..." } followed by X { status:"exited", code:127 }.
  • Stream errors: if any pipe read/write fails, send E { stage:"io" }, then attempt a graceful X.
  • Broker restarts: Clients/workers should retry with exponential backoff (e.g., 200ms → 3s, jitter) until an overall deadline.
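The reconnect schedule (200ms doubling up to a 3s cap) is easy to get wrong around overflow, so here is a deterministic sketch. Jitter is deliberately left out to keep it testable; a real client would add random jitter to avoid thundering herds (`backoff` is an illustrative name):

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based):
/// 200ms, 400ms, 800ms, 1.6s, then capped at 3s forever.
fn backoff(attempt: u32) -> Duration {
    let base = Duration::from_millis(200);
    let cap = Duration::from_secs(3);
    // checked_shl avoids overflow for very large attempt counts;
    // saturating_mul avoids overflow inside Duration arithmetic.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(cap)
}

fn main() {
    assert_eq!(backoff(0), Duration::from_millis(200));
    assert_eq!(backoff(1), Duration::from_millis(400));
    assert_eq!(backoff(3), Duration::from_millis(1600));
    assert_eq!(backoff(4), Duration::from_secs(3));  // 3.2s, capped
    assert_eq!(backoff(40), Duration::from_secs(3)); // no overflow
}
```

The "overall deadline" from the bullet above would simply stop calling `backoff` once the total elapsed time exceeds it.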

  9. Configuration (suggested)

9.1 Broker (TOML/YAML)

[server]
listen_addr = "0.0.0.0:8080"   # Plain WS, TLS handled by Kubernetes ingress

[security]
# Exact header must be "Authorization: Bearer <token>"
client_tokens = ["c_dev_local", "c_team_1"]
worker_tokens = ["w_dev_local", "w_pool_1"]

[limits]
pair_timeout_ms = 30000
idle_timeout_ms = 300000
max_ws_msg_bytes = 1048576

9.2 Worker (TOML/YAML)

[broker]
url = "wss://broker.example/ws/worker"
token = "w_pool_1"

[job]
job_category = "compute"   # Category this worker handles

[exec]
# Spawn directly in the container namespace the agent runs in.
# Optional: map a safe PATH, drop caps, rlimits, etc. (v0 can skip)

9.3 Client (TOML/YAML)

[broker]
url = "wss://broker.example/ws/client"
token = "c_team_1"

[job]
job_category = "compute"   # Category to request

  10. Rust implementation notes (recommended crates)

  • Broker: axum (+ tokio-tungstenite), tower, dashmap for the waiting room, tokio::sync::mpsc for pumps.
  • Client/Worker: tokio, tokio-tungstenite, serde_json, nix (signals), tokio::process::Command.
  • Binary hygiene: use WS Text frames for control, Binary for data. Reject oversized messages (max_ws_msg_bytes).
  • Flow control: WebSocket's TCP backpressure handles flow naturally; pause pipe reads when the send buffer is full.

  11. Process execution semantics (Worker)

  • Spawn via tokio::process::Command with piped stdin/stdout/stderr.
  • On i (stdin EOF), close the child’s stdin.
  • Map T {"signal":"TERM"} → SIGTERM, "KILL" → SIGKILL, "INT" → SIGINT (best-effort on Windows).
  • When the child exits, send X { status:"exited", code:N } or X { status:"signaled", signal:"TERM" }, then close the WS.
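The signal mapping and the X payload shapes can be pinned down with a dependency-free sketch. In the real worker, nix::sys::signal and serde_json would replace the raw numbers and format strings used here; the function names are illustrative:

```rust
/// Map v0 signal names from a T message onto Unix signal numbers.
fn signal_number(name: &str) -> Option<i32> {
    match name {
        "TERM" => Some(15), // SIGTERM
        "KILL" => Some(9),  // SIGKILL
        "INT" => Some(2),   // SIGINT
        _ => None,          // unknown name: answer with an E frame instead
    }
}

/// Render the X payload for a finished child. A real worker would read
/// these from ExitStatus (code() / ExitStatusExt::signal()).
fn exit_payload(code: Option<i32>, signal: Option<&str>) -> String {
    match (code, signal) {
        (Some(c), _) => format!("{{\"status\":\"exited\",\"code\":{}}}", c),
        (None, Some(s)) => format!("{{\"status\":\"signaled\",\"signal\":\"{}\"}}", s),
        // Neither code nor signal should not happen on Unix; pick a sentinel.
        (None, None) => "{\"status\":\"exited\",\"code\":-1}".to_string(),
    }
}

fn main() {
    assert_eq!(signal_number("TERM"), Some(15));
    assert_eq!(signal_number("HUP"), None);
    assert_eq!(exit_payload(Some(0), None), "{\"status\":\"exited\",\"code\":0}");
    assert_eq!(
        exit_payload(None, Some("KILL")),
        "{\"status\":\"signaled\",\"signal\":\"KILL\"}"
    );
}
```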

  12. Security considerations (v0)

  • Broker runs plain WS internally; Kubernetes ingress handles TLS termination for external WSS.
  • Tokens are shared secrets; keep them in an env/secret store. Rotate manually for now.
  • Workers should run inside Docker/k8s with CPU/mem limits; run as non-root; restrict the filesystem.
  • Broker must not exec anything; it only relays.

  13. Observability (minimal)

  • Structured logs (JSON) with a trace_id per connection, plus role and remote IP (if available).
  • Counters: connections_open, jobs_started, jobs_completed, bytes_in/out per role.
  • Gauges: waiting_clients, waiting_workers.
  • Histograms: pairing_wait_ms, job_duration_ms, ws_send_queue_depth.

  14. CLI sketches (non-binding)

# Worker (one job per connection)
rendezexec-worker --broker wss://b.example/ws/worker \
  --token $W_TOKEN \
  --job-category compute

# Client (reads stdin, prints stdout/stderr)
rendezexec-client --broker wss://b.example/ws/client \
  --token $C_TOKEN \
  --job-category compute
# argv/env/cwd in JSON on stdin or via flags

Example spawn control (Client sends after connect):

S {"argv":["/bin/cat"],"env":{},"cwd":"/tmp","stdin":"stream"}
I ...
i

  15. Edge cases & open invitations (where devs can refine)

  • Advanced pooling: workers can handle multiple job categories; dynamic load balancing.
  • Multiplexing: optional per-WS stream IDs to run many jobs on one socket (yamux-like).
  • Richer auth: per-tenant tokens, allowlists per worker pool, or mTLS.
  • Flow control: explicit window updates (HTTP/2-style) for large transfers.
  • Output limits: enforce max_output_bytes on the Broker to avoid egress bill shock.
  • Retry semantics: client resubmission rules if a worker crashes before X.

Minimal acceptance criteria (v0)

  1. Happy path: echo "hello" | client → broker → worker → /bin/cat returns hello and X {code:0}.
  2. EOF behavior: stdin i closes child stdin; process exits.
  3. Cancel: sending T {"signal":"TERM"} terminates a long-running sleep 9999 within a few seconds; worker reports signaled exit.
  4. Timeouts: if no worker available for job_category, client gets E after 30s and closes.
  5. Flow control: large stdout (>50MB) does not crash; TCP backpressure naturally throttles; memory bounded.
  6. Auth: wrong token → 401; wrong role on endpoint → 401/403.
  7. Categories: multiple workers can serve same job_category; load distributes among them.

Single Binary

To keep things simple, we could have a single binary that acts as client, broker, or worker based on the parameters.

Testing

Integration tests and test-driven development can be considered. We don't want to overdo the tests, but good test coverage and a sound testing strategy will help ensure the system is robust and maintainable.