Goals & non-goals
Goals
• Run a process on a remote worker by streaming stdin → process and piping stdout/stderr back in near-real-time.
• Both client and worker connect outbound over HTTPS/WSS to a public Broker.
• Keep the protocol textual where helpful and binary where needed, debuggable with wscat/curl.
• Minimal auth: static bearer tokens (separate lists for clients vs workers).
Non-goals (v0)
• No PTY/interactive terminal, no window resize.
• No JWT/mTLS, no job caching, no queues beyond a simple rendezvous.
• No multiplexing of multiple jobs per socket (one WS per job).
• No sandboxing beyond "workers run in Docker/containment the ops team provides".
• No TLS termination in the Broker (handled by Kubernetes ingress).
⸻
Components
• Broker (internet-facing):
  • Receives WebSocket connections (TLS termination handled by Kubernetes ingress).
  • Authenticates connections via static tokens.
  • Pairs one Client and one Worker on the same job_category.
  • Relays frames bidirectionally without inspecting payloads (except control headers for pairing).
• Client:
  • Connects to the Broker (WSS), authenticates as a client.
  • Sends a spawn request (command/env/cwd), then streams stdin chunks and closes.
  • Receives stdout/stderr chunks and the exit status.
• Worker:
  • Connects to the Broker (WSS), authenticates as a worker.
  • Advertises which job_category it handles (can spawn 0 to many jobs).
  • On pairing, receives the spawn request, starts the process, wires up the pipes, streams output, and reports the exit status.
Concurrency model (KISS)
• One job per WebSocket connection.
• A "beefy" worker runs N instances of the worker agent (or opens N connections), each handling one job.
⸻
Transport & deployment
• Transport: WebSocket (Broker runs plain WS; TLS is handled by Kubernetes ingress).
• Subprotocol (optional): Sec-WebSocket-Protocol: rendezexec.v0
• Ports: Broker listens on 8080 internally, exposed as 443 via ingress.
• Timeouts (suggested defaults):
  • Pairing wait: 30s (client or worker waits for its counterpart).
  • Idle no-data: 5 min (tear down if there is no app traffic and the child has finished).
  • Ping/heartbeat: every 20s (Broker → both sides), 20s grace.
⸻
Authentication & authorization (static tokens)
• Broker config contains two sets:
  • client_tokens: [ "c_abc…", "c_def…" ]
  • worker_tokens: [ "w_abc…", "w_def…" ]
• Client/Worker send an HTTP Authorization header during the WS handshake:
  • Authorization: Bearer <token>
• Broker maps token → role. If the token is not found or has the wrong role → 401.
• Pairing is authorized by role only; any valid client can pair with any valid worker. (Good enough for v0; refine later if you need scoping.)
⸻
Connection endpoints & query parameters
• Client connects: wss://broker.example/ws/client?job_category=<category>&version=0
• Worker connects: wss://broker.example/ws/worker?job_category=<category>&version=0
• job_category identifies the type of work (e.g., "compute", "batch", "gpu").
• Multiple workers can advertise the same job_category (load is distributed among them).
• If no worker is available for the requested category, the client waits until one becomes available.
• The Broker pairs clients with available workers in the same job_category.
⸻
Protocol v0 (simple, debuggable)
WebSocket messages are the unit. Libraries reassemble fragmented frames; treat each WS message as one record.
Message classes
• Control: WS Text frames. The first byte is an ASCII type (visible); the remainder is a UTF-8 JSON payload (or a simple ASCII token with no JSON).
• Data: WS Binary frames. The first byte is an ASCII type; the remainder is raw bytes (untouched).
Type byte legend (ASCII)
• H – Hello/ack (control, text JSON)
• S – Spawn request (control, text JSON) — Client → Worker
• T – Terminate request (control, text JSON) — Client → Worker (e.g., SIGTERM/SIGKILL)
• X – Exit status (control, text JSON) — Worker → Client
• E – Error (control, text JSON) — either side → peer (Broker may synthesize)
• I – stdin chunk (data, binary) — Client → Worker
• i – stdin EOF (control, no JSON or empty JSON) — Client → Worker
• O – stdout chunk (data, binary) — Worker → Client
• R – stderr chunk (data, binary) — Worker → Client
• P – Ping/heartbeat (control, tiny JSON { "t": unix_ms }) — either direction (the Broker may inject WebSocket pings instead; up to the implementation)
We intentionally picked visible ASCII for the leading byte so you can read dumps. Control frames are human-readable JSON; data frames put raw bytes after the type byte with no extra framing (the WS message boundary is the framing).
6.1 Control payloads (JSON)
• H Hello (both sides send immediately after the pair is established)
{
  "role": "client|worker",
  "proto": 0,
  "job_category": "compute",
  "agent": "rendezexec-client/0.1.0 (linux)"
}
• S Spawn (Client → Worker; the Worker responds with H if it has not already, and later with X)
{
  "argv": ["/usr/bin/sort", "-r"],
  "env": { "LC_ALL": "C" },
  "cwd": "/work",            // optional
  "stdin": "stream",         // "stream" (default) | "none"
  "stdout_mode": "stream",   // reserved for future (e.g., "discard")
  "stderr_mode": "stream",   // reserved for future
  "time_limit_ms": 0,        // 0 = no limit (v0: hint only)
  "max_output_bytes": 0      // 0 = unbounded (v0: hint only)
}
• T Terminate (Client → Worker)
{ "signal": "TERM" } // "TERM" | "KILL" | "INT" (v0: map to SIGTERM/SIGKILL/SIGINT)
• X Exit (Worker → Client)
{ "status": "exited", "code": 0 }          // normal exit
// or
{ "status": "signaled", "signal": "KILL" }
• E Error (either side)
{ "where": "broker|client|worker", "stage": "auth|pair|spawn|io|protocol", "message": "text", "retryable": true }
6.2 Data payloads
• I: first byte 0x49 ('I'); remaining bytes are raw stdin data.
  • An empty I frame is allowed (no-op).
  • EOF is signaled by a separate i control message (text, with no JSON or empty JSON).
• O: first byte 0x4F ('O'); remaining bytes are raw stdout data.
• R: first byte 0x52 ('R'); remaining bytes are raw stderr data.
Rationale: Using WS message boundaries avoids length-prefixing. Keeping stdin EOF as a distinct i control avoids ambiguity with zero-length chunks.
6.3 Ordering & flow control
• Flow control: WebSocket's built-in TCP backpressure handles flow control naturally.
  • If send buffers fill up, the sender slows down automatically.
  • Workers pause reading from child process pipes when the WS send buffer is full.
• Ordering:
  • Within a given stream (O vs R vs I), the order of frames MUST be preserved.
  • Across streams, no ordering is guaranteed.
⸻
Lifecycle / sequences
7.1 Normal run
Client                              Broker                              Worker
  |-- WS connect ------------------->|                                    |
  |   /ws/client?job_category=compute                                     |
  |   Authorization: Bearer c_token  |                                    |
  |                                  |<--------------------- WS connect --|
  |                                  |    /ws/worker?job_category=compute
  |                                  |    Authorization: Bearer w_token
  |<-- pair ok ----------------------|------------------------ pair ok -->|
  |-- H {"role":"client",...} ------>|-- relay -------------------------->|
  |<---------------------------------|<------- H {"role":"worker",...} ---|
  |-- S {argv, env, ...} ----------->|-- relay -------------------------->|
  |-- I (stdin chunks) ------------->|-- relay -------------------------->|
  |-- i (stdin EOF) ---------------->|-- relay -------------------------->|
  |<---------------------------------|<------------ O (stdout chunks) ----|
  |<---------------------------------|<------------ R (stderr chunks) ----|
  |<---------------------------------|<------------ X {status, code} -----|
  |-- close ------------------------>|-- close -------------------------->|
7.2 Cancel
Client sends T {"signal":"TERM"}; the Worker delivers SIGTERM to the child. If the child has not exited within e.g. 5s (implementation detail), the client may send T {"signal":"KILL"}.
7.3 Pairing timeout
If no worker is available for the requested job_category within 30s, the Broker sends E with stage:"pair" and message:"No workers available for category", then closes.
⸻
Failure handling
• Auth fail: HTTP 401 on upgrade (no WS).
• Category unavailable: if no workers exist for the requested job_category, the client waits or times out.
• Spawn failure: Worker sends E { stage:"spawn", message:"ENOENT ..." } followed by X { status:"exited", code:127 }.
• Stream errors: if any pipe read/write fails, send E { stage:"io" }, then attempt a graceful X.
• Broker restarts: clients/workers should retry with exponential backoff (e.g., 200ms → 3s, with jitter) until an overall deadline.
⸻
Configuration (suggested)
9.1 Broker (TOML/YAML)
[server]
listen_addr = "0.0.0.0:8080"   # Plain WS, TLS handled by Kubernetes ingress

[security]
client_tokens = ["c_dev_local", "c_team_1"]
worker_tokens = ["w_dev_local", "w_pool_1"]

[limits]
pair_timeout_ms = 30000
idle_timeout_ms = 300000
max_ws_msg_bytes = 1048576
9.2 Worker (TOML/YAML)
[broker]
url = "wss://broker.example/ws/worker"
token = "w_pool_1"

[job]
job_category = "compute"   # Category this worker handles
[exec]
9.3 Client (TOML/YAML)
[broker]
url = "wss://broker.example/ws/client"
token = "c_team_1"

[job]
job_category = "compute"   # Category to request
⸻
Rust implementation notes (recommended crates)
• Broker: axum (+ tokio-tungstenite), tower, dashmap for the waiting room, tokio::sync::mpsc for pumps.
• Client/Worker: tokio, tokio-tungstenite, serde_json, nix (signals), tokio::process::Command.
• Binary hygiene: use WS Text frames for control, Binary for data. Reject oversized messages (max_ws_msg_bytes).
• Flow control: WebSocket's TCP backpressure handles flow naturally; pause pipe reads when the send buffer is full.
⸻
Process execution semantics (Worker)
• Spawn via tokio::process::Command with piped stdin/stdout/stderr.
• On i (stdin EOF), close the child's stdin.
• Map T {"signal":"TERM"} → SIGTERM, "KILL" → SIGKILL, "INT" → SIGINT (best-effort on Windows).
• When the child exits, send X { status:"exited", code:N } or X { status:"signaled", signal:"TERM" }, then close the WS.
⸻
Security considerations (v0)
• Broker runs plain WS internally; the Kubernetes ingress handles TLS termination for external WSS.
• Tokens are shared secrets; keep them in an env/secret store. Rotate manually for now.
• Workers should run inside Docker/k8s with CPU/mem limits; run as non-root; restrict the filesystem.
• The Broker must not exec anything; it only relays.
⸻
Observability (minimal)
• Structured logs (JSON) with a trace_id per connection, plus role and remote IP (if available).
• Counters: connections_open, jobs_started, jobs_completed, bytes_in/out per role.
• Gauges: waiting_clients, waiting_workers.
• Histograms: pairing_wait_ms, job_duration_ms, ws_send_queue_depth.
⸻
CLI sketches (non-binding)
rendezexec-worker --broker wss://b.example/ws/worker \
  --token $W_TOKEN \
  --job-category compute

rendezexec-client --broker wss://b.example/ws/client \
  --token $C_TOKEN \
  --job-category compute
  # argv/env/cwd in JSON on stdin or flags
Example spawn control (Client sends after connect):
S {"argv":["/bin/cat"],"env":{},"cwd":"/tmp","stdin":"stream"}
I ...
i
⸻
Edge cases & open invitations (where devs can refine)
• Advanced pooling: workers can handle multiple job categories; dynamic load balancing.
• Multiplexing: optional per-WS stream IDs to run many jobs on one socket (yamux-like).
• Richer auth: per-tenant tokens, allowlists per worker pool, or mTLS.
• Flow control: explicit window updates (HTTP/2-style) for large transfers.
• Output limits: enforce max_output_bytes on the Broker to avoid egress bill shock.
• Retry semantics: client resubmission rules if a worker crashes before X.
⸻
Minimal acceptance criteria (v0)
- Happy path: echo "hello" | client → broker → worker → /bin/cat returns hello and X {code:0}.
- EOF behavior: stdin i closes child stdin; process exits.
- Cancel: sending T {"signal":"TERM"} terminates a long-running sleep 9999 within a few seconds; worker reports signaled exit.
- Timeouts: if no worker available for job_category, client gets E after 30s and closes.
- Flow control: large stdout (>50MB) does not crash; TCP backpressure naturally throttles; memory bounded.
- Auth: wrong token → 401; wrong role on endpoint → 401/403.
- Categories: multiple workers can serve same job_category; load distributes among them.
To keep things simple, we could have a single binary that acts as client, broker, or worker based on the parameters.
Integration tests and test-driven development can be considered. We don't want to overdo the tests, but good test coverage and a clear testing strategy will help keep the system robust and maintainable.