Background
PR #310 introduces the vsock host↔guest control channel with a Python stdlib guest agent (src/smolvm/guest_agent/agent.py). The Python agent was the right call to land the architecture, but it's the wrong long-term form for a lightweight-microVM platform.
The cost breakdown (what's real vs. not):
| Cost | Verdict |
| --- | --- |
| Per-command latency | Not the problem: the agent is a resident daemon, so requests don't pay interpreter startup; each run is a fork+exec of /bin/sh, same as SSH's exec. |
| Boot time | Minor, one-time: importing stdlib modules at /init is ~tens of ms. |
| Resident memory | Real: CPython holding socket/selectors/subprocess/... is ~10–20 MB RSS per VM. At 200+ VM density that's multiple GB doing nothing but listening. |
| Image size | Real: python3 (~50 MB on Alpine) was added to the minimal SSH images that didn't have it, contradicting the "boots in seconds, minimal footprint" goal. |
A static Rust binary fixes both: ~1–3 MB binary, ~1–2 MB idle RSS, instant start, and no python3 in the image. Rust (not Go) because the repo already has a Cargo workspace (smolvm-core) + CI + a size-tuned [profile.release] (lto = "fat", strip = "symbols").
Scope
The wire protocol is the contract (src/smolvm/comm/protocol.py, PROTOCOL_VERSION = 1), so the blast radius is contained:
- Unchanged: comm/protocol.py, VsockChannel (host), comm/select.py, CID allocation, QEMU device wiring, facade seam, all CLI.
- Changes: guest_agent/agent.py → a Rust binary crate; images/builder.py COPYs the arch-matched static binary to /usr/local/bin/smolvm-guest-agent and drops the python3 install; /init execs it before sshd.
Design
Crate: new workspace member guest-agent/ (binary), not sharing smolvm-core's deps (no pyo3/tokio). Deps: vsock (rust-vmm — Vsock{Listener,Stream} implement std Read/Write), serde_json (+ serde), libc (setsid/killpg, already in tree).
Behavior parity (1:1 with the Python handlers):
- Framing: >BI header, 16 MiB cap, 64 KiB chunks, frame tags 1–5.
- run: login → $SHELL -lc (fallback /bin/sh), raw → /bin/sh -c; inherit env + overrides; setsid via pre_exec; stdin null; timeout → kill(-(pid), SIGKILL).
- put_file: O_EXCL tempfile in parent dir → drain exactly size bytes even if open failed → chmod → atomic rename. std::path::absolute is the closest match for os.path.abspath (neither resolves symlinks), but on Unix it keeps .. components, which abspath collapses, so add a lexical .. pass for exact parity.
- get_file: stat → reject non-regular → report errors before the header frame → stream DATA + EOF.
- Server: bind VMADDR_CID_ANY:1024, thread-per-conn bounded by a 64-permit semaphore.
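The framing item above can be sketched with std only. This is a minimal illustration, not the agent's code: it assumes the >BI header means a 1-byte frame tag followed by a 4-byte big-endian payload length (which is what that struct format packs, 5 bytes with no padding), and the names write_frame, read_frame and MAX_FRAME are hypothetical.

```rust
use std::io::{self, Read, Write};

/// Largest payload accepted in one frame (the 16 MiB cap).
const MAX_FRAME: u32 = 16 * 1024 * 1024;

/// Write one frame: tag byte, big-endian u32 length, then the payload.
/// Assumption: this is the layout the Python side packs with ">BI".
fn write_frame<W: Write>(w: &mut W, tag: u8, payload: &[u8]) -> io::Result<()> {
    let len = u32::try_from(payload.len())
        .ok()
        .filter(|&n| n <= MAX_FRAME)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "payload exceeds frame cap"))?;
    let mut header = [0u8; 5];
    header[0] = tag;
    header[1..].copy_from_slice(&len.to_be_bytes());
    w.write_all(&header)?;
    w.write_all(payload)
}

/// Read one frame. The tag range (1..=5) and the length cap are checked
/// before any payload buffer is allocated.
fn read_frame<R: Read>(r: &mut R) -> io::Result<(u8, Vec<u8>)> {
    let mut header = [0u8; 5];
    r.read_exact(&mut header)?;
    let tag = header[0];
    let len = u32::from_be_bytes([header[1], header[2], header[3], header[4]]);
    if !(1..=5).contains(&tag) {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "unknown frame tag"));
    }
    if len > MAX_FRAME {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame exceeds 16 MiB cap"));
    }
    let mut payload = vec![0u8; len as usize];
    r.read_exact(&mut payload)?;
    Ok((tag, payload))
}
```

Because both functions are generic over Read/Write, the same code would work on a vsock stream, a Unix socket, or an in-memory buffer in tests.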
The two Rust-specific traps to get right:
- Concurrent stdout/stderr + timeout without selectors/async: one reader thread per pipe → bounded sync_channel (backpressure); the main thread is the sole writer to the connection (no mutex needed, no stray frame after the terminator). recv_timeout drives the deadline.
- SIGPIPE: Rust installs SIG_IGN by default, so writes to a closed conn return EPIPE instead of killing the daemon; treat as peer-hangup. catch_unwind per connection; keep panic = unwind (a daemon must survive one bad request), not abort.
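The first trap can be sketched as follows. This is a hedged, std-only illustration with hypothetical names (Event, pump, run_with_deadline): it collects output into buffers where the real agent would write frames to the connection, and on timeout it kills only the direct child with Child::kill, whereas the design above calls for setsid plus a process-group kill.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::sync::mpsc::{sync_channel, RecvTimeoutError, SyncSender};
use std::thread;
use std::time::{Duration, Instant};

/// What a reader thread hands to the main thread.
enum Event {
    Stdout(Vec<u8>),
    Stderr(Vec<u8>),
}

/// Pump one pipe into the bounded channel in chunks of up to 64 KiB.
/// `send` blocks while the channel is full, which provides the backpressure.
fn pump<R: Read + Send + 'static>(
    mut pipe: R,
    tx: SyncSender<Event>,
    wrap: fn(Vec<u8>) -> Event,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut buf = vec![0u8; 64 * 1024];
        loop {
            match pipe.read(&mut buf) {
                Ok(0) | Err(_) => break, // EOF or read error: stop pumping
                Ok(n) => {
                    if tx.send(wrap(buf[..n].to_vec())).is_err() {
                        break; // receiver dropped
                    }
                }
            }
        }
    })
}

/// Run `/bin/sh -c script` and return (stdout, stderr, timed_out).
fn run_with_deadline(
    script: &str,
    timeout: Duration,
) -> std::io::Result<(Vec<u8>, Vec<u8>, bool)> {
    let mut child = Command::new("/bin/sh")
        .arg("-c")
        .arg(script)
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    let (tx, rx) = sync_channel::<Event>(16);
    let out = pump(child.stdout.take().unwrap(), tx.clone(), Event::Stdout);
    // The last sender moves into the second pump, so the channel
    // disconnects once both pipes have reached EOF.
    let err = pump(child.stderr.take().unwrap(), tx, Event::Stderr);

    let deadline = Instant::now() + timeout;
    let (mut stdout, mut stderr, mut timed_out) = (Vec::new(), Vec::new(), false);
    loop {
        let left = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(left) {
            // Only this thread consumes events, so it is the sole "writer".
            Ok(Event::Stdout(b)) => stdout.extend_from_slice(&b),
            Ok(Event::Stderr(b)) => stderr.extend_from_slice(&b),
            Err(RecvTimeoutError::Disconnected) => break,
            Err(RecvTimeoutError::Timeout) => {
                timed_out = true;
                let _ = child.kill(); // sketch only: direct child, not the group
                break;
            }
        }
    }
    drop(rx); // unblocks a pump that is stuck in send()
    let _ = child.wait();
    let _ = out.join();
    let _ = err.join();
    Ok((stdout, stderr, timed_out))
}
```

Note that the loop treats EOF on both pipes as the completion signal, so a child that closes its output but keeps running would still block in wait(); a real implementation would need to cover that case under the same deadline.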
Anti-drift testing (preserve today's guarantee, stronger): today test_guest_agent.py imports the Python agent and drives it over a socketpair with the host helpers. Give the Rust agent a test-only transport mode (--listen unix:<path> / inherited fd) so the test spawns the real binary, connects an AF_UNIX socketpair, and runs the unchanged comm/protocol.py host functions against it — a cross-language conformance test against the shipping artifact, runnable on any CI box (no vhost_vsock needed). Real-vsock E2E stays a separate Linux-gated integration test. Optionally keep the Python agent as a reference oracle.
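One way the test-only transport could hang together, as a rough sketch: parse the --listen value into a transport choice and keep the connection handler generic over Read + Write so the vsock and Unix paths share it. The names Listen, parse_listen and handle_conn are hypothetical, treating an absent flag as "vsock port 1024" is an assumption, and the handler body is a one-byte echo placeholder rather than the real protocol.

```rust
use std::io::{Read, Write};
use std::path::PathBuf;

/// Where the agent listens. `Vsock` is the production path;
/// `Unix` is the test-only mode selected by `--listen unix:<path>`.
#[derive(Debug, PartialEq)]
enum Listen {
    Vsock { port: u32 },
    Unix(PathBuf),
}

/// Parse the optional value of `--listen`.
/// Assumption: no flag means the production vsock port 1024.
fn parse_listen(arg: Option<&str>) -> Result<Listen, String> {
    match arg {
        None => Ok(Listen::Vsock { port: 1024 }),
        Some(s) => match s.strip_prefix("unix:") {
            Some(p) if !p.is_empty() => Ok(Listen::Unix(PathBuf::from(p))),
            _ => Err(format!("unsupported --listen value: {s}")),
        },
    }
}

/// Placeholder handler, generic over the transport. It echoes one byte
/// purely so the sketch has observable behaviour.
fn handle_conn<S: Read + Write>(mut stream: S) -> std::io::Result<()> {
    let mut b = [0u8; 1];
    stream.read_exact(&mut b)?;
    stream.write_all(&b)
}
```

With the handler written this way, a conformance test can hand it one end of a Unix socket pair and drive the other end from the host helpers, without touching any vsock code.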
Build & bake: cross-compile x86_64-/aarch64-unknown-linux-musl (static; runs in both Alpine/musl and Debian/glibc guests). scripts/build-guest-agent.sh <arch> as the single source of truth for CI artifacts + local image builds (cached). Toolchain: cargo-zigbuild unless the published-images workflow already has a cross/manylinux pattern to match.
Open decisions
- vsock + serde_json (recommended) vs near-zero-dep (raw libc socket + micro JSON).
Acceptance
- … python3 added to minimal images for the agent; resident RSS down to ~1–2 MB/VM.
- … images/builder.py.
- … vhost_vsock) green.
Follow-up to #310.