Replace Python guest agent with a Rust static binary #311

@aniketmaurya

Background

PR #310 introduces the vsock host↔guest control channel with a Python stdlib guest agent (src/smolvm/guest_agent/agent.py). The Python agent was the right call to land the architecture, but it's the wrong long-term form for a lightweight-microVM platform.

The cost breakdown (what's real vs. not):

| Cost | Verdict |
| --- | --- |
| Per-command latency | Not the problem: the agent is a resident daemon, so requests don't pay interpreter startup; each run is a fork+exec of /bin/sh, same as SSH's exec. |
| Boot time | Minor, one-time: importing stdlib modules at /init is ~tens of ms. |
| Resident memory | Real: CPython holding socket/selectors/subprocess/... is ~10–20 MB RSS per VM. At 200+ VM density that's multiple GB doing nothing but listening. |
| Image size | Real: python3 (~50 MB on Alpine) was added to the minimal SSH images that didn't have it, contradicting the "boots in seconds, minimal footprint" goal. |

A static Rust binary fixes both real costs (resident memory and image size): ~1–3 MB binary, ~1–2 MB idle RSS, instant start, and no python3 in the image. Rust (not Go) because the repo already has a Cargo workspace (smolvm-core) + CI + a size-tuned [profile.release] (lto = "fat", strip = "symbols").

Scope

The wire protocol is the contract (src/smolvm/comm/protocol.py, PROTOCOL_VERSION = 1), so the blast radius is contained:

  • Unchanged: comm/protocol.py, VsockChannel (host), comm/select.py, CID allocation, QEMU device wiring, facade seam, all CLI.
  • Changes: guest_agent/agent.py → a Rust binary crate; images/builder.py COPYs the arch-matched static binary to /usr/local/bin/smolvm-guest-agent and drops the python3 install; /init execs it before sshd.

Design

Crate: new workspace member guest-agent/ (binary), not sharing smolvm-core's deps (no pyo3/tokio). Deps: vsock (rust-vmm — Vsock{Listener,Stream} implement std Read/Write), serde_json (+ serde), libc (setsid/killpg, already in tree).

Behavior parity (1:1 with the Python handlers):

  • Framing: >BI header, 16 MiB cap, 64 KiB chunks, frame tags 1–5.
  • run: login → $SHELL -lc (fallback /bin/sh), raw → /bin/sh -c; inherit env + overrides; setsid via pre_exec; stdin null; timeout → kill(-(pid), SIGKILL).
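  • put_file: O_EXCL tempfile in parent dir → drain exactly size bytes even if open failed → chmod → atomic rename. Use std::path::absolute as the starting point for os.path.abspath's lexical semantics; on Unix it keeps ".." components that abspath collapses, so add a lexical normalization pass if exact parity matters.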
  • get_file: stat → reject non-regular → report errors before the header frame → stream DATA + EOF.
  • Server: bind VMADDR_CID_ANY:1024, thread-per-conn bounded by a 64-permit semaphore.
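
To make the framing bullet concrete, here is a minimal std-only sketch of the >BI header (one tag byte followed by a big-endian u32 length) with the 16 MiB cap. The names write_frame and read_frame are hypothetical. Applying the cap to the payload length and rejecting tags outside 1–5 at this layer are assumptions; comm/protocol.py remains the authority on both.

```rust
use std::io::{self, Read, Write};

/// Size of the header described by Python's struct format ">BI":
/// 1 tag byte + 4 length bytes, big-endian.
const HEADER_LEN: usize = 5;
/// 16 MiB cap from the issue text (assumed here to bound the payload).
const MAX_FRAME: usize = 16 * 1024 * 1024;

fn too_large() -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, "frame exceeds 16 MiB cap")
}

/// Write one frame: header first, then the payload bytes.
fn write_frame<W: Write>(w: &mut W, tag: u8, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME {
        return Err(too_large());
    }
    let mut header = [0u8; HEADER_LEN];
    header[0] = tag;
    header[1..].copy_from_slice(&(payload.len() as u32).to_be_bytes());
    w.write_all(&header)?;
    w.write_all(payload)
}

/// Read one frame, validating the header before allocating the payload buffer.
fn read_frame<R: Read>(r: &mut R) -> io::Result<(u8, Vec<u8>)> {
    let mut header = [0u8; HEADER_LEN];
    r.read_exact(&mut header)?;
    let tag = header[0];
    let len = u32::from_be_bytes([header[1], header[2], header[3], header[4]]) as usize;
    // Assumption: tags outside 1..=5 are rejected at the framing layer.
    if !(1..=5).contains(&tag) {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "unknown frame tag"));
    }
    if len > MAX_FRAME {
        return Err(too_large());
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok((tag, payload))
}
```

Because both functions are generic over Read/Write, the same code would serve a vsock stream in production and an AF_UNIX stream or in-memory buffer in tests.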

The two Rust-specific traps to get right:

  1. Concurrent stdout/stderr + timeout without selectors/async: one reader thread per pipe → bounded sync_channel (backpressure); the main thread is the sole writer to the connection (no mutex needed, no stray frame after the terminator). recv_timeout drives the deadline.
  2. SIGPIPE: Rust installs SIG_IGN by default, so writes to a closed conn return EPIPE instead of killing the daemon — treat as peer-hangup. catch_unwind per connection; keep panic = unwind (a daemon must survive one bad request), not abort.
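
Trap 1 can be sketched with std alone: one reader thread per pipe, a bounded sync_channel for backpressure, and a single consuming thread driven by recv_timeout. This is an illustration under assumptions, not the agent's actual code. Stream, Outcome, spawn_reader and pump are hypothetical names, the channel depth of 8 is arbitrary, and the emit callback stands in for whatever writes frames to the connection. Killing the process group on TimedOut is left to the caller.

```rust
use std::io::{self, Read};
use std::sync::mpsc::{sync_channel, RecvTimeoutError, SyncSender};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug, PartialEq)]
enum Stream { Stdout, Stderr }

#[derive(Debug, PartialEq)]
enum Outcome { Eof, TimedOut }

/// Reader thread: forwards chunks of up to 64 KiB until EOF, a read error,
/// or the receiver going away. send() blocks when the channel is full,
/// which is the backpressure.
fn spawn_reader<R>(mut pipe: R, which: Stream, tx: SyncSender<(Stream, Vec<u8>)>)
where
    R: Read + Send + 'static,
{
    thread::spawn(move || {
        let mut buf = vec![0u8; 64 * 1024];
        loop {
            match pipe.read(&mut buf) {
                Ok(0) => break,
                Ok(n) => {
                    if tx.send((which, buf[..n].to_vec())).is_err() {
                        break; // receiver dropped, e.g. after a timeout
                    }
                }
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(_) => break,
            }
        }
    });
}

/// The calling thread is the only one that invokes `emit`, so output for
/// the connection is never interleaved and nothing is emitted after return.
fn pump<O, E, F>(stdout: O, stderr: E, timeout: Duration, mut emit: F) -> io::Result<Outcome>
where
    O: Read + Send + 'static,
    E: Read + Send + 'static,
    F: FnMut(Stream, &[u8]) -> io::Result<()>,
{
    let (tx, rx) = sync_channel(8);
    spawn_reader(stdout, Stream::Stdout, tx.clone());
    spawn_reader(stderr, Stream::Stderr, tx);
    let deadline = Instant::now() + timeout;
    loop {
        let now = Instant::now();
        if now >= deadline {
            return Ok(Outcome::TimedOut);
        }
        match rx.recv_timeout(deadline - now) {
            Ok((which, chunk)) => emit(which, &chunk)?,
            // Both reader threads finished and the queue is drained.
            Err(RecvTimeoutError::Disconnected) => return Ok(Outcome::Eof),
            Err(RecvTimeoutError::Timeout) => return Ok(Outcome::TimedOut),
        }
    }
}
```

Returning from pump drops the receiver, so a reader blocked in send() wakes with an error and exits; a reader blocked in read() exits once the pipe closes, which is what the group kill is for.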

Anti-drift testing (preserve today's guarantee, stronger): today test_guest_agent.py imports the Python agent and drives it over a socketpair with the host helpers. Give the Rust agent a test-only transport mode (--listen unix:<path> / inherited fd) so the test spawns the real binary, connects an AF_UNIX socketpair, and runs the unchanged comm/protocol.py host functions against it — a cross-language conformance test against the shipping artifact, runnable on any CI box (no vhost_vsock needed). Real-vsock E2E stays a separate Linux-gated integration test. Optionally keep the Python agent as a reference oracle.
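
One possible shape for the test-only transport switch, as a std-only sketch. The --listen unix:<path> form is the one proposed above; ListenSpec, parse_listen and bind_unix are hypothetical names, and defaulting to vsock port 1024 when the flag is absent is an assumption taken from the Server bullet. The vsock arm is only a marker here, since binding it would need the vsock crate.

```rust
use std::fs;
use std::io;
use std::os::unix::net::UnixListener;
use std::path::{Path, PathBuf};

#[derive(Debug, PartialEq)]
enum ListenSpec {
    /// Production default: listen on vsock (CID any) at this port.
    Vsock { port: u32 },
    /// Test-only: listen on an AF_UNIX socket at this path.
    Unix(PathBuf),
}

/// Parse the optional value of a hypothetical `--listen` flag.
fn parse_listen(arg: Option<&str>) -> Result<ListenSpec, String> {
    match arg {
        None => Ok(ListenSpec::Vsock { port: 1024 }),
        Some(s) => match s.strip_prefix("unix:") {
            Some(p) if !p.is_empty() => Ok(ListenSpec::Unix(PathBuf::from(p))),
            _ => Err(format!("unsupported --listen value: {s}")),
        },
    }
}

/// Bind the test socket, removing a stale socket file left by an earlier run.
fn bind_unix(path: &Path) -> io::Result<UnixListener> {
    match fs::remove_file(path) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    UnixListener::bind(path)
}
```

If the connection handlers are written against generic Read + Write streams, everything after accept() can be shared between the two transports, which is what makes the conformance test exercise the shipping code paths.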

Build & bake: cross-compile x86_64-/aarch64-unknown-linux-musl (static; runs in both Alpine/musl and Debian/glibc guests). scripts/build-guest-agent.sh <arch> as the single source of truth for CI artifacts + local image builds (cached). Toolchain: cargo-zigbuild unless the published-images workflow already has a cross/manylinux pattern to match.

Open decisions

  • Dependency minimalism: vsock + serde_json (recommended) vs near-zero-dep (raw libc socket + micro JSON).
  • Cross-compile toolchain: match the existing published-images CI if it already cross-builds.
  • Keep the Python agent as a test oracle, or retire it entirely.

Acceptance

  • No python3 added to minimal images for the agent; agent idle RSS down to ~1–2 MB/VM.
  • Cross-language conformance test against the real binary passes.
  • Static musl binaries for amd64 + arm64 produced in CI and baked by images/builder.py.
  • Real-vsock E2E (Linux + vhost_vsock) green.

Follow-up to #310.
