feat(runtime): lazy-install matching ABI runtime tree for cross-Python roar run by christophergeyer · Pull Request #108 · treqs/roar

christophergeyer · 2026-05-15T19:02:30Z

Summary

Stacked on #107. Closes the loop on the cross-Python roar run story: when the traced Python's ABI doesn't match roar-cli's bundled deps, lazy-install a matching runtime tree at `~/.cache/roar/runtime/<tag>/` and prepend it to `ROAR_RUNTIME_PYTHONPATH`. The traced process picks up ABI-correct `pydantic_core` / `blake3` from the cache, and the sitecustomize gate (whose detection is also refactored here) sees a matching SO on `sys.path` and lets backend dispatch proceed.

Trigger: `roar run`, not `roar init`. Init-time prefetch was tempting but the user might never invoke Python; `roar run` is the first moment a concrete target exists.

User-visible behavior

Cache miss — one `🦖 installing roar runtime for cp312 …` line on stderr, then the run proceeds.
Cache hit — silent. Stamp checks pin the cache to the installed roar version, so a roar upgrade invalidates.
`ROAR_RUNTIME_INSTALL=skip` (or `roar config set runtime.install skip`) — never lazy-install. Bundled-only. Useful in restricted-network containers; the docs companion PR (`treqs-inc/glaas-site#79`) recommends `pip install roar-cli` as the cleaner pattern for that case.
Install failure (no network, no `uv`/`pip` available, etc.) — falls through to the gate from #107 (fix(inject): gate backend dispatch when traced Python ABI != bundled). Stderr shows the gate's actionable message; file I/O is still captured.
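A minimal sketch of how the mode resolution and the stamp check described above could fit together. The function names, the `.roar-version` stamp filename, and the handling of unrecognized mode values are assumptions for illustration; they are not taken from `lazy_install.py`.

```python
from __future__ import annotations

import os
from pathlib import Path

STAMP_NAME = ".roar-version"  # hypothetical stamp filename, not taken from the PR


def install_mode(config_value: str | None = None) -> str:
    """Resolve the install mode: env var beats config, default is 'auto'."""
    raw = os.environ.get("ROAR_RUNTIME_INSTALL") or config_value or "auto"
    mode = raw.strip().lower()
    # Assumption: unrecognized values fall back to the default.
    return mode if mode in {"auto", "skip"} else "auto"


def cache_dir(tag: str) -> Path:
    """Per-ABI cache directory, honoring XDG_CACHE_HOME with a ~/.cache fallback."""
    xdg = os.environ.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "roar" / "runtime" / tag


def cache_is_fresh(tree: Path, roar_version: str) -> bool:
    """A cached tree counts as a hit only if its stamp matches the installed roar."""
    try:
        stamp = (tree / STAMP_NAME).read_text(encoding="utf-8").strip()
    except OSError:
        return False  # missing or unreadable stamp means cache miss
    return stamp == roar_version
```

Under these assumptions, a roar upgrade changes `roar_version`, the stamp comparison fails, and the next `roar run` reinstalls the tree.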

What changed

`roar/execution/runtime/abi_probe.py` (new) — subprocess-probes the target Python's `sys.implementation.cache_tag`. Bails for non-Python targets (`bash`, `make`, etc.) so unrelated `roar run` invocations don't pay the probe cost.
`roar/execution/runtime/lazy_install.py` (new) — XDG-respecting cache, atomic install via `uv pip install --target --python` (plain `pip` fallback), roar-version-stamped invalidation. `ensure_runtime(...)` is the orchestrator that callers (currently just the tracer launcher) use.
`runtime.install` config key — `auto` (default) | `skip`. Env `ROAR_RUNTIME_INSTALL` wins over config.
`TracerService._lazy_install_runtime_entries(...)` — calls into the probe + ensure_runtime right before setting `ROAR_RUNTIME_PYTHONPATH`.
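The probe step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `abi_probe.py`: the helper names, the basename heuristic for spotting Python targets, and the timeout value are all hypothetical.

```python
from __future__ import annotations

import os
import subprocess
import sys


def looks_like_python(executable: str) -> bool:
    """Hypothetical heuristic: python, python3, python3.12, python.exe, ..."""
    name = os.path.basename(executable).lower()
    return name.startswith("python")


def probe_cache_tag(executable: str, timeout: float = 10.0) -> str | None:
    """Return the target's sys.implementation.cache_tag, or None on any problem."""
    if not looks_like_python(executable):
        return None  # bash, make, etc.: skip the subprocess entirely
    try:
        result = subprocess.run(
            [executable, "-c", "import sys; print(sys.implementation.cache_tag)"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except Exception:
        # Contract is "tag or None": any launch failure degrades to None.
        return None
    if result.returncode != 0:
        return None
    tag = result.stdout.strip()  # e.g. "cpython-312"
    return tag or None
```

Returning `None` for every failure mode keeps the caller simple: it either has a tag to install for, or it falls through to the gate.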

Gate refactor (#107 area)

The original ABI-tag check in `sitecustomize.py` parsed roar's bundled `.so` filenames. That's correct for the bundled-only case but blind to a lazy-installed runtime tree on the same `sys.path`. Replaced with `matching_compiled_pydantic_core(sys.path, expected_soabi)` — walks sys.path for a `pydantic_core/*.so` whose filename matches the running interpreter's SOABI. Composes naturally with lazy-install: a matching SO anywhere satisfies the gate.

The old `bundled_abi_tag` + `abi_minor_version` helpers are retained (still tested) for future use and to keep the `support.py` surface stable.
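For illustration, a path walk of the kind described above might look like the sketch below. The function name carries a `_sketch` suffix to mark it as a stand-in; the real `matching_compiled_pydantic_core` may match filenames differently. The filename parsing assumes the usual `name.<SOABI>.so` layout of CPython extension modules.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable


def matching_pydantic_core_sketch(paths: Iterable[str], expected_soabi: str) -> bool:
    """True if any path entry holds a pydantic_core/*.so built for expected_soabi."""
    if not expected_soabi:
        return False
    for entry in paths:
        if not entry:
            continue  # blank sys.path entries carry no package directory
        pkg = Path(entry) / "pydantic_core"
        if not pkg.is_dir():
            continue
        for so in pkg.glob("*.so"):
            # e.g. _pydantic_core.cpython-312-x86_64-linux-gnu.so
            parts = so.name.split(".")
            if len(parts) >= 3 and parts[-2] == expected_soabi:
                return True
    return False
```

On POSIX builds of CPython the expected value can be read with `sysconfig.get_config_var("SOABI")`, so a call could look like `matching_pydantic_core_sketch(sys.path, sysconfig.get_config_var("SOABI") or "")`.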

Test plan

`test_abi_probe.py` — success / non-Python target / subprocess failure / blank stdout / versioned python names.
`test_lazy_install.py` — cache layout (XDG + home fallback), stamp invalidation, atomic install with mocked subprocess (real tempdir/rename), failure paths, mode resolution (env / config / default / case), and the `ensure_runtime` decision tree (match / skip / cache-hit / cache-miss-install / install-fail).
`test_inject_support.py` — `matching_compiled_pydantic_core` (bundled / lazy runtime / mismatch / missing pkg / wrong extension / blank entries).
Full unit suite: 968 passed, 1 skipped (pre-existing).
`ruff check` + `ruff format --check` clean.

Out of scope

Hidden `python -m roar.runtime install --python …` entrypoint for power-user pre-warming. Easy follow-up if needed.
Cache management UX (`list`, `remove`). Same — follow-up.
Docs page update for `/docs/installation` describing the lazy-install behavior. Will follow on glaas-site.

Merge sequencing

Stacked on `cg/abi-gate-backend-init` (PR #107). When #107 lands on `main`, this branch rebases cleanly. The diff against #107's branch is the meat of this PR; the diff against main shows both PRs' changes together.

🤖 Generated with Claude Code

…n roar run

Solves the cross-Python first-run experience: `uv tool install roar-cli` installs roar under one CPython (typically uv's default), but the user runs `roar run python3 …` against a different one. Without a matching ABI tree, backend dispatch loads roar's bundled (wrong-ABI) compiled deps and crashes inside pydantic_core. The previous PR (#107) gates that crash gracefully. This one closes the loop: probe the target Python's ABI before launching, install a matching runtime tree at `~/.cache/roar/runtime/<tag>/` on the fly (or use the existing cache), and prepend it to ROAR_RUNTIME_PYTHONPATH so the traced process picks up ABI-correct compiled deps.

What's added

- `roar/execution/runtime/abi_probe.py` — `probe_python_abi(executable)` runs the target Python in a one-shot subprocess to read its `sys.implementation.cache_tag`. Bails fast for non-Python targets (bash/make/etc.) so `roar run cmd` doesn't pay a probe cost for things it couldn't lazy-install for anyway.
- `roar/execution/runtime/lazy_install.py` — XDG-respecting cache (`$XDG_CACHE_HOME/roar/runtime/<tag>/` or `~/.cache/...`), atomic install via `uv pip install --target --python` (with plain pip fallback), roar-version-stamped cache invalidation, and the `ensure_runtime(...)` orchestrator. Failures return None — the sitecustomize gate handles fallback.
- `runtime.install` config key — `auto` (default, lazy install on mismatch) or `skip` (use bundled only; backend dispatch off on mismatch). Env `ROAR_RUNTIME_INSTALL` overrides. The `skip` mode covers restricted-network containers where lazy-install would fail anyway.
- `TracerService._lazy_install_runtime_entries(command, roar_dir)` — hooks the probe + ensure_runtime into `execute()` right before `ROAR_RUNTIME_PYTHONPATH` is set.

Gate refactor (touches #107 territory)

The original ABI-tag check (`bundled_abi_tag` + `abi_minor_version`) parsed roar's bundled `.so` filenames. That works for the bundled-only case but is blind to a lazy-installed runtime tree on the same path. Replaced with `matching_compiled_pydantic_core(sys.path, expected_soabi)` — walks sys.path for a pydantic_core SO whose filename matches the running interpreter's SOABI. Composes naturally with the lazy-install path (matching SO anywhere on sys.path satisfies the gate). The old helpers are kept (still tested) for future use and to avoid churning the API surface of `support.py`.

User-visible

- Lazy-install path emits a single 🦖 line on cache miss:
    🦖 installing roar runtime for cp312 ...
- Cache hits are silent.
- Skip mode + ABI mismatch falls through to the gate's actionable message, which now also recommends `pip install roar-cli` for single-Python container environments.

Test coverage

- `test_abi_probe.py`: success / non-Python target / subprocess failure / blank stdout / versioned python names.
- `test_lazy_install.py`: cache layout, stamp invalidation, atomic install (mocked subprocess, real tempdir/rename), failure paths, mode resolution (env / config / default / case normalization), and the `ensure_runtime` decision tree (match / skip / cache-hit / cache-miss-install / install-fail).
- `test_inject_support.py`: `matching_compiled_pydantic_core` finds in bundled, finds in lazy runtime, returns False on mismatch / missing pkg / wrong extension / blank entries.

968 unit tests passing, 1 pre-existing skipped.

Stacked on #107 (the gate). When #107 merges, this branch rebases cleanly onto main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Friction-journal feedback on #108: after the cross-Python fixes, `roar show @N` reports `Environment: ... Python 3.13.x` for jobs whose command was system `python3` (3.12). The Python identity in job metadata was being captured from `platform.python_version()` in roar's own host process (`runtime_collector.collect`), not from the traced child. Right at the seam roar is selling reproducibility on.

Captures `python_version` + `python_implementation` from `platform.{python_version,python_implementation}()` *inside* the traced process in `tracker.write_log`, threads them through the `PythonInjectData` model, and has `runtime_collector` prefer the traced values (falling back to host values when the inject log is missing the fields, e.g. older logs).

Also: stderr remediation in `sitecustomize.py` now suggests `uv tool install --python pythonX.Y roar-cli --reinstall` instead of `--force`. Same effect, accurate user-facing semantics, less alarming.

Tests:

- `test_runtime_tracker_writes_expected_log_payload` asserts `python_version` (with at least 2 dots, e.g. "3.12.3") and a non-empty `python_implementation`.
- 968 unit tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

christophergeyer · 2026-05-15T19:14:38Z

Both review points addressed in the latest push (74958a1):

1. `roar show @N` now reflects the traced child's Python, not roar's host

You were right that the Environment block was misrepresenting which Python ran the user's code — exactly the wrong place for that to drift. Root cause: `runtime_collector.collect()` was calling `platform.python_version()` in roar's own process, then writing the result onto the job's metadata regardless of what the traced child actually was.

Fix:

`tracker.write_log()` now writes `python_version` + `python_implementation` into the inject log, captured from inside the traced child via `platform.python_version()` / `platform.python_implementation()`.
`PythonInjectData` model gains both fields (defaults `""` for backward-compat with older saved logs).
`runtime_collector.collect()` prefers `python_data.python_version` / `python_data.python_implementation` and falls back to host `platform.*` only when the inject log is empty/silent on them.

After this change a `roar run` invoked under system `python3` (3.12) records Python 3.12.x even when roar-cli itself was installed under 3.13.
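The prefer-traced, fall-back-to-host rule can be sketched as below. The dataclass is a simplified stand-in for `PythonInjectData` (the real model's base class and other fields are not shown in this PR text), and `resolve_python_identity` is a hypothetical helper name.

```python
from __future__ import annotations

import platform
from dataclasses import dataclass


@dataclass
class InjectDataSketch:
    """Simplified stand-in for PythonInjectData; empty strings mean 'not recorded'."""

    python_version: str = ""
    python_implementation: str = ""


def resolve_python_identity(data: InjectDataSketch | None) -> tuple[str, str]:
    """Prefer values captured in the traced child; fall back to the host process."""
    traced_version = data.python_version if data else ""
    traced_impl = data.python_implementation if data else ""
    # Empty strings are falsy, so older logs without the keys take the host values.
    version = traced_version or platform.python_version()
    implementation = traced_impl or platform.python_implementation()
    return version, implementation
```

With this shape, a log from a 3.12 child yields `("3.12.3", "CPython")` even when the host process runs 3.13, while a log that lacks the fields still produces a usable (host-derived) answer.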

2. `--force` → `--reinstall` in the remediation message

Fixed. The gate's stderr now reads:

- Reinstall roar-cli under matching Python:
    uv tool install --python python3.12 roar-cli --reinstall

Tests updated: the existing tracker test now asserts the two new payload keys. 968 unit tests passing.

…system

Friction-journal reported two bugs blocking the lazy-install path:

1. `from roar import __version__` failed because `roar/__init__.py` was empty. The tracer launcher's `try/except Exception` swallowed it as a debug log, so the entire `_lazy_install_runtime_entries` path returned [] silently — no 🦖 line, fall through to the gate. Dead code on every install.

   Adds canonical `__version__` to `roar/__init__.py` via `importlib.metadata.version("roar-cli")` with a sentinel fallback when metadata is absent. Narrows the launcher's except to ImportError and bumps the log level to warning — an ImportError on these internal modules is a contract violation, not a user environment issue, and shouldn't hide.

2. `sitecustomize._append_roar_runtime_pythonpath` *appended* ROAR_RUNTIME_PYTHONPATH entries to sys.path, so the lazy-installed cache landed at the END. System site-packages (with stale typing_extensions / pydantic) won. pydantic_core loaded from the cache but its `from typing_extensions import Sentinel` resolved to the system's older copy and crashed exactly like before.

   Renamed to `_prepend_roar_runtime_pythonpath`. Uses `sys.path[:0] = new_paths` to prepend the whole list in declared order, so the cache dir lands at `sys.path[0]`. The original "let the workload's own venv keep precedence" intent — preserved by appending — only mattered when the user already had matching deps. In the lazy-install scenario the gate didn't trip, which means the user's env didn't have matching deps; roar's ABI-matched copy is the better answer.

Also addresses the user's diagnostic comment: `probe_python_abi`'s narrow `except (OSError, SubprocessError)` let a test-mocked subprocess.Popen leak a ValueError out into the launcher. Broadened to `except Exception` with a comment — the probe's contract is "tag or None", weird mock returns degrade to None.

🦖 message gains a one-time/cached tail so a slow network doesn't read as a hang:

    🦖 installing roar runtime for cp312 ... (one-time per Python; cached)

Tests:

- `tests/unit/test_roar_version.py` — `from roar import __version__` must resolve to a non-empty string. Catches the regression at the package level so it can't slip through.
- `tests/execution/runtime/test_sitecustomize_path_order.py` — subprocess test: when roar isn't already importable, the entries from ROAR_RUNTIME_PYTHONPATH land at the front of sys.path in declared order. When roar IS already importable (the merged case), the function early-returns and leaves sys.path alone.

971 unit tests passing, 1 pre-existing skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

christophergeyer · 2026-05-15T19:25:48Z

Both lazy-install blockers fixed in c5a1d27 (just pushed).

1. `roar.__version__` exists now

`roar/__init__.py` declares the canonical `__version__` via `importlib.metadata.version("roar-cli")`. The tracer launcher's `try/except Exception` is narrowed to `ImportError` and bumped to warning log level — an import failure here is an internal contract violation, not a user-env thing, and shouldn't hide.

The regression is now caught by a new package-level test that runs `from roar import __version__` and asserts the result is a non-empty string.

2. Cache now beats system site-packages

The injection function (now correctly named `_prepend_roar_runtime_pythonpath`) does `sys.path[:0] = new_paths` instead of `sys.path.append`. The cache dir lands at `sys.path[0]`, so the ABI-matched typing_extensions / pydantic_core win over the system's stale copies.

Per your reasoning: the prepend "wins" only when the user's env was already mismatched (else the gate would have passed and lazy-install wouldn't fire) — and in that case roar's copy is the right answer. Documented in the docstring.

Subprocess test added that exercises both the prepend codepath (entries land at sys.path[0..N] in declared order) and the early-return path (when roar is already importable, no-op).
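The ordering difference between appending and slice-assigning at the front can be seen in a small sketch. `prepend_runtime_paths` is a hypothetical helper that operates on any list, and it does not model the early-return taken when roar is already importable.

```python
from __future__ import annotations

import os


def prepend_runtime_paths(path_list: list[str], raw: str) -> list[str]:
    """Insert the os.pathsep-separated entries of raw at the front, in declared order."""
    new_paths: list[str] = []
    for entry in raw.split(os.pathsep):
        # Skip blanks and entries that are already present.
        if entry and entry not in path_list and entry not in new_paths:
            new_paths.append(entry)
    # Slice assignment keeps declared order; repeated insert(0, ...) would reverse it.
    path_list[:0] = new_paths
    return new_paths
```

A call such as `prepend_runtime_paths(sys.path, os.environ.get("ROAR_RUNTIME_PYTHONPATH", ""))` would put the first declared entry at `sys.path[0]`, ahead of system site-packages.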

Bonus fixes from your report

`probe_python_abi` had `except (OSError, SubprocessError)` — a test-mocked `subprocess.Popen` could leak a `ValueError` out. Broadened to `except Exception` with a comment that the probe's contract is "tag or None."

🦖 message gained a tail so a slow network doesn't read as a hang:

🦖 installing roar runtime for cp312 ... (one-time per Python; cached)

971 unit tests, 1 pre-existing skipped. Ruff clean.

Note on your monkey-patched `__init__.py` on the test machine: once this lands and you `uv tool install --reinstall …`, the canonical version will be in place.

…seam

Friction-journal caught the missing reader-side half of 74958a1: the tracker writes `python_version` and `python_implementation` into the inject JSON, but `DataLoaderService.load_python_data` was constructing `PythonInjectData` without extracting those keys. They defaulted to "" and `runtime_collector` then fell back to `platform.python_version()` — host Python, not the traced child's. So `roar show @N` still reported the host's 3.13.12 for a 3.12 traced run.

Two missing lines in data_loader.py plus tests at the seam:

- `test_python_identity_keys_flow_through` — explicit reader-side test: python_version + python_implementation make it from a JSON payload into the model.
- `test_python_identity_keys_default_to_empty_for_older_logs` — back-compat: older inject logs without the keys still load cleanly.
- `test_writer_reader_roundtrip_carries_python_identity` — the reviewer's specifically-requested coverage: the real tracker writes, the real loader reads, the version comes out non-empty. Would have caught both halves of the original asymmetry — if either side regresses, this test fails.

974 unit tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

christophergeyer · 2026-05-15T19:34:03Z

Reviewer's catch addressed in `b1cd824` (just pushed).

You were right — the writer/reader asymmetry was the textbook missing half. `tracker.write_log` was writing `python_version` / `python_implementation` into the inject JSON, but `DataLoaderService.load_python_data` wasn't extracting those keys when constructing `PythonInjectData`. They defaulted to `""`, so `runtime_collector` hit the `or platform.python_version()` fallback and stamped roar's host Python on the job. Two missing lines.

Tests added at the seam — the third one is the round-trip you specifically asked for:

`test_python_identity_keys_flow_through` — reader-side: JSON in, model out, keys present.
`test_python_identity_keys_default_to_empty_for_older_logs` — back-compat: older inject logs without the keys load cleanly.
`test_writer_reader_roundtrip_carries_python_identity` — real `RuntimeInjectionTracker.write_log` writes, real `DataLoaderService.load_python_data` reads, asserts `python_version` survives. Would have caught both halves; if either side regresses, this test fails.

An in-process test can only round-trip with the test runner's own interpreter, so it asserts the seam (non-empty, dot-separated version) rather than a specific value. The end-to-end "actual mismatched Python" test you described is still the strongest version. It is worth doing as an integration test, but it needs a second CPython on PATH in CI, so it is a follow-up for after the merge.
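The shape of such a writer/reader round-trip test can be sketched with stand-ins. Everything here is hypothetical scaffolding: `write_log_sketch` and `load_python_data_sketch` imitate the roles of `RuntimeInjectionTracker.write_log` and `DataLoaderService.load_python_data`, whose real signatures are not shown in this thread.

```python
from __future__ import annotations

import json
import platform
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class InjectDataSketch:  # stand-in for PythonInjectData
    python_version: str = ""
    python_implementation: str = ""


def write_log_sketch(path: Path) -> None:
    """Writer side: record the identity of the interpreter running this code."""
    payload = {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
    }
    path.write_text(json.dumps(payload), encoding="utf-8")


def load_python_data_sketch(path: Path) -> InjectDataSketch:
    """Reader side: extract both keys, defaulting to '' for older logs."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    return InjectDataSketch(
        python_version=raw.get("python_version", ""),
        python_implementation=raw.get("python_implementation", ""),
    )


def test_writer_reader_roundtrip() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "inject.json"
        write_log_sketch(log)
        data = load_python_data_sketch(log)
    # Assert the seam, not a specific value: the runner's interpreter varies.
    assert data.python_version.count(".") >= 2
    assert data.python_implementation
```

If either side drops a key, the loaded field comes back empty and the assertions fail, which is the asymmetry the round-trip is meant to catch.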

974 unit tests, 1 skipped, ruff clean.

CI's repo-wide `ruff format --check` caught a few black-style line wraps in the lazy-install test file that my targeted format runs earlier in this branch missed. Pure formatting; no behavioral change.

…install

`runtime_install_mode(start_dir: Path | None = None)` was passing a `Path` through to `config_get`, which is typed as `str | None`. mypy caught it in CI. One-line coercion at the call site. Full-repo mypy (361 source files) clean.

…able

CI's test-macos (macos-15-intel, x86_64) saw two integration-test timeouts on this branch — `test_dag_deep_nesting` and `test_register_private_without_project_binding_uses_current_user_scope`. Both run several `roar run` invocations in a single test and ride the 60s pytest-timeout. macOS arm64 and Linux all passed; the smoking gun is Intel macOS framework Python's notoriously slow startup combined with the probe subprocess this branch adds to every `roar run`.

The modal case for `roar run python foo.py` is that the target Python is exactly the one roar-cli itself runs under (especially in tests where the fixture provides `python_exe` = the test runner's interpreter). In that case the ABI matches by construction and the probe subprocess is pure overhead — N python startups per N-step test.

Adds a fast path that resolves both `command[0]` and `sys.executable` through `shutil.which` + `os.path.realpath` and returns early on match. Probe still runs when the target genuinely differs (the cross-Python case lazy-install was built for).

Verified locally:

- ruff check + ruff format --check (repo-wide)
- mypy roar (361 source files)
- full unit + execution test sweep (974 passed, 1 skipped)
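A rough sketch of the same-interpreter fast path, using only `shutil.which` and `os.path.realpath` as the commit message describes. The helper names are hypothetical and the real check may handle more edge cases.

```python
from __future__ import annotations

import os
import shutil
import sys


def resolve_executable(name: str) -> str | None:
    """Resolve a command name or path to a canonical file path, or None if not found."""
    found = shutil.which(name)
    return os.path.realpath(found) if found else None


def targets_own_interpreter(command: list[str]) -> bool:
    """True when command[0] resolves to the interpreter running this process."""
    if not command:
        return False
    target = resolve_executable(command[0])
    own = resolve_executable(sys.executable)
    # Same resolved file means same ABI by construction, so the probe can be skipped.
    return target is not None and target == own
```

When this returns `True` the caller can return early without spawning the probe subprocess; any other result falls through to the normal probe.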

chrisgeyertreqs and others added 2 commits May 15, 2026 19:02

chrisgeyertreqs added 3 commits May 15, 2026 19:35

style: apply ruff format to test_lazy_install.py

22a035f

christophergeyer wants to merge 7 commits into `cg/abi-gate-backend-init` from `cg/lazy-install-runtime`
