feat(runtime): lazy-install matching ABI runtime tree for cross-Python roar run#108

Open
christophergeyer wants to merge 7 commits into cg/abi-gate-backend-init from cg/lazy-install-runtime
@christophergeyer
Member

Summary

Stacked on #107. Closes the loop on the cross-Python `roar run` story: when the traced Python's ABI doesn't match roar-cli's bundled deps, lazy-install a matching runtime tree at `~/.cache/roar/runtime/<tag>/` and prepend it to `ROAR_RUNTIME_PYTHONPATH`. The traced process picks up ABI-correct `pydantic_core` / `blake3` from the cache, and the sitecustomize gate (whose detection is also refactored here) sees a matching SO on `sys.path` and lets backend dispatch proceed.

Trigger: `roar run`, not `roar init`. Init-time prefetch was tempting but the user might never invoke Python; `roar run` is the first moment a concrete target exists.

User-visible behavior

  • Cache miss — one `🦖 installing roar runtime for cp312 …` line on stderr, then the run proceeds.
  • Cache hit — silent. Stamp checks pin the cache to the installed roar version, so a roar upgrade invalidates.
  • `ROAR_RUNTIME_INSTALL=skip` (or `roar config set runtime.install skip`) — never lazy-install. Bundled-only. Useful in restricted-network containers; the docs companion PR (`treqs-inc/glaas-site#79`) recommends `pip install roar-cli` as the cleaner pattern for that case.
  • Install failure (no network, no `uv`/`pip` available, etc.) — falls through to the gate from #107 (fix(inject): gate backend dispatch when traced Python ABI != bundled). Stderr shows the gate's actionable message; file I/O is still captured.
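The cache-hit / cache-miss split above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the stamp filename `.roar-version` and the helper names `runtime_cache_dir`, `cache_is_fresh`, and `write_stamp` are assumptions made for the example.

```python
import os
from pathlib import Path

STAMP_NAME = ".roar-version"  # hypothetical stamp filename, not taken from the PR


def runtime_cache_dir(tag: str, env=None) -> Path:
    """Return <cache root>/roar/runtime/<tag>, honoring XDG_CACHE_HOME when set."""
    env = os.environ if env is None else env
    xdg = env.get("XDG_CACHE_HOME")
    root = Path(xdg) if xdg else Path.home() / ".cache"
    return root / "roar" / "runtime" / tag


def cache_is_fresh(cache_dir: Path, roar_version: str) -> bool:
    """A cache hit requires a stamp that matches the installed roar version."""
    try:
        return (cache_dir / STAMP_NAME).read_text().strip() == roar_version
    except OSError:
        # Missing directory or missing stamp both count as a cache miss.
        return False


def write_stamp(cache_dir: Path, roar_version: str) -> None:
    """Record which roar version populated this cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / STAMP_NAME).write_text(roar_version + "\n")
```

Under this sketch, upgrading roar changes the version string, so the stamp comparison fails and the next `roar run` treats the directory as a miss.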

What changed

  • `roar/execution/runtime/abi_probe.py` (new) — subprocess-probes the target Python's `sys.implementation.cache_tag`. Bails for non-Python targets (`bash`, `make`, etc.) so unrelated `roar run` invocations don't pay the probe cost.
  • `roar/execution/runtime/lazy_install.py` (new) — XDG-respecting cache, atomic install via `uv pip install --target --python` (plain `pip` fallback), roar-version-stamped invalidation. `ensure_runtime(...)` is the orchestrator that callers (currently just the tracer launcher) use.
  • `runtime.install` config key — `auto` (default) | `skip`. Env `ROAR_RUNTIME_INSTALL` wins over config.
  • `TracerService._lazy_install_runtime_entries(...)` — calls into the probe + ensure_runtime right before setting `ROAR_RUNTIME_PYTHONPATH`.
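A probe along the lines described for `abi_probe.py` might look like the sketch below. The function names `looks_like_python` and `probe_cache_tag`, the filename regex, and the 10-second timeout are illustrative assumptions; only the idea (skip non-Python targets, read `sys.implementation.cache_tag` in a one-shot subprocess, return a tag or `None`) comes from the description.

```python
import os
import re
import subprocess

# Matches python, python3, python3.12, pypy3, and .exe variants (assumed heuristic).
_PYTHON_NAME = re.compile(r"^(python|pypy)(\d+(\.\d+)?)?(\.exe)?$", re.IGNORECASE)


def looks_like_python(executable: str) -> bool:
    """Cheap filename check so bash/make/etc. never pay for a subprocess."""
    return bool(_PYTHON_NAME.match(os.path.basename(executable)))


def probe_cache_tag(executable: str, timeout: float = 10.0):
    """Return the target interpreter's cache tag, or None on any failure."""
    if not looks_like_python(executable):
        return None
    code = "import sys; print(sys.implementation.cache_tag or '')"
    try:
        result = subprocess.run(
            [executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except Exception:
        # Contract is "tag or None": missing binary, timeout, non-zero exit
        # and anything unexpected all degrade to None.
        return None
    return result.stdout.strip() or None
```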

Gate refactor (#107 area)

The original ABI-tag check in `sitecustomize.py` parsed roar's bundled `.so` filenames. That's correct for the bundled-only case but blind to a lazy-installed runtime tree on the same `sys.path`. Replaced with `matching_compiled_pydantic_core(sys.path, expected_soabi)` — walks sys.path for a `pydantic_core/*.so` whose filename matches the running interpreter's SOABI. Composes naturally with lazy-install: a matching SO anywhere satisfies the gate.

The old `bundled_abi_tag` + `abi_minor_version` helpers are retained (still tested) for future use and to keep the `support.py` surface stable.
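As a rough sketch of the refactored gate check: walk the path entries and look for a `pydantic_core` extension module whose filename ends in the interpreter's SOABI. The name `has_matching_pydantic_core` is a stand-in chosen for this example, and the exact matching rule in `support.py` may differ.

```python
import os


def has_matching_pydantic_core(paths, expected_soabi: str) -> bool:
    """True if any path entry holds pydantic_core/*.<SOABI>.so for this ABI."""
    if not expected_soabi:
        return False
    suffix = f".{expected_soabi}.so"
    for entry in paths:
        if not entry:
            continue  # blank sys.path entries are skipped
        pkg = os.path.join(entry, "pydantic_core")
        try:
            names = os.listdir(pkg)
        except OSError:
            continue  # package missing or unreadable on this entry
        if any(name.endswith(suffix) for name in names):
            return True
    return False
```

On POSIX builds of CPython the expected value can be read with `sysconfig.get_config_var("SOABI")`, for example `cpython-312-x86_64-linux-gnu`; the call would then be `has_matching_pydantic_core(sys.path, soabi)`.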

Test plan

  • `test_abi_probe.py` — success / non-Python target / subprocess failure / blank stdout / versioned python names.
  • `test_lazy_install.py` — cache layout (XDG + home fallback), stamp invalidation, atomic install with mocked subprocess (real tempdir/rename), failure paths, mode resolution (env / config / default / case), and the `ensure_runtime` decision tree (match / skip / cache-hit / cache-miss-install / install-fail).
  • `test_inject_support.py` — `matching_compiled_pydantic_core` (bundled / lazy runtime / mismatch / missing pkg / wrong extension / blank entries).
  • Full unit suite: 968 passed, 1 skipped (pre-existing).
  • `ruff check` + `ruff format --check` clean.
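The "atomic install with mocked subprocess (real tempdir/rename)" item refers to a stage-then-rename pattern. A minimal sketch of that pattern, with the installer injected as a callable so a test can replace the real `uv`/`pip` subprocess, could look like this. The name `atomic_install` and its signature are assumptions for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def atomic_install(final_dir: Path, installer: Callable[[Path], None]) -> bool:
    """Populate a staging dir next to final_dir, then rename it into place."""
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    # Staging lives in the same parent so the rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=final_dir.name + ".", dir=final_dir.parent))
    try:
        installer(staging)  # e.g. runs the package installer with --target staging
        if final_dir.exists():
            shutil.rmtree(final_dir)  # stale tree from an older roar version
        os.replace(staging, final_dir)
        return True
    except Exception:
        # A failed install never leaves a half-populated cache directory behind.
        shutil.rmtree(staging, ignore_errors=True)
        return False
```

Note that replacing an existing tree is a remove followed by a rename, so only the first-install case is a single atomic step in this sketch.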

Out of scope

  • Hidden `python -m roar.runtime install --python …` entrypoint for power-user pre-warming. Easy follow-up if needed.
  • Cache management UX (`list`, `remove`). Same — follow-up.
  • Docs page update for `/docs/installation` describing the lazy-install behavior. Will follow on glaas-site.

Merge sequencing

Stacked on `cg/abi-gate-backend-init` (PR #107). When #107 lands on `main`, this branch rebases cleanly. The diff against #107's branch is the meat of this PR; the diff against main shows both PRs' changes together.

🤖 Generated with Claude Code

chrisgeyertreqs and others added 2 commits May 15, 2026 19:02
…n roar run

Solves the cross-Python first-run experience: `uv tool install roar-cli`
installs roar under one CPython (typically uv's default), but the user
runs `roar run python3 …` against a different one. Without a matching
ABI tree, backend dispatch loads roar's bundled (wrong-ABI) compiled
deps and crashes inside pydantic_core.

The previous PR (#107) gates that crash gracefully. This one closes the
loop: probe the target Python's ABI before launching, install a matching
runtime tree at `~/.cache/roar/runtime/<tag>/` on the fly (or use the
existing cache), and prepend it to ROAR_RUNTIME_PYTHONPATH so the
traced process picks up ABI-correct compiled deps.

What's added

- `roar/execution/runtime/abi_probe.py` — `probe_python_abi(executable)`
  runs the target Python in a one-shot subprocess to read its
  `sys.implementation.cache_tag`. Bails fast for non-Python targets
  (bash/make/etc.) so `roar run cmd` doesn't pay a probe cost for
  things it couldn't lazy-install for anyway.
- `roar/execution/runtime/lazy_install.py` — XDG-respecting cache
  (`$XDG_CACHE_HOME/roar/runtime/<tag>/` or `~/.cache/...`), atomic
  install via `uv pip install --target --python` (with plain pip
  fallback), roar-version-stamped cache invalidation, and the
  `ensure_runtime(...)` orchestrator. Failures return None — the
  sitecustomize gate handles fallback.
- `runtime.install` config key — `auto` (default, lazy install on
  mismatch) or `skip` (use bundled only; backend dispatch off on
  mismatch). Env `ROAR_RUNTIME_INSTALL` overrides. The `skip` mode
  covers restricted-network containers where lazy-install would fail
  anyway.
- `TracerService._lazy_install_runtime_entries(command, roar_dir)` —
  hooks the probe + ensure_runtime into `execute()` right before
  `ROAR_RUNTIME_PYTHONPATH` is set.
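The mode resolution described above (env over config, `auto` as the default, case-insensitive values) might be sketched like this. `resolve_install_mode` is a hypothetical name, and ignoring unrecognized values rather than raising is an assumption, not something the commit states.

```python
import os

VALID_MODES = ("auto", "skip")


def resolve_install_mode(config_value=None, env=None) -> str:
    """Pick the install mode: env var first, then config, then the default."""
    env = os.environ if env is None else env
    for candidate in (env.get("ROAR_RUNTIME_INSTALL"), config_value):
        if candidate:
            mode = candidate.strip().lower()
            if mode in VALID_MODES:
                return mode
            # Assumption: unrecognized values fall through to the next source.
    return "auto"
```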

Gate refactor (touches #107 territory)

The original ABI-tag check (`bundled_abi_tag` + `abi_minor_version`)
parsed roar's bundled `.so` filenames. That works for the bundled-only
case but is blind to a lazy-installed runtime tree on the same path.
Replaced with `matching_compiled_pydantic_core(sys.path, expected_soabi)`
— walks sys.path for a pydantic_core SO whose filename matches the
running interpreter's SOABI. Composes naturally with the lazy-install
path (matching SO anywhere on sys.path satisfies the gate).

The old helpers are kept (still tested) for future use and to avoid
churning the API surface of `support.py`.

User-visible

- Lazy-install path emits a single 🦖 line on cache miss:
    🦖 installing roar runtime for cp312 ...
- Cache hits are silent.
- Skip mode + ABI mismatch falls through to the gate's actionable
  message, which now also recommends `pip install roar-cli` for
  single-Python container environments.

Test coverage

- `test_abi_probe.py`: success / non-Python target / subprocess failure
  / blank stdout / versioned python names.
- `test_lazy_install.py`: cache layout, stamp invalidation, atomic
  install (mocked subprocess, real tempdir/rename), failure paths,
  mode resolution (env / config / default / case normalization), and
  the `ensure_runtime` decision tree (match / skip / cache-hit /
  cache-miss-install / install-fail).
- `test_inject_support.py`: `matching_compiled_pydantic_core` finds in
  bundled, finds in lazy runtime, returns False on mismatch / missing
  pkg / wrong extension / blank entries.

968 unit tests passing, 1 pre-existing skipped.

Stacked on #107 (the gate). When #107 merges, this branch rebases
cleanly onto main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Friction-journal feedback on #108: after the cross-Python fixes,
`roar show @N` reports `Environment: ... Python 3.13.x` for jobs whose
command was system `python3` (3.12). The Python identity in job
metadata was being captured from `platform.python_version()` in roar's
own host process (`runtime_collector.collect`), not from the traced
child. Right at the seam roar is selling reproducibility on.

Captures `python_version` + `python_implementation` from
`platform.{python_version,python_implementation}()` *inside* the traced
process in `tracker.write_log`, threads them through the
`PythonInjectData` model, and has `runtime_collector` prefer the
traced values (falling back to host values when the inject log is
missing the fields, e.g. older logs).

Also: stderr remediation in `sitecustomize.py` now suggests
`uv tool install --python pythonX.Y roar-cli --reinstall` instead of
`--force`. Same effect, accurate user-facing semantics, less alarming.

Tests:
- `test_runtime_tracker_writes_expected_log_payload` asserts
  `python_version` (with at least 2 dots, e.g. "3.12.3") and a
  non-empty `python_implementation`.
- 968 unit tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@christophergeyer
Member Author

Both review points addressed in the latest push (74958a1):

1. roar show @N now reflects the traced child's Python, not roar's host

You were right that the Environment block was misrepresenting which Python ran the user's code, which is exactly the wrong place for that to drift. Root cause: `runtime_collector.collect()` was calling `platform.python_version()` in roar's own process, then writing the result onto the job's metadata regardless of what the traced child actually was.

Fix:

  • `tracker.write_log()` now writes `python_version` + `python_implementation` into the inject log, captured from inside the traced child via `platform.python_version()` / `platform.python_implementation()`.
  • The `PythonInjectData` model gains both fields (defaulting to `""` for backward compatibility with older saved logs).
  • `runtime_collector.collect()` prefers `python_data.python_version` / `python_data.python_implementation` and falls back to host `platform.*` only when the inject log is empty or silent on them.

After this change, a `roar run` invoked under system `python3` (3.12) records Python 3.12.x even when roar-cli itself was installed under 3.13.
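The "prefer traced values, fall back to host" rule can be illustrated with a small sketch. `InjectIdentity` is a hypothetical stand-in for `PythonInjectData`, and `resolve_python_identity` is an invented helper name; only the fallback behavior is taken from the description above.

```python
import platform
from dataclasses import dataclass


@dataclass
class InjectIdentity:
    """Hypothetical stand-in for the two new PythonInjectData fields."""

    python_version: str = ""
    python_implementation: str = ""


def resolve_python_identity(data):
    """Prefer what the traced child reported; fall back to the host process."""
    traced_version = data.python_version if data else ""
    traced_impl = data.python_implementation if data else ""
    version = traced_version or platform.python_version()
    implementation = traced_impl or platform.python_implementation()
    return version, implementation
```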

2. --force → --reinstall in the remediation message

Fixed. The gate's stderr now reads:

- Reinstall roar-cli under matching Python:
    uv tool install --python python3.12 roar-cli --reinstall

Tests added (existing tracker test now asserts the two new payload keys); 968 unit tests passing.

…system

Friction-journal reported two bugs blocking the lazy-install path:

1. `from roar import __version__` failed because `roar/__init__.py` was
   empty. The tracer launcher's `try/except Exception` swallowed it as
   a debug log, so the entire `_lazy_install_runtime_entries` path
   returned [] silently — no 🦖 line, fall through to the gate. Dead
   code on every install.

   Adds canonical `__version__` to `roar/__init__.py` via
   `importlib.metadata.version("roar-cli")` with a sentinel fallback
   when metadata is absent. Narrows the launcher's except to
   ImportError and bumps the log level to warning — an ImportError on
   these internal modules is a contract violation, not a user
   environment issue, and shouldn't hide.

2. `sitecustomize._append_roar_runtime_pythonpath` *appended*
   ROAR_RUNTIME_PYTHONPATH entries to sys.path, so the lazy-installed
   cache landed at the END. System site-packages (with stale
   typing_extensions / pydantic) won. pydantic_core loaded from the
   cache but its `from typing_extensions import Sentinel` resolved to
   the system's older copy and crashed exactly like before.

   Renamed to `_prepend_roar_runtime_pythonpath`. Uses
   `sys.path[:0] = new_paths` to prepend the whole list in declared
   order, so the cache dir lands at `sys.path[0]`. The original "let
   the workload's own venv keep precedence" intent — preserved by
   appending — only mattered when the user already had matching deps.
   In the lazy-install scenario the gate didn't trip, which means the
   user's env didn't have matching deps; roar's ABI-matched copy is
   the better answer.

Also addresses the user's diagnostic comment: `probe_python_abi`'s
narrow `except (OSError, SubprocessError)` let a test-mocked
subprocess.Popen leak a ValueError out into the launcher. Broadened
to `except Exception` with a comment — the probe's contract is "tag
or None", weird mock returns degrade to None.

🦖 message gains a one-time/cached tail so a slow network doesn't
read as a hang:
  🦖 installing roar runtime for cp312 ... (one-time per Python; cached)

Tests:
- `tests/unit/test_roar_version.py` — `from roar import __version__`
  must resolve to a non-empty string. Catches the regression at the
  package level so it can't slip through.
- `tests/execution/runtime/test_sitecustomize_path_order.py` —
  subprocess test: when roar isn't already importable, the entries
  from ROAR_RUNTIME_PYTHONPATH land at the front of sys.path in
  declared order. When roar IS already importable (the merged case),
  the function early-returns and leaves sys.path alone.

971 unit tests passing, 1 pre-existing skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@christophergeyer
Member Author

Both lazy-install blockers fixed in c5a1d27 (just pushed).

1. roar.__version__ exists now

`roar/__init__.py` declares the canonical `__version__` via `importlib.metadata.version("roar-cli")`. The tracer launcher's `try/except Exception` is narrowed to `ImportError` and bumped to warning log level: an import failure here is an internal contract violation, not a user-environment issue, and shouldn't hide.

The regression is now caught by a new package-level test that runs `from roar import __version__` and asserts the result is a non-empty string.

2. Cache now beats system site-packages

The injection function (now correctly named `_prepend_roar_runtime_pythonpath`) does `sys.path[:0] = new_paths` instead of `sys.path.append`. The cache dir lands at `sys.path[0]`, so the ABI-matched `typing_extensions` / `pydantic_core` win over the system's stale copies.

Per your reasoning: the prepend "wins" only when the user's env was already mismatched (otherwise the gate would have passed and lazy-install wouldn't fire), and in that case roar's copy is the right answer. Documented in the docstring.

A subprocess test was added that exercises both the prepend codepath (entries land at `sys.path[0..N]` in declared order) and the early-return path (when roar is already importable, it is a no-op).
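The slice-assignment prepend can be shown on a plain list. This sketch uses an invented helper, `prepend_paths`, and the handling of blank and repeated entries is an assumption; the real function also has the early-return path, which is omitted here.

```python
import os


def prepend_paths(path_list, raw_value):
    """Insert the os.pathsep-separated entries at the front, in declared order."""
    new_paths = []
    for entry in raw_value.split(os.pathsep):
        # Assumption: blank entries and repeats within the value are dropped.
        if entry and entry not in new_paths:
            new_paths.append(entry)
    # Slice assignment inserts the whole list at index 0 without reversing it,
    # unlike calling insert(0, ...) once per entry.
    path_list[:0] = new_paths
    return new_paths
```

Called as `prepend_paths(sys.path, os.environ.get("ROAR_RUNTIME_PYTHONPATH", ""))`, the first declared entry ends up at `sys.path[0]`.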

Bonus fixes from your report

  • `probe_python_abi` had `except (OSError, SubprocessError)`, so a test-mocked `subprocess.Popen` could leak a `ValueError` out. Broadened to `except Exception` with a comment that the probe's contract is "tag or None."
  • 🦖 message gained a tail so a slow network doesn't read as a hang:
    🦖 installing roar runtime for cp312 ... (one-time per Python; cached)
    

971 unit tests, 1 pre-existing skipped. Ruff clean.

Note on your monkey-patched `__init__.py` on the test machine: once this lands and you `uv tool install --reinstall …`, the canonical version will be in place.

…seam

Friction-journal caught the missing reader-side half of 74958a1: the
tracker writes `python_version` and `python_implementation` into the
inject JSON, but `DataLoaderService.load_python_data` was constructing
`PythonInjectData` without extracting those keys. They defaulted to ""
and `runtime_collector` then fell back to `platform.python_version()`
— host Python, not the traced child's. So `roar show @N` still
reported the host's 3.13.12 for a 3.12 traced run.

Two missing lines in data_loader.py plus tests at the seam:

- `test_python_identity_keys_flow_through` — explicit reader-side test:
  python_version + python_implementation make it from a JSON payload
  into the model.
- `test_python_identity_keys_default_to_empty_for_older_logs` —
  back-compat: older inject logs without the keys still load cleanly.
- `test_writer_reader_roundtrip_carries_python_identity` — the
  reviewer's specifically-requested coverage: the real tracker writes,
  the real loader reads, the version comes out non-empty. Would have
  caught both halves of the original asymmetry — if either side
  regresses, this test fails.

974 unit tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@christophergeyer
Member Author

Reviewer's catch addressed in `b1cd824` (just pushed).

You were right — the writer/reader asymmetry was the textbook missing half. `tracker.write_log` was writing `python_version` / `python_implementation` into the inject JSON, but `DataLoaderService.load_python_data` wasn't extracting those keys when constructing `PythonInjectData`. They defaulted to `""`, so `runtime_collector` hit the `or platform.python_version()` fallback and stamped roar's host Python on the job. Two missing lines.

Tests added at the seam — the third one is the round-trip you specifically asked for:

  • `test_python_identity_keys_flow_through` — reader-side: JSON in, model out, keys present.
  • `test_python_identity_keys_default_to_empty_for_older_logs` — back-compat: older inject logs without the keys load cleanly.
  • `test_writer_reader_roundtrip_carries_python_identity` — real `RuntimeInjectionTracker.write_log` writes, real `DataLoaderService.load_python_data` reads, asserts `python_version` survives. Would have caught both halves; if either side regresses, this test fails.

An in-process test can only round-trip with the test runner's own interpreter, so it asserts the seam (a non-empty, dot-separated version) rather than a specific value. The end-to-end "actual mismatched Python" test you described is still the strongest version; it is worth doing as an integration test, but it needs a second CPython on PATH in CI. Worth following up on after the merge.
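The writer/reader seam can be sketched with stand-ins for the two sides. `InjectData`, `write_log`, and `load_python_data` below are simplified, hypothetical versions of `PythonInjectData`, `RuntimeInjectionTracker.write_log`, and `DataLoaderService.load_python_data`; the point is that the reader must extract exactly the keys the writer emits.

```python
import json
import platform
from dataclasses import dataclass
from pathlib import Path


@dataclass
class InjectData:
    """Simplified stand-in for the model's two identity fields."""

    python_version: str = ""
    python_implementation: str = ""


def write_log(path: Path) -> None:
    """Writer side: record the identity of the interpreter running this code."""
    payload = {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
    }
    path.write_text(json.dumps(payload))


def load_python_data(path: Path) -> InjectData:
    """Reader side: missing keys default to "" so older logs still load."""
    raw = json.loads(path.read_text())
    return InjectData(
        python_version=raw.get("python_version", ""),
        python_implementation=raw.get("python_implementation", ""),
    )
```

A round-trip test writes with `write_log`, reads with `load_python_data`, and asserts the version is non-empty and dot-separated; dropping either `raw.get(...)` line makes it fail.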

974 unit tests, 1 skipped, ruff clean.

CI's repo-wide `ruff format --check` caught a few black-style line
wraps in the lazy-install test file that my targeted format runs
earlier in this branch missed. Pure formatting; no behavioral change.
…install

`runtime_install_mode(start_dir: Path | None = None)` was passing a
`Path` through to `config_get`, which is typed as `str | None`. mypy
caught it in CI. One-line coercion at the call site.

Full-repo mypy (361 source files) clean.
…able

CI's test-macos (macos-15-intel, x86_64) saw two integration-test
timeouts on this branch — `test_dag_deep_nesting` and
`test_register_private_without_project_binding_uses_current_user_scope`.
Both run several `roar run` invocations in a single test and ride the
60s pytest-timeout. macOS arm64 and Linux all passed; the smoking gun
is Intel macOS framework Python's notoriously slow startup combined
with the probe subprocess this branch adds to every `roar run`.

The modal case for `roar run python foo.py` is that the target Python
is exactly the one roar-cli itself runs under (especially in tests
where the fixture provides `python_exe` = the test runner's
interpreter). In that case the ABI matches by construction and the
probe subprocess is pure overhead — N python startups per N-step test.

Adds a fast path that resolves both `command[0]` and `sys.executable`
through `shutil.which` + `os.path.realpath` and returns early on match.
Probe still runs when the target genuinely differs (the cross-Python
case lazy-install was built for).
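A rough sketch of that fast path, assuming hypothetical helper names `resolve_executable` and `targets_host_python`:

```python
import os
import shutil
import sys


def resolve_executable(name):
    """Resolve a command name or path to a canonical absolute path, or None."""
    found = shutil.which(name)
    return os.path.realpath(found) if found else None


def targets_host_python(command) -> bool:
    """True when command[0] is the very interpreter running this process."""
    if not command:
        return False
    target = resolve_executable(command[0])
    host = resolve_executable(sys.executable)
    # Same resolved binary means the same ABI, so the probe can be skipped.
    return target is not None and target == host
```

When this returns True the launcher can return early; any other result falls back to the subprocess probe.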

Verified locally:
- ruff check + ruff format --check (repo-wide)
- mypy roar (361 source files)
- full unit + execution test sweep (974 passed, 1 skipped)