diff --git a/README.md b/README.md index e69de29..83bda3b 100644 --- a/README.md +++ b/README.md @@ -0,0 +1,278 @@ +# Frontier SWE OpenEnv + +A family of long-horizon software-engineering environments for [OpenEnv](https://github.com/rycerzes/OpenEnv), packaged as Docker images and mirrored to Hugging Face Spaces. Each task exposes the same OpenEnv-shaped **FastAPI** surface (Gym-style `/reset`, `/step`, `/state`, `/health`) plus **MCP** tools for planning and submission. A **composite rubric** (workspace gates, structured or regex-based L1 scores, optional LLM code and plan review) aggregates into a normalised episode reward. + +This repository is organised like a small monorepo: shared Python server and client live under `frontier_swe_env/`, task assets under `tasks//`, and each deployable Space under `spaces//` (Dockerfile, README with HF card front matter, and `openenv.yaml`). + +These environments are **adapted from the [FrontierSWE](https://www.frontierswe.com/) benchmark** ([`proximal-labs/frontier-swe`](https://github.com/proximal-labs/frontier-swe) on GitHub): long-horizon systems and performance problems repackaged as OpenEnv-shaped services with a shared rubric and MCP tooling. The **Tasks** table below links each OpenEnv task **one-to-one** to its official FrontierSWE write-up. + +## Features + +- **Shared runtime**: One FastMCP/OpenEnv stack per image; task-specific workspace, verifier, and instructions are baked into the image. +- **Gym-style control**: `POST /reset`, `POST /step`, `GET /state`, `GET /health` for training and evaluation harnesses. +- **MCP for agents**: OpenEnv JSON-RPC at `POST /mcp`, and Streamable HTTP for adapters at `/tools/mcp` (POST and GET/SSE). +- **Episode tools**: `submit_plan`, `submit_subtask`, `get_status`, `advance` (see `openenv.yaml` and each Space manifest). 
+- **Multi-layer scoring**: Gate scripts, L1 (tests, `reward.json`, or regex ratio), L2/L3 LLM judges when grader API env vars are set, then a weighted episode blend. + +## Tasks + +| Task ID | Domain | FrontierSWE write-up | OpenEnv manifest | Hugging Face Space | GHCR image | +| --- | --- | --- | --- | --- | --- | +| `notebook-compression` | Systems / compression | [Notebook compression](https://www.frontierswe.com/notebook-compression) | [`spaces/notebook/openenv.yaml`](spaces/notebook/openenv.yaml) | [rycerzes/frontier-swe-notebook](https://huggingface.co/spaces/rycerzes/frontier-swe-notebook) | `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest` | +| `postgres-sqlite-wire-adapter` | Systems / databases / Zig | [PostgreSQL on SQLite](https://www.frontierswe.com/postgres-sqlite-wire-adapter) | [`spaces/postgres/openenv.yaml`](spaces/postgres/openenv.yaml) | [rycerzes/frontier-swe-postgres](https://huggingface.co/spaces/rycerzes/frontier-swe-postgres) | `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-postgres:latest` | +| `dependent-type-checker` | PL / type theory | [Dependent type checker](https://www.frontierswe.com/dependent-type-checker) | [`spaces/type-checker/openenv.yaml`](spaces/type-checker/openenv.yaml) | [rycerzes/frontier-swe-type-checker](https://huggingface.co/spaces/rycerzes/frontier-swe-type-checker) | `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-dependent-type-checker:latest` | +| `libexpat-to-x86asm` | Systems / x86-64 assembly / XML | [libexpat to assembly](https://www.frontierswe.com/libexpat-to-x86asm) | [`spaces/libexpat-to-x86asm/openenv.yaml`](spaces/libexpat-to-x86asm/openenv.yaml) | [rycerzes/frontier-swe-libexpat-to-x86asm](https://huggingface.co/spaces/rycerzes/frontier-swe-libexpat-to-x86asm) | `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest` | + +Authoritative package metadata for tooling (for example `openenv pull`) lives in the root [`openenv.yaml`](openenv.yaml). 
+ +## Task assets and runtime configuration + +The repo splits responsibilities in two places that sound similar but are **not** duplicates of each other: + +| Location | Role | +| --- | --- | +| [`tasks//`](tasks/) | **Problem pack** checked into git: human-facing `instruction.md`, verifier shell scripts, Python helpers such as `compute_reward.py`, hidden tests, datasets, and anything the Dockerfile `COPY`s into the image. This is where each task’s **reward semantics** are actually implemented (what gets run, what gets written to disk, what counts as a hard fail). | +| [`frontier_swe_env/tasks/`](frontier_swe_env/tasks/) | **Python registry** of [`TaskConfig`](frontier_swe_env/task_config.py) factories (`pg.py`, `notebook_compression.py`, …). Each module describes how the **running server** should drive scoring: paths **inside the container**, the L1 command string, `l1_score_mode`, JSON paths and anchors, timeouts, episode limits, and text used for L2/L3 LLM prompts. | + +**Build time.** Per-task Dockerfiles under [`docker/`](docker/) copy a slice of `tasks//` into fixed locations (for example verifier assets under `/opt/verifier/`, full instructions at `/app/instruction.md` or `/opt/task/instruction.md`, workspaces under `/app/...`). Those paths are what the verifier scripts assume. + +**Run time.** [`FrontierSweEnvironment`](frontier_swe_env/server/frontier_swe_env_environment.py) loads a `TaskConfig` via [`get_task_config`](frontier_swe_env/tasks/__init__.py). The task is selected with environment variables (defaults match the image): + +- `FSWE_TASK_NAME` — logical name (`postgres`, `notebook-compression`, `dependent-type-checker`, `libexpat-to-x86asm`, …); aliases like `pg` or `type-checker` map to the same factories. +- `FSWE_TASK_MODE` — `training` vs `demo` (different budgets, attempts, and sometimes instruction source). + +From that single config object the environment wires **shared** rubric classes to **task-specific** commands and parsers: + +1. 
**Gate checks** — shell script from `TaskConfig.gate_script_path` (baked from `tasks/...` into the image). +2. **L1** — [`TestOutputRubric`](frontier_swe_env/rubrics/l1_tests.py) runs `TaskConfig.visible_test_command`. Depending on `l1_score_mode`, it either parses **stdout** with a regex (`ratio` and similar) or reads a structured **`reward.json`** after the verifier finishes (`reward_json` vs `reward_json_score`). Each task’s verifier under `tasks//tests/` is responsible for producing the format its mode expects. +3. **L2 / L3** — LLM judges use `task_description`, `task_domain`, and `scoring_context` from `TaskConfig` so prompts stay aligned with that task even though the judge code is shared. +4. **Episode reward** — [`EpisodeRubric`](frontier_swe_env/rubrics/episode_rubric.py) blends plan quality, mean frozen subtask scores, completion, and tool usage using weights from the same `TaskConfig`. + +So: **`tasks/` defines what “correct” means operationally**; **`frontier_swe_env/tasks/` tells the server how to invoke and normalise that signal** inside the shared OpenEnv stack. + +**`spaces/*/openenv.yaml`.** These manifests document the Space for judges and tooling (rubric layers, metrics, HF metadata). They should stay **consistent** with the Python `TaskConfig` and Docker layout for the same task. The live server inside the image is driven by **`TaskConfig` + env vars**, not by parsing `openenv.yaml` at runtime. 
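As a rough illustration of the episode blend in step 4 above, the sketch below computes a clipped weighted sum of the four components. It is a minimal sketch under stated assumptions, not the `EpisodeRubric` implementation: the `EpisodeWeights` dataclass, its field names, the example weight values, and the clamp to the unit interval are all hypothetical; the real weights come from `TaskConfig`.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class EpisodeWeights:
    """Hypothetical weight container; real values are defined per task in TaskConfig."""
    plan: float = 0.2
    subtasks: float = 0.5
    completion: float = 0.2
    tool_usage: float = 0.1


def blend_episode_reward(
    plan_score: float,
    frozen_subtask_scores: Sequence[float],
    completion: float,
    tool_usage: float,
    weights: EpisodeWeights = EpisodeWeights(),
) -> float:
    """Blend the four components into one scalar, clamped to [0, 1]."""
    # An episode with no frozen subtasks contributes 0 for that component (assumption).
    mean_subtask = fmean(frozen_subtask_scores) if frozen_subtask_scores else 0.0
    raw = (
        weights.plan * plan_score
        + weights.subtasks * mean_subtask
        + weights.completion * completion
        + weights.tool_usage * tool_usage
    )
    return min(1.0, max(0.0, raw))
```

Keeping the weights in one immutable object mirrors the idea that a single config drives the blend, so changing a task's emphasis would not require touching the blending code.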
+ +```mermaid +flowchart LR + subgraph repo["Git repo"] + TPACK["tasks/task-id/"] + TPY["frontier_swe_env/tasks/*.py"] + DOCK["docker/Dockerfile.*"] + end + subgraph image["Task Docker image"] + WS["Workspace /app/..."] + VER["Verifier /opt/verifier/"] + RJ["/logs/verifier/reward.json optional"] + end + subgraph runtime["Python server"] + CFG["TaskConfig"] + ENV["FrontierSweEnvironment"] + R1["Gates + L1 + L2 + L3 + EpisodeRubric"] + end + TPACK --> DOCK + DOCK --> WS + DOCK --> VER + TPY --> CFG + CFG --> ENV + VER --> RJ + ENV --> R1 + VER -.->|"subprocess"| R1 +``` + +### L1 score modes (per-task flavour) + +[`TaskConfig.l1_score_mode`](frontier_swe_env/task_config.py) selects how L1 turns verifier output into a number in \([0, 1]\): + +| Mode | Typical task | Meaning | +| --- | --- | --- | +| `ratio` | Postgres wire adapter | Regex on test runner stdout (`Total: N/M passed`). | +| `reward_json` | Notebook compression | Verifier writes JSON (e.g. `geom_mean_ratio`, `status`); normalisation is mode-specific in `TestOutputRubric`. | +| `reward_json_score` | Dependent type checker, libexpat assembly | Verifier writes a numeric `score` (field configurable); linear map between `reward_json_score_anchors`, optional hard-fail handling. | + +Adding a new task usually means: add `tasks/new-task/`, extend a Dockerfile to copy it, add `frontier_swe_env/tasks/new_task.py` plus [`register_task`](frontier_swe_env/tasks/__init__.py), and add a Space manifest under `spaces/`. + +## Task catalog + +Short descriptions of what each episode asks for and how **L1** is determined. (Gates, L2 code review, L3 plan review, and episode blending behave the same way structurally; only L1 and task copy differ.) + +### Notebook compression (`notebook-compression`) + +Agents implement a **lossless** Jupyter `.ipynb` codec as `/app/run` with `fit` / `compress` / `decompress` stages. 
The hidden verifier under [`tasks/notebook-compression/tests/`](tasks/notebook-compression/tests/) runs the full pipeline and writes [`reward.json`](tasks/notebook-compression/tests/compute_reward.py) with corpus-driven metrics; **byte-exact round-trip** failures are hard fails. Python config in [`notebook_compression.py`](frontier_swe_env/tasks/notebook_compression.py) sets `l1_score_mode="reward_json"`, long `l1_timeout_s`, and `scoring_context` for judges. Benchmark write-up: [FrontierSWE — Notebook compression](https://www.frontierswe.com/notebook-compression). + +### Postgres / SQLite wire adapter (`postgres-sqlite-wire-adapter`) + +Agents implement a **Zig** binary that speaks enough of the **PostgreSQL wire protocol** to satisfy a tiered compat suite while using **SQLite** for storage. L1 is primarily **`ratio`** mode: the configured command runs [`pg_compat_test.sh`](tasks/postgres-sqlite-wire-adapter/tests/pg_compat_test.sh)-style output and the rubric parses pass counts from stdout. Config and copy live in [`pg.py`](frontier_swe_env/tasks/pg.py) and [`tasks/postgres-sqlite-wire-adapter/`](tasks/postgres-sqlite-wire-adapter/). Benchmark write-up: [FrontierSWE — PostgreSQL on SQLite](https://www.frontierswe.com/postgres-sqlite-wire-adapter). + +### Dependent type checker (`dependent-type-checker`) + +Agents implement a **Rust** type checker for a small dependently typed surface language; the release binary is exercised by a large accept/reject corpus plus latency benchmarks vs a reference. The verifier emits **`reward_json_score`** with gates on accept/reject rates and anti-cheat signals in JSON. Anchors and timeouts are set in [`dependent_type_checker.py`](frontier_swe_env/tasks/dependent_type_checker.py); the heavy spec and tests live under [`tasks/dependent-type-checker/`](tasks/dependent-type-checker/). Benchmark write-up: [FrontierSWE — Dependent type checker](https://www.frontierswe.com/dependent-type-checker). 
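For contrast with the JSON-based modes, here is a minimal sketch of how a `ratio`-style L1 score could be derived from test-runner stdout of the form `Total: N/M passed`, the format named in the L1 score modes table. The regex and the `parse_ratio` helper are assumptions for illustration, not the pattern used by `TestOutputRubric`.

```python
import re

# Hypothetical pattern for a summary line such as "Total: 17/40 passed".
TOTAL_LINE = re.compile(r"Total:\s*(\d+)\s*/\s*(\d+)\s+passed")


def parse_ratio(stdout: str) -> float:
    """Return passed/total from the last summary line, or 0.0 if none is usable."""
    matches = TOTAL_LINE.findall(stdout)
    if not matches:
        return 0.0  # no summary line: treat as a failed run (assumption)
    passed, total = (int(x) for x in matches[-1])
    if total == 0:
        return 0.0  # avoid division by zero on an empty suite
    return min(1.0, passed / total)
```

Taking the last match rather than the first is a deliberate choice in this sketch: a runner that prints per-tier totals before a final total would otherwise be scored on the first tier only.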
+ +### libexpat to x86-64 assembly (`libexpat-to-x86asm`) + +Agents produce **`/app/asm-port/libexpat.so`** implementing the **libexpat C ABI** in assembly (no vendored C core). The verifier builds reference C libexpat, runs upstream tests and benchmarks, and writes a `reward.json` that is scored in **`reward_json_score`** mode (correctness plus performance, with hard fails for missing `.so` or anti-cheat). See [`libexpat_to_x86asm.py`](frontier_swe_env/tasks/libexpat_to_x86asm.py) and [`tasks/libexpat-to-x86asm/`](tasks/libexpat-to-x86asm/). Benchmark write-up: [FrontierSWE — libexpat to x86-64 assembly](https://www.frontierswe.com/libexpat-to-x86asm). + +## Quick start + +### Install (Python 3.13) + +```bash +uv sync +``` + +Optional extras: + +```bash +uv sync --extra test +``` +For local training: +```bash +uv sync --extra training +``` + +### Run the API locally (development) + +The full task workspace and verifiers are intended to run inside the published Docker images. For a minimal local smoke test of the HTTP app only: + +```bash +uv run uvicorn frontier_swe_env.server.app:app --host 127.0.0.1 --port 8000 --reload +``` + +Then open `http://127.0.0.1:8000/health`. + +### Run a task image + +Replace the image tag with the task you need (see table above). Grader-related env vars are optional unless you want LLM rubric layers to run inside the container. + +```bash +docker run --rm -p 8000:8000 \ + -e FSWE_GRADER_MODEL=... \ + -e FSWE_GRADER_API_URL=... \ + -e FSWE_GRADER_API_KEY=... \ + ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-postgres:latest +``` + +For an end-to-end baseline over WebSocket (connect, `reset`, repeated `step`), see [`scripts/run_baseline.py`](scripts/run_baseline.py). 
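Once a container is up, a harness typically waits for `/health` before calling `/reset`. The following is a minimal standard-library sketch, assuming the service listens on `http://localhost:8000` as in the `docker run` example and that any HTTP 200 from `/health` means ready (the response body is not inspected, since its shape is not documented here). `http_ok` and `wait_for_health` are hypothetical helper names, not part of the repository.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_for_health(
    base_url: str,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses."""
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    ready = wait_for_health("http://localhost:8000", timeout_s=120.0)
    print("ready" if ready else "service did not become healthy in time")
```

The `probe` parameter exists only so the polling loop can be exercised without a running server; in normal use the default is sufficient.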
+ +## Python client + +```python +import asyncio +from frontier_swe_env.client import FrontierSweEnv +from frontier_swe_env.models import FrontierSweAction + + +async def main(): + client = FrontierSweEnv(base_url="http://localhost:8000") + await client.connect() + try: + result = await client.reset() + print(result.observation.phase) + result = await client.step(FrontierSweAction(message="Your turn")) + print(result.observation.response) + finally: + await client.close() + + +asyncio.run(main()) +``` + +The client maintains a WebSocket session to the server; see `FrontierSweEnv` in [`frontier_swe_env/client.py`](frontier_swe_env/client.py) for `from_docker_image` and timeout options. + +## MCP tools (all tasks) + +| Tool | Purpose | +| --- | --- | +| `submit_plan` | Propose subtasks (`id`, `description`, `acceptance_criteria`); moves PLANNING → EXECUTING. | +| `submit_subtask` | Run L1 + L2 scoring for the given `subtask_id`. | +| `get_status` | Snapshot of phase, scores, time remaining, feedback. | +| `advance` | Freeze the current subtask score and advance to the next. | + +Implementations are registered in [`frontier_swe_env/server/mcp_tools.py`](frontier_swe_env/server/mcp_tools.py). + +## Environment variables + +Typical deployment sets **agent** variables (for the in-container coding harness) and **grader** variables (for LLM rubric layers): + +| Prefix | Role | +| --- | --- | +| `FSWE_AGENT_MODEL`, `FSWE_AGENT_API_URL`, `FSWE_AGENT_API_KEY` | Agent LLM (also used to generate `/root/.pi/agent/models.json` in the entrypoint when `FSWE_AGENT_API_URL` is set). | +| `FSWE_GRADER_MODEL`, `FSWE_GRADER_API_URL`, `FSWE_GRADER_API_KEY` | LLM judges for L2/L3 layers in the rubric. | + +Exact behaviour is defined per task in each Space `openenv.yaml` under `rubric.layers`. + +## Hugging Face Spaces + +CI assembles a minimal Space directory (root `Dockerfile`, `README.md`, `openenv.yaml`) from `spaces//` via [`scripts/prepare_hf_space.py`](scripts/prepare_hf_space.py). 
The **HF — Sync** workflow pushes to `spaces/{HF_OWNER}/frontier-swe-{notebook|postgres|type-checker|libexpat-to-x86asm}` after images build on `main`. + +## Training (offline RL) + +A single Frontier SWE episode often runs on the order of **45 minutes to about 90 minutes**, depending on the task, verifier cost, and agent behaviour. That makes dense **online** RL on live environments impractical at scale, so this project uses **offline RL**: collect fixed trajectories, post-process rewards and hindsight signals, build a static training set, then fine-tune on Hugging Face with **Trackio** for metrics. + +For **why not GRPO/DPO alone**, **paper vs code** differences, and **equations** mapped to [`scripts/compute_hindsight_scores.py`](scripts/compute_hindsight_scores.py), [`scripts/build_hcapo_dataset.py`](scripts/build_hcapo_dataset.py), and [`training/train_hcapo.py`](training/train_hcapo.py), see [`training/README.md`](training/README.md). + +The walk-through below uses the **`postgres-sqlite-wire-adapter`** task as the reference pipeline. + +### Data collection and post-processing + +1. **Rollouts** — [`scripts/collect_trajectories.py`](scripts/collect_trajectories.py) was used to gather **20 episodes** on a **2× NVIDIA A100** host running **sglang**, with the agent powered by **[`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B)** (Qwen 3.6 27B). Run id **pg-01** labels this batch in tooling and dataset names. +2. **Backfill** — Some episodes finished without a persisted **`episode_reward`** because of a server-side bug; [`scripts/backfill_rewards.py`](scripts/backfill_rewards.py) was run to fill those fields from episode metadata. +3. **Hindsight** — [`scripts/compute_hindsight_scores.py`](scripts/compute_hindsight_scores.py) was run with the same **Qwen 3.6 27B** stack to attach per-step hindsight quantities (HCAPO-style) for training. 
For how that differs from the original HCAPO formulation (paper [2603.08754](https://arxiv.org/abs/2603.08754)), formulae, and design rationale, see [`training/README.md`](training/README.md). + +The **raw trajectory bundle** (per-episode `result.json`, `pi_session.jsonl`, `container_logs.txt`, optional `hindsight_scores.json`) is published on Hugging Face as **[`rycerzes/fswe-pg-01-traj-q36-27b`](https://huggingface.co/datasets/rycerzes/fswe-pg-01-traj-q36-27b)**. + +### HCAPO dataset build + +From a local `trajectories/` tree, the JSONL used for fine-tuning was produced with: + +```bash +uv run python scripts/build_hcapo_dataset.py \ + --input-dir trajectories \ + --output-dir datasets \ + --min-reward 0.05 \ + --omega 1.0 +``` + +The resulting **HCAPO training set** is **[`rycerzes/fswe-hcapo-pg-01-trajectories`](https://huggingface.co/datasets/rycerzes/fswe-hcapo-pg-01-trajectories)** (messages + step advantages derived from the pg-01 trajectories). + +### Fine-tuning run + +Training was launched with: + +```bash +./scripts/launch_hf_space.sh --with-dataset-upload +``` + +That configuration runs **3 epochs** over **18 optimizer steps** on the Space-backed trainer (dataset upload + run as implemented in [`scripts/launch_hf_space.sh`](scripts/launch_hf_space.sh)). + +**Metrics dashboard (Trackio on Hugging Face):** [`rycerzes/trackio`](https://huggingface.co/spaces/rycerzes/trackio) — run name **`fswe-hcapo-pg-01-qwen36-27b`**. + +![Trackio dashboard: loss, epoch, learning rate, gradient norm, and global step for fswe-hcapo-pg-01-qwen36-27b](assets/training-trackio-dashboard.png) + +The screenshot above (smoothing ≈ 20 on the step axis) shows a **post-training** phase on the HCAPO dataset: + +- **Loss** decreases from roughly **1.0** at the start of the plotted window to about **0.75** by the end (**~25%** relative drop), with noisy raw traces but a clear downward trend in the smoothed curve. 
+- **Epoch** advances linearly to approximately **2.7** over the **18** logged steps, consistent with targeting **3 epochs** in a short run. +- **Learning rate** follows a **warmup then decay**: it rises toward a peak near the middle of the run (on the order of **3.5×10⁻⁶**) and falls toward roughly **1.5×10⁻⁶** by the final steps. +- **Gradient norm** stays in a moderate band (mostly about **1.0–1.5**, ending near **1.2**), which suggests optimization without obvious gradient blow-ups for this snapshot. +- **Global step** in the sidebar advances in line with the trainer (e.g. into the low tens over the same window). + +Together, these curves read as a **successful small-scale sanity fine-tune**: loss improves steadily, the LR schedule behaves as expected, and gradients remain bounded. + +## Repository layout + +- **`frontier_swe_env/`** — FastAPI app, [`FrontierSweEnvironment`](frontier_swe_env/server/frontier_swe_env_environment.py), shared rubrics, MCP tools, [`TaskConfig`](frontier_swe_env/task_config.py), task registry under [`frontier_swe_env/tasks/`](frontier_swe_env/tasks/), models, client. +- **`tasks//`** — Instructions, verifier scripts, rewards, and data **consumed at image build** (see [Task assets and runtime configuration](#task-assets-and-runtime-configuration)). +- **`docker/`** — Shared base image, per-task Dockerfiles, [`openenv_entrypoint.sh`](docker/openenv_entrypoint.sh) (uvicorn + optional pi models). +- **`spaces/`** — Thin HF Space wrappers: Dockerfile pin, README (HF card), `openenv.yaml` for external metadata. + +Each Space README under `spaces/*/README.md` is the human-facing description for that Hugging Face Space (including YAML front matter for the Space card). + +## Testing + +Task-specific verifiers and reward scripts live under `tasks//tests/`. There is no single top-level pytest suite yet; run task-local scripts as documented in each task directory when you change a verifier. 
+ +## About + +**frontier-swe-openenv** packages Frontier-style long-horizon tasks for [OpenEnv](https://github.com/rycerzes/OpenEnv), adapted from **[FrontierSWE](https://www.frontierswe.com/)** ([`proximal-labs/frontier-swe`](https://github.com/proximal-labs/frontier-swe)). Official benchmark task pages for the four environments here: [postgres-sqlite-wire-adapter](https://www.frontierswe.com/postgres-sqlite-wire-adapter), [libexpat-to-x86asm](https://www.frontierswe.com/libexpat-to-x86asm), [dependent-type-checker](https://www.frontierswe.com/dependent-type-checker), [notebook-compression](https://www.frontierswe.com/notebook-compression). + +The OpenEnv runtime dependency is pinned in [`pyproject.toml`](pyproject.toml) (`openenv-core` git source). diff --git a/assets/blog.md b/assets/blog.md new file mode 100644 index 0000000..94a8e5b --- /dev/null +++ b/assets/blog.md @@ -0,0 +1,98 @@ +# Building long-horizon SWE environments on Hugging Face: Frontier SWE × OpenEnv + +**By the-thing**: we packaged and adapted 4 [FrontierSWE](https://www.frontierswe.com/) tasks as [OpenEnv](https://github.com/rycerzes/OpenEnv)-shaped services, pushed them to **Hugging Face Spaces**, and ran an **offline RL-style** training loop with public **datasets**, **Trackio** metrics, and a trainer Space. + +--- + +## TL;DR + +- **Four Dockerized environments** (notebook compression, Postgres wire adapter on SQLite, dependent type checker, libexpat → x86-64 asm) with a **shared Gym-style API** and **MCP** tools for planning and submission. +- **Custom harness adapter** built on top of OpenEnv harness work ([meta-pytorch/OpenEnv PR #389](https://github.com/meta-pytorch/OpenEnv/pull/389) and RFC005), then forked and extended in [`rycerzes/OpenEnv` on `feature/pi-harness-adapter`](https://github.com/rycerzes/OpenEnv/commits/feature/pi-harness-adapter/). 
+- **Composite rubric**: gates → L1 (tests / `reward.json` / regex ratios) → optional LLM layers → **episode reward** you can log and filter on for training. +- **Offline pipeline**: trajectories on the Hub → hindsight scoring (SGLang) → HCAPO-style dataset → **LoRA fine-tune** on a GPU Space, with **Trackio** curves for loss, LR, and gradient norms. + +**Try it:** [frontier-swe-postgres](https://huggingface.co/spaces/rycerzes/frontier-swe-postgres) · [frontier-swe-notebook](https://huggingface.co/spaces/rycerzes/frontier-swe-notebook) · [frontier-swe-type-checker](https://huggingface.co/spaces/rycerzes/frontier-swe-type-checker) · [frontier-swe-libexpat-to-x86asm](https://huggingface.co/spaces/rycerzes/frontier-swe-libexpat-to-x86asm) · [source on GitHub](https://github.com/3xcaffeine/frontier-swe-openenv) + +--- + +## 1. Environment innovation - why this setup is hard (and worth it) + +Classic coding benchmarks often score a single patch. **Long-horizon software engineering** is different: the agent has to **plan**, **edit a real workspace**, **call tools**, and **submit** work over many steps, which is closer to how people ship systems than to a one-shot fix. + +**What we built on top of that idea** + +We did not reinvent the underlying FrontierSWE task specs; we **re-homed** them inside a **uniform environment contract**. + +That includes a **custom harness adapter** layer we built on top of [meta-pytorch/OpenEnv PR #389](https://github.com/meta-pytorch/OpenEnv/pull/389) and RFC005, then maintained and updated in our fork: [`rycerzes/OpenEnv` `feature/pi-harness-adapter`](https://github.com/rycerzes/OpenEnv/tree/feature/pi-harness-adapter/). + +| Piece | What it does for the agent | +| --- | --- | +| **HTTP control** | `reset` / `step` / `state` / `health` - same shape every task, so harnesses and demos do not fork per domain. 
It also keeps every task aligned with the `openenv` specs. | +| **MCP tools** | `submit_plan`, `submit_subtask`, `get_status`, `advance` - forces **explicit decomposition** and **scored subtasks**, not a single anonymous blob of edits. | +| **Multi-layer rubric** | **Gates** catch broken builds or missing artifacts early; **L1** is task-native (wire compat tests, notebook round-trips, type-checker scores, assembly benchmarks); **L2/L3** optionally add LLM code and plan review when grader env vars are set; **episode reward** blends plan quality, frozen subtask scores, completion, and tool usage. | + +That combination is deliberately **stressful** in a good way: the agent must **coordinate** (plan → execute → advance), **respect verifier reality** (hidden tests, anti-cheat), and **earn** a dense scalar at the end of an episode that can run on the order of **45–90+ minutes** per run, so the environment is **challenging**, **creative** in how it composes rubrics, and **meaningful** for measuring behavior beyond single-turn chat. + +--- + +## 2. The problem, the box, and what the agent actually does + +**Problem.** Training or evaluating agents on real long-horizon SWE needs a **repeatable service**: same ports, same instructions, same scoring, same tool surface, whether you run locally, in CI, or on the Hub. + +**Our box.** **frontier-swe-openenv** is a small monorepo: `tasks//` holds instructions and verifiers (what “correct” means operationally); `frontier_swe_env/` holds the **FastAPI** server, shared rubrics, and **TaskConfig** (how to invoke those verifiers inside the image); `spaces/` holds thin **Space** definitions synced from `main` after images build. + +**Agent behavior (easy to follow for a demo).** + +1. Connect (WebSocket client or baseline script). +2. `reset` → read observation / phase. +3. Loop: natural language or tool use → `step` → optional MCP calls to **submit a plan**, run **L1+L2** on a **subtask**, **advance** when satisfied. +4. 
Episode ends with a **terminal episode reward** and subtask history you can log. + +For a **concrete walkthrough without writing your own client**, the repo ships [`scripts/run_baseline.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/run_baseline.py): point it at `http://localhost:8000` with a task container running, and you get a full **reset → step** episode over the wire, which is good for recordings and “here is one turn of the loop” explanations. + +--- + +## 3. Observable training progress - rewards, curves + +Long episodes make **online** RL on the live env impractical at scale, so we invested in **offline** learning: **collect once**, **score offline**, **fine-tune**, **log everything**. + +**Public artifacts (HF-native story)** + +| Artifact | Link | Role in the demo | +| --- | --- | --- | +| Raw trajectories (pg-01, Qwen 3.6 27B) | [`rycerzes/fswe-pg-01-traj-q36-27b`](https://huggingface.co/datasets/rycerzes/fswe-pg-01-traj-q36-27b) | Shows **what** we logged per episode (`result.json`, sessions, logs, hindsight when present). | +| HCAPO training JSONL | [`rycerzes/fswe-hcapo-pg-01-trajectories`](https://huggingface.co/datasets/rycerzes/fswe-hcapo-pg-01-trajectories) | **Step-level advantages** paired with messages for supervised fine-tuning. | +| Trackio dashboard | [`rycerzes/trackio`](https://huggingface.co/spaces/rycerzes/trackio) | **Observable** loss, epoch, learning rate, gradient norm, global step. | + +On a **3 epoch / ~18 optimizer step** reference run (Space-backed trainer), the root README documents what we see in Trackio: **loss** trending down by roughly **25%** over the plotted window (smoothed), **epoch** progressing toward **~2.7**, **LR** warmup-then-decay, **gradient norms** staying in a moderate band, i.e. a **sanity fine-tune** where optimization looks stable, not a mystery box. 
+ +We also ship a **static dashboard figure** in-repo for slides and blog embeds: [`assets/training-trackio-dashboard.png`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/assets/training-trackio-dashboard.png). + +**Before / after.** The cleanest **before/after** we surface in tooling today is **training loss and optimization metrics** on the HCAPO dataset, plus **episode-level rewards inside collected trajectories** for analysis. A live **A/B rollout score** on the full Docker env after LoRA is the natural next chapter for the demo, and the pipeline is set up so you can **regenerate trajectories** with the adapted policy and compare distributions. For hackathon judging, the **curves + public datasets + reproducible launch script** are the evidence chain we stand behind *right now*. + +--- + +## 4. Reward logic and training pipeline - coherent signal end to end + +**Episode reward (macro).** The scalar \(R\) matches [`EpisodeRubric`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/frontier_swe_env/rubrics/episode_rubric.py): weighted **plan score**, mean **frozen subtask** scores, **completion**, and **tool density**, clipped into **[0, 1]** for filtering (e.g. `--min-reward 0.05` in the dataset builder). + +**L1 (micro, task-specific).** Each task implements its own verifier output: **regex ratio** on test totals (Postgres), **`reward_json`** fields (notebook), or **`reward_json_score`** with anchors (type checker, libexpat). Same server code paths; different physics. + +**Training path (why it should move policy behavior).** + +1. [`collect_trajectories.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/collect_trajectories.py) - rollouts into `trajectories/episode_NNN/`. +2. [`backfill_rewards.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/backfill_rewards.py) - repair missing `episode_reward` when needed. +3. 
[`compute_hindsight_scores.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/compute_hindsight_scores.py) - SGLang `/generate` with bounded logprob windows (memory-safe), MCP-aware **step → subtask** mapping, hindsight \(Q^H\) and smoothing. +4. [`build_hcapo_dataset.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/build_hcapo_dataset.py) - GRPO-style macro advantages + normalized hindsight micro advantages → **JSONL** with **per-step weights**. +5. [`train_hcapo.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/training/train_hcapo.py) + [`launch_hf_space.sh`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/launch_hf_space.sh) - **weighted CE on assistant tokens** (chunked forward for large models), Trackio reporting. + +Coherent design means that environment reward defines **which episodes matter**; hindsight defines **which tokens inside those episodes** get gradient; the trainer respects **assistant masks** and **step weights** so the update is not “one scalar smeared across the whole transcript.” Details and equations live in [`training/README.md`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/training/README.md). + +--- + +## Where to go next + +- **Run a Space** from the TL;DR links and narrate **one** subtask submission end to end. +- **Open Trackio** to the named run and zoom the **loss / LR** panel while you talk through the pipeline slide. +- **Clone the repo**, `uv sync`, and use **`./scripts/launch_hf_space.sh`** when you want the full HF training path on your own account. 
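To illustrate the step-weighted objective described in section 4, the sketch below computes a weighted negative log-likelihood over assistant tokens only, with each token inheriting the weight of the step it belongs to. It is a pure-Python illustration under stated assumptions, not the code in `train_hcapo.py`: the function name, the argument layout, and the normalisation by the number of assistant tokens are all hypothetical.

```python
from typing import Mapping, Sequence


def weighted_assistant_nll(
    token_nll: Sequence[float],
    assistant_mask: Sequence[bool],
    step_ids: Sequence[int],
    step_weights: Mapping[int, float],
) -> float:
    """Average per-token NLL over assistant tokens, scaled by each token's step weight."""
    if not (len(token_nll) == len(assistant_mask) == len(step_ids)):
        raise ValueError("per-token sequences must have the same length")
    total = 0.0
    count = 0
    for nll, is_assistant, step in zip(token_nll, assistant_mask, step_ids):
        if not is_assistant:
            continue  # user and tool tokens receive no gradient signal
        total += step_weights[step] * nll
        count += 1
    # Normalising by assistant-token count is an assumption of this sketch.
    return total / count if count else 0.0
```

Because the weight is looked up per step rather than per episode, two steps in the same transcript can pull the loss in different directions, which is the property the paragraph on coherent design relies on.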
+ diff --git a/assets/training-trackio-dashboard.png b/assets/training-trackio-dashboard.png new file mode 100644 index 0000000..f62961c Binary files /dev/null and b/assets/training-trackio-dashboard.png differ diff --git a/spaces/libexpat-to-x86asm/README.md b/spaces/libexpat-to-x86asm/README.md index a13bd4e..f249168 100644 --- a/spaces/libexpat-to-x86asm/README.md +++ b/spaces/libexpat-to-x86asm/README.md @@ -10,11 +10,82 @@ pinned: false # Frontier SWE — libexpat to x86-64 Assembly -OpenEnv-shaped FastAPI service hosting the libexpat-to-x86asm task. +OpenEnv-shaped **FastAPI** service for the **libexpat-to-x86asm** task: reimplement **libexpat 2.6.4** in **x86-64 assembly**, producing `/app/asm-port/libexpat.so` with the **expat C ABI**. The verifier compares against reference C libexpat, runs upstream tests and benchmarks, and writes `/logs/verifier/reward.json` (correctness and performance blend; hard fail to `0.0` on anti-cheat or missing `.so`). -- Source repo: -- Container image: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest` -- Health: `/health` -- MCP JSON-RPC: `/mcp` +## The task in depth -Deployed automatically from `main` via the `sync-hf-spaces` workflow. +The agent’s deliverable is a **shared library** built from **`.s` / `.asm`** sources under **`/app/asm-port/`**, exporting symbols such as **`XML_ParserCreate`** so the upstream **expat** test suite can link against it. There is **no C compiler** in the agent environment; the verifier may compile reference C code for comparison. Scoring combines **weighted test pass rates** with **benchmark timing ratios** (reference time vs agent time) into a single **`score`** in **`reward.json`**, with explicit anti-cheat checks (no `dlopen` of system libexpat, no smuggled C core files, etc.). The server treats that file in **`reward_json_score`** mode with anchors **`(0.0, 1.0)`**. 
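To make the anchor mapping concrete, here is a minimal sketch of a `reward_json_score`-style normalisation: read a numeric `score` from the verifier's JSON and map it linearly between two anchors, clamping the result to the unit interval. The helper names, the clamping, and the treatment of a missing, unreadable, or non-numeric field as `0.0` are assumptions for illustration; the actual behaviour is defined in `TestOutputRubric`.

```python
import json
from pathlib import Path
from typing import Tuple


def normalise_score(
    reward: dict,
    anchors: Tuple[float, float] = (0.0, 1.0),
    field: str = "score",
) -> float:
    """Map reward[field] linearly from [low, high] onto [0, 1], clamped."""
    value = reward.get(field)
    # Reject booleans explicitly: bool is a subclass of int in Python.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return 0.0
    low, high = anchors
    if high == low:
        return 0.0  # degenerate anchors: no meaningful scale
    scaled = (value - low) / (high - low)
    return min(1.0, max(0.0, scaled))


def read_l1_score(path: str, anchors: Tuple[float, float] = (0.0, 1.0)) -> float:
    """Load a reward JSON file and normalise its score; unreadable files score 0.0."""
    try:
        reward = json.loads(Path(path).read_text())
    except (OSError, json.JSONDecodeError):
        return 0.0
    if not isinstance(reward, dict):
        return 0.0
    return normalise_score(reward, anchors)
```

With anchors `(0.0, 1.0)` the map is the identity on values already in that range, so the verifier's `score` passes through unchanged; wider anchors would rescale a raw metric into the same interval.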
+ +## How this maps to the monorepo + +- **`tasks/libexpat-to-x86asm/`** — Instructions, encrypted or staged toolchain bundles as designed, **`tests/`** with **`test.sh`**, **`compute_reward.py`**, and benchmark XML generators. +- **`frontier_swe_env/tasks/libexpat_to_x86asm.py`** — **`TaskConfig`**: workspace **`/app/asm-port`**, gate script, verifier command, JSON path and anchors, CPU/memory hints, and judge context strings. +- **`spaces/libexpat-to-x86asm/`** — This Space and manifest. + +See [**Task assets and runtime configuration**](https://github.com/3xcaffeine/frontier-swe-openenv#task-assets-and-runtime-configuration) in the root README. + +## Features + +- **Assembly port workspace**: `/app/asm-port` with staged toolchain and bundles (see gate checks in manifest). +- **Structured L1**: Normalised score from `reward.json`; gates for writable workspace, headers, `nasm` / `as` / `ld`, and staged artifacts. +- **LLM rubric layers**: L2 code review and L3 plan review when grader env vars are set. +- **MCP tools**: `submit_plan`, `submit_subtask`, `get_status`, `advance`. + +## HTTP API + +| Endpoint | Notes | +| --- | --- | +| `GET /health` | Liveness. | +| `POST /reset`, `POST /step`, `GET /state` | OpenEnv Gym-style control. | +| `POST /mcp` | OpenEnv JSON-RPC MCP. | +| `/tools/mcp` | FastMCP Streamable HTTP. | + +## Quick start (Docker) + +```bash +docker run --rm -p 8000:8000 \ + ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest +``` + +This task is CPU- and memory-sensitive; the manifest requests **4 CPUs** and **8192 MiB** where the platform allows. + +```bash +docker run --rm -p 8000:8000 \ + -e FSWE_GRADER_MODEL=... \ + -e FSWE_GRADER_API_URL=... \ + -e FSWE_GRADER_API_KEY=... 
\ + ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest +``` + +## Python client (host) + +```python +import asyncio +from frontier_swe_env.client import FrontierSweEnv +from frontier_swe_env.models import FrontierSweAction + + +async def main(): + client = FrontierSweEnv(base_url="http://localhost:8000") + await client.connect() + try: + await client.reset() + await client.step(FrontierSweAction(message="Continue the assembly port.")) + finally: + await client.close() + + +asyncio.run(main()) +``` + +## Task manifest + +[`openenv.yaml`](openenv.yaml) — episode timeout, L1 timeout, reward field anchors, rubric layers, metrics. Task sources: `tasks/libexpat-to-x86asm/`. + +## Deployment + +- **Image**: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest` +- **Source**: [3xcaffeine/frontier-swe-openenv](https://github.com/3xcaffeine/frontier-swe-openenv) +- **Sync**: HF Space updated from `main` after successful GHCR build. + +Benchmark context: [FrontierSWE — libexpat to x86-64 assembly](https://www.frontierswe.com/libexpat-to-x86asm). diff --git a/spaces/notebook/README.md b/spaces/notebook/README.md index 9979b1f..2256687 100644 --- a/spaces/notebook/README.md +++ b/spaces/notebook/README.md @@ -10,12 +10,84 @@ pinned: false # Frontier SWE — Notebook Compression -OpenEnv-shaped FastAPI service hosting the notebook-compression task. +OpenEnv-shaped **FastAPI** service for the **notebook-compression** task: build a fit / compress / decompress pipeline for Jupyter notebooks inside a Linux workspace, with multi-layer rubric scoring and a structured `reward.json` written by the verifier. -- Source repo: -- Container image: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest` -- Health: `/health` -- MCP JSON-RPC: `/mcp` +## The task in depth -Deployed automatically from `main` via the `sync-hf-spaces` workflow. 
+The agent needs to ship an executable **`/app/run`** with three subcommands: **`fit`** (train or build artifacts from a **visible** corpus only), **`compress`**, and **`decompress`**. At scoring time the agent does not see the hidden corpus: the verifier checks **byte-for-byte** recovery of every notebook file. Compression quality is summarised as a geometric mean of size ratios; hard failures (round-trip mismatch, crashes, invalid `reward.json` status) collapse the L1 signal to zero. That logic lives in the repo under [`tasks/notebook-compression/tests/`](https://github.com/3xcaffeine/frontier-swe-openenv/tree/main/tasks/notebook-compression/tests) (shell driver plus [`compute_reward.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/tasks/notebook-compression/tests/compute_reward.py)), which writes **`/logs/verifier/reward.json`** for the server to read. +## How this maps to the monorepo + +- **`tasks/notebook-compression/`** — Authoritative instructions, verifier, and reward computation; copied into the image (for example **`/opt/verifier/test.sh`** and data mounts). +- **`frontier_swe_env/tasks/notebook_compression.py`** — Registers **`TaskConfig`** with `l1_score_mode="reward_json"`, the container test command, long L1 timeouts, gate path, and prose for L2/L3 judges. The running server selects it when `FSWE_TASK_NAME` is `notebook` or `notebook-compression` (see [`__init__.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/frontier_swe_env/tasks/__init__.py)). +- **`spaces/notebook/`** — This Space: thin Dockerfile, this README, and **`openenv.yaml`** describing the same episode for Hugging Face and external tooling. + +For the full picture of how task directories and Python configs interact, see the root README section [**Task assets and runtime configuration**](https://github.com/3xcaffeine/frontier-swe-openenv#task-assets-and-runtime-configuration). 
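Returning to the scoring rule described under "The task in depth", the sketch below illustrates a geometric mean of size ratios with a hard fail on any round-trip mismatch. It is an illustration only: `zlib` stands in for the agent's compressor, the ratio is taken as compressed size over original size (the direction is an assumption), and `evaluate` is a hypothetical name rather than anything in `compute_reward.py`.

```python
import math
import zlib


def geometric_mean(ratios):
    """Geometric mean via logs; requires strictly positive ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def evaluate(files):
    """Return (ok, geo_mean_ratio) for byte strings, using zlib as a stand-in.

    ok is False on any round-trip mismatch, mirroring the hard-fail rule.
    """
    ratios = []
    for original in files:
        packed = zlib.compress(original, 9)
        if zlib.decompress(packed) != original:
            return False, 0.0
        # Assumed direction: smaller is better (compressed / original).
        ratios.append(len(packed) / len(original))
    return True, geometric_mean(ratios)


ok, ratio = evaluate([b'{"cells": []}' * 200, b'{"nbformat": 4}' * 300])
```

A geometric mean keeps one very compressible notebook from dominating the summary, which is a common reason to prefer it over an arithmetic mean of ratios.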
+ +## Features + +- **Long-horizon SWE**: Plan subtasks, edit code under the configured workspace, submit for scoring. +- **Composite rubric**: Shell gate checks → structured L1 from `/logs/verifier/reward.json` → optional LLM code review (L2) and plan review (L3) → weighted episode reward. +- **MCP tools**: `submit_plan`, `submit_subtask`, `get_status`, `advance` (same contract as other Frontier SWE Spaces). +- **Dual MCP transports**: OpenEnv `POST /mcp` and Streamable HTTP `/tools/mcp` for adapters. + +## HTTP API + +| Endpoint | Notes | +| --- | --- | +| `GET /health` | Liveness for orchestration and HF health checks. | +| `POST /reset`, `POST /step`, `GET /state` | OpenEnv Gym-style control. | +| `POST /mcp` | OpenEnv JSON-RPC MCP. | +| `/tools/mcp` | FastMCP Streamable HTTP (POST + GET/SSE). | + +## Quick start (Docker) + +```bash +docker run --rm -p 8000:8000 \ + ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest +``` + +Optional grader configuration for LLM rubric layers: + +```bash +docker run --rm -p 8000:8000 \ + -e FSWE_GRADER_MODEL=... \ + -e FSWE_GRADER_API_URL=... \ + -e FSWE_GRADER_API_KEY=... \ + ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest +``` + +## Python client (host) + +From the [source repository](https://github.com/3xcaffeine/frontier-swe-openenv), with dependencies installed: + +```python +import asyncio +from frontier_swe_env.client import FrontierSweEnv +from frontier_swe_env.models import FrontierSweAction + + +async def main(): + client = FrontierSweEnv(base_url="http://localhost:8000") + await client.connect() + try: + await client.reset() + await client.step(FrontierSweAction(message="Continue the task.")) + finally: + await client.close() + + +asyncio.run(main()) +``` + +## Task manifest + +OpenEnv metadata for judges and tooling: [`openenv.yaml`](openenv.yaml) in this Space (mirrors `spaces/notebook/openenv.yaml` in the GitHub repo). Task sources: `tasks/notebook-compression/`. 
+ +## Deployment + +- **Image**: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest` +- **Source**: [3xcaffeine/frontier-swe-openenv](https://github.com/3xcaffeine/frontier-swe-openenv) +- **Sync**: Pushed from `main` by the repository’s HF Spaces sync workflow after GHCR builds succeed. + +Benchmark context: [FrontierSWE — Notebook compression](https://www.frontierswe.com/notebook-compression). diff --git a/spaces/postgres/README.md b/spaces/postgres/README.md index 944cbad..89705fb 100644 --- a/spaces/postgres/README.md +++ b/spaces/postgres/README.md @@ -8,13 +8,88 @@ app_port: 8000 pinned: false --- -# Frontier SWE — Postgres SQLite Wire Adapter +# Frontier SWE — Postgres / SQLite Wire Adapter -OpenEnv-shaped FastAPI service hosting the postgres-sqlite-wire-adapter task. +OpenEnv-shaped **FastAPI** service for the **postgres-sqlite-wire-adapter** task: implement a PostgreSQL wire-protocol-compatible server in **Zig** backed by **SQLite**, with gate checks, a graded test runner, and composite rubric scoring. -- Source repo: -- Container image: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-postgres:latest` -- Health: `/health` -- MCP JSON-RPC: `/mcp` +## The task in depth -Deployed automatically from `main` via the `sync-hf-spaces` workflow. +The workspace is **`/app/postgres-sqlite`**. The agent grows a Zig project that mimics enough **`postgres` / `pg_ctl` / `initdb`** behaviour and the **Frontend/Backend protocol** so that real PostgreSQL clients can connect and run a large scripted compatibility matrix. **L1** is driven by a visible test script whose stdout looks like **`Total: N/M passed`**; the shared rubric parses that as a pass ratio (see `l1_score_mode="ratio"`). Hidden or stronger checks can live alongside the same task pack under [`tasks/postgres-sqlite-wire-adapter/tests/`](https://github.com/3xcaffeine/frontier-swe-openenv/tree/main/tasks/postgres-sqlite-wire-adapter/tests). 
Unlike the JSON-heavy tasks, there is no requirement for `reward.json` unless you extend the verifier that way. + +## How this maps to the monorepo + +- **`tasks/postgres-sqlite-wire-adapter/`** — Stubs, instructions, **`pg_compat_test.sh`**, smoke tests, and hidden verifier assets copied into the image. +- **`frontier_swe_env/tasks/pg.py`** — **`TaskConfig`** for this task: Zig workspace path, **`bash /app/gate_checks.sh`**, **`PG_PORT=55432 bash /app/pg_compat_test.sh`** as the L1 command, regex pattern for totals, timeouts, and judge-facing descriptions. +- **`spaces/postgres/`** — Space wrapper and **`openenv.yaml`** aligned with the same episode. + +More detail: [**Task assets and runtime configuration**](https://github.com/3xcaffeine/frontier-swe-openenv#task-assets-and-runtime-configuration) in the root README. + +## Features + +- **Systems programming focus**: Zig workspace under `/app/postgres-sqlite`, verifier and hidden tests shipped in the image. +- **L1 scoring**: Regex ratio over test runner output (`Total: N/M passed`) plus gate script. +- **LLM-assisted layers**: L2 code review and L3 plan review when grader env vars are set. +- **MCP tools**: `submit_plan`, `submit_subtask`, `get_status`, `advance`. + +## HTTP API + +| Endpoint | Notes | +| --- | --- | +| `GET /health` | Liveness. | +| `POST /reset`, `POST /step`, `GET /state` | OpenEnv Gym-style control. | +| `POST /mcp` | OpenEnv JSON-RPC MCP. | +| `/tools/mcp` | FastMCP Streamable HTTP. | + +## Quick start (Docker) + +```bash +docker run --rm -p 8000:8000 \ + ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-postgres:latest +``` + +With grader API for full rubric: + +```bash +docker run --rm -p 8000:8000 \ + -e FSWE_GRADER_MODEL=... \ + -e FSWE_GRADER_API_URL=... \ + -e FSWE_GRADER_API_KEY=... 
\ + ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-postgres:latest +``` + +## Baseline script + +The repo ships [`scripts/run_baseline.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/run_baseline.py) for a full WebSocket episode against a running container (defaults to `http://localhost:8000`). + +## Python client (host) + +```python +import asyncio +from frontier_swe_env.client import FrontierSweEnv +from frontier_swe_env.models import FrontierSweAction + + +async def main(): + client = FrontierSweEnv(base_url="http://localhost:8000") + await client.connect() + try: + await client.reset() + await client.step(FrontierSweAction(message="Implement the next milestone.")) + finally: + await client.close() + + +asyncio.run(main()) +``` + +## Task manifest + +[`openenv.yaml`](openenv.yaml) — workspace, timeouts, rubric layers, and metrics. Task sources: `tasks/postgres-sqlite-wire-adapter/`. + +## Deployment + +- **Image**: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-postgres:latest` +- **Source**: [3xcaffeine/frontier-swe-openenv](https://github.com/3xcaffeine/frontier-swe-openenv) +- **Sync**: HF Space payload is assembled from this directory on `main` after GHCR builds. + +Benchmark context: [FrontierSWE — PostgreSQL on SQLite](https://www.frontierswe.com/postgres-sqlite-wire-adapter). diff --git a/spaces/type-checker/README.md b/spaces/type-checker/README.md index 2e228d6..7c86439 100644 --- a/spaces/type-checker/README.md +++ b/spaces/type-checker/README.md @@ -10,11 +10,84 @@ pinned: false # Frontier SWE — Dependent Type Checker -OpenEnv-shaped FastAPI service hosting the dependent-type-checker task. +OpenEnv-shaped **FastAPI** service for the **dependent-type-checker** task: implement a Martin-Löf-style dependently typed language **type checker** in **Rust** (`cargo build --release`), scored on correctness gates and speedup versus a reference implementation via `/logs/verifier/reward.json`. 
-- Source repo: -- Container image: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-dependent-type-checker:latest` -- Health: `/health` -- MCP JSON-RPC: `/mcp` +## The task in depth -Deployed automatically from `main` via the `sync-hf-spaces` workflow. +The agent edits **`/app/type-checker/`** (Cargo project) and must produce a release binary that type-checks `.sexp` programs for a language with dependent functions, inductive families, cumulativity, and related features spelled out in **`instruction.md`**. The verifier (**`bash /opt/verifier/test.sh`**) enforces anti-cheat rules, checks accept/reject corpus rates, then measures speedups vs a reference implementation on fixed workloads. It writes **`/logs/verifier/reward.json`** with a numeric **`score`** and optional **`additional_data.reason`** on hard fail. Python config uses **`l1_score_mode="reward_json_score"`** with anchors **`(0.0, 2.0)`** so the server normalises that scalar into the shared \([0,1]\) L1 channel. + +## How this maps to the monorepo + +- **`tasks/dependent-type-checker/`** — Full formal spec, corpora, reference implementation pieces, and verifier scripts under **`tests/`**. +- **`frontier_swe_env/tasks/dependent_type_checker.py`** — Registers **`TaskConfig`** (`dependent-type-checker` / alias `type-checker`), build command, verifier timeout, JSON field names, and training vs demo instruction loading (demo can pull [`instruction.md`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/tasks/dependent-type-checker/instruction.md) from the repo when present on the host). +- **`spaces/type-checker/`** — This Space; GHCR image name uses **`frontier-swe-dependent-type-checker`**. + +Architecture overview: [**Task assets and runtime configuration**](https://github.com/3xcaffeine/frontier-swe-openenv#task-assets-and-runtime-configuration). + +## Features + +- **Rust workspace**: `/app/type-checker` with release binary expected by the verifier. 
+- **Structured L1**: Score from `reward.json` (normalised with configured anchors, hard-fail signals documented in manifest). +- **Gate checks**: Workspace, `Cargo.toml`, toolchain, and successful release build. +- **MCP tools**: `submit_plan`, `submit_subtask`, `get_status`, `advance`. + +## HTTP API + +| Endpoint | Notes | +| --- | --- | +| `GET /health` | Liveness. | +| `POST /reset`, `POST /step`, `GET /state` | OpenEnv Gym-style control. | +| `POST /mcp` | OpenEnv JSON-RPC MCP. | +| `/tools/mcp` | FastMCP Streamable HTTP. | + +## Quick start (Docker) + +The GHCR image name uses `dependent-type-checker` (the workflow task id), while this Hugging Face Space repo id uses `type-checker`. + +```bash +docker run --rm -p 8000:8000 \ + ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-dependent-type-checker:latest +``` + +With grader API: + +```bash +docker run --rm -p 8000:8000 \ + -e FSWE_GRADER_MODEL=... \ + -e FSWE_GRADER_API_URL=... \ + -e FSWE_GRADER_API_KEY=... \ + ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-dependent-type-checker:latest +``` + +## Python client (host) + +```python +import asyncio +from frontier_swe_env.client import FrontierSweEnv +from frontier_swe_env.models import FrontierSweAction + + +async def main(): + client = FrontierSweEnv(base_url="http://localhost:8000") + await client.connect() + try: + await client.reset() + await client.step(FrontierSweAction(message="Work on the type checker.")) + finally: + await client.close() + + +asyncio.run(main()) +``` + +## Task manifest + +[`openenv.yaml`](openenv.yaml) — build command, L1 timeouts, reward anchors, rubric. Task sources: `tasks/dependent-type-checker/`. + +## Deployment + +- **Image**: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-dependent-type-checker:latest` +- **Source**: [3xcaffeine/frontier-swe-openenv](https://github.com/3xcaffeine/frontier-swe-openenv) +- **Sync**: Deployed from `main` via the repository HF Spaces workflow. 
+ +Benchmark context: [FrontierSWE — Dependent type checker](https://www.frontierswe.com/dependent-type-checker). diff --git a/training/README.md b/training/README.md new file mode 100644 index 0000000..5572ba8 --- /dev/null +++ b/training/README.md @@ -0,0 +1,453 @@ +# HCAPO training pipeline + +This document describes the **HCAPO-inspired** training flow used for Frontier SWE trajectory fine-tuning: how **episode rewards** are defined, how **hindsight** scores become **step advantages**, what the **training dataset** contains, and what **training / runtime** adjustments were made for **Qwen** models and **Hugging Face GPU** Spaces. + +For a short end-to-end recipe (datasets on the Hub, Trackio, launch commands), see the **Training** section in the [root README](../README.md). + +--- + +## Design rationale + +### Why not online RL (e.g. GRPO on the live environment)? + +Episodes often last on the order of **45–90+ minutes**. Online methods that need **many fresh rollouts per policy update** are **impractical**: orchestration, verifier time, and failures dominate before the optimiser sees enough data. We **collect trajectories once**, score them **offline**, build a **static** dataset, then fine-tune. + +### Why not plain DPO or scalar reward-weighted SFT? + +- **DPO** wants preference-style contrasts; our logs are **single** multi-turn trajectories with tools, not natural pairs per step. +- **Scalar reward-weighted SFT** applies **one weight per episode** and does not say **which assistant turns** helped. **HCAPO-style** credit assigns **macro** (trajectory) and **micro** (hindsight) signals per step. + +### Relation to the [HCAPO paper](https://arxiv.org/abs/2603.08754) (2603.08754) + +There is **no official end-to-end** public repo for the full paper stack (ALFWorld + WebShop + Search QA + multi-GPU online GRPO + generative verification). 
**Appendix B** of the [HTML version](https://arxiv.org/html/2603.08754v1) is essentially runnable pseudocode (rollouts, \(\pi_{\text{hind}}\), \(\rho_t\), composite advantage, PPO-style update). Helpful forks: [Awesome-GRPO](https://github.com/GITrans/Awesome-GRPO), [direct-preference-optimization](https://github.com/eric-mitchell/direct-preference-optimization) (PPO/GRPO helpers). + +| Paper (conceptual) | This repo | +| --- | --- | +| Online GRPO-style RL | **Offline** pipeline: [`collect_trajectories.py`](../scripts/collect_trajectories.py) → hindsight → [`build_hcapo_dataset.py`](../scripts/build_hcapo_dataset.py) → [`train_hcapo.py`](train_hcapo.py) | +| Terminal reward emphasis | **Dense** `plan_score` + `frozen_scores` in prompts and in \(Q^H\) when dense mode is on ([`compute_hindsight_scores.py`](../scripts/compute_hindsight_scores.py)) | +| Generic step alignment | **MCP tool boundaries**: [`map_steps_to_subtasks()`](../scripts/compute_hindsight_scores.py) unwraps outer `mcp` calls, parses `submit_plan` / `advance`, assigns **phase** and **subtask_id** | +| PPO-clipped policy gradient | **Step-weighted SFT**: combined advantages → JSONL → weighted CE in `HCAPOTrainer` | +| Generic logprob API | **SGLang** native `/generate`, `logprob_start_len`, bounded action scoring, retries ([`score_step_logprobs()`](../scripts/compute_hindsight_scores.py)) | + +--- + +## Pipeline overview + +1. **Collect trajectories** — [`scripts/collect_trajectories.py`](../scripts/collect_trajectories.py). Each `trajectories/episode_NNN/` holds `result.json`, `pi_session.jsonl`, logs, and later `hindsight_scores.json`. + +2. **Backfill or read episode reward** — `result.json` stores final reward and subtask scores. If an episode does not reach `DONE`, [`scripts/backfill_rewards.py`](../scripts/backfill_rewards.py) (and collection-time logic in `collect_trajectories.py`) can fill **`episode_reward`** from captured state. + +3. 
**Compute hindsight scores** — [`scripts/compute_hindsight_scores.py`](../scripts/compute_hindsight_scores.py) calls SGLang’s native **`/generate`** (via `httpx`) to score original assistant actions under hindsight context; writes **`hindsight_scores.json`**. + +4. **Build and train** — [`scripts/build_hcapo_dataset.py`](../scripts/build_hcapo_dataset.py) merges trajectory-level advantages with step-level hindsight and writes `datasets/hcapo_train.jsonl`. [`train_hcapo.py`](train_hcapo.py) runs weighted SFT (Unsloth + TRL). [`launch_hf_space.sh`](../scripts/launch_hf_space.sh) wraps HF Space / dataset upload flows. + +--- + +## Episode reward + +The scalar **\(R\)** stored in trajectories and used by the dataset builder matches the **episode rubric** in code ([`EpisodeRubric.compute`](../frontier_swe_env/rubrics/episode_rubric.py)): + +```text +R = plan_weight * plan_score + + subtask_weight * subtask_mean + + completion_weight * completion + + tool_weight * tool_density +``` + +With default weights (`TaskConfig`): **0.25 / 0.60 / 0.10 / 0.05**: + +```text +plan_count = max(len(plan), 1) +subtask_mean = mean(frozen subtask scores, padded with 0.0 to plan_count) +completion = min(number_of_frozen_scores / plan_count, 1.0) +tool_density = min(tool_call_count / (5 * plan_count), 1.0) +``` + +**\(R\)** is treated as lying in **[0, 1]** for reporting (and filtering with `--min-reward`). + +Planning-only episodes can still get a small **\(R\)** via **`tool_density`**. Under **dense** hindsight scoring, steps often still carry **\(r_t = 0\)** until there is a nonzero **`plan_score`** or **`frozen_scores[subtask_id]`**, so they contribute little after advantage clipping. + +--- + +## Step-to-subtask mapping + +[`map_steps_to_subtasks()`](../scripts/compute_hindsight_scores.py) assigns each **assistant** message: + +- **Planning** — until a **`submit_plan`** tool call succeeds (JSON tool response, no error prefix). 
+- **Executing** — after a successful plan; **`advance`** (on success) moves the current subtask index. + +Per-step metadata includes: + +```json +{ + "phase": "executing", + "subtask_id": "S2", + "subtask_reward": 0.13 +} +``` + +**`subtask_reward`** is **`plan_score`** in planning, else **`frozen_scores[subtask_id]`** in executing. + +**Outer `mcp` wrapper:** Pi/OpenEnv may emit tool calls under an outer function name `mcp` with nested JSON naming the real tool (e.g. `openenv_submit_plan`). [`_extract_effective_tool_names()`](../scripts/compute_hindsight_scores.py) unwraps that so transitions key off **`submit_plan`**, **`advance`**, etc. + +--- + +## Hindsight prompt + +For each assistant action, the scorer appends a block (see `HINDSIGHT_TEMPLATE` in [`compute_hindsight_scores.py`](../scripts/compute_hindsight_scores.py)) including: + +```text +Final reward +Phase reached +Plan score +Subtask scores (summary) +Subtasks completed / plan count +Current subtask +Current subtask score +``` + +That text is **post-hoc** (not visible during the original rollout). The scoring model then receives a forward request whose labels are used only to read **input-token logprobs** for the **original** assistant tokens. + +--- + +## Hindsight scoring via SGLang (`/generate`) + +The script uses SGLang’s native **`POST .../generate`** with **`httpx.AsyncClient`**, not the OpenAI-compatible chat-completions path with `echo` + `logprobs` on the **full** prompt (which can force huge logits tensors and **OOM the server**). + +Payload highlights: + +```text +return_logprob = true +logprob_start_len = prefix_len + skipped_action_tokens +``` + +Here **`skipped_action_tokens`** trims the start of the **action** so only the last **`min(action_len, max_logprob_tokens)`** action tokens are scored—reducing work from roughly **`seq_len × vocab`** to **`max_logprob_tokens × vocab`** for the logprob slice. 
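The window arithmetic above can be sketched as follows. Only `return_logprob` and `logprob_start_len` come from the description in this section; the function names and the remaining payload fields (`input_ids`, `sampling_params`) are assumptions about how a request might be assembled, not a copy of `score_step_logprobs()`.

```python
def logprob_window(prefix_len: int, action_len: int, max_logprob_tokens: int):
    """Return (logprob_start_len, scored_tokens) for one assistant action."""
    scored = min(action_len, max_logprob_tokens)
    # Trim the start of the action so only its last `scored` tokens are scored.
    skipped_action_tokens = action_len - scored
    return prefix_len + skipped_action_tokens, scored


def build_payload(input_ids, prefix_len, max_logprob_tokens=2048):
    """Assemble a /generate-style request body (field layout is an assumption)."""
    action_len = len(input_ids) - prefix_len
    start, _ = logprob_window(prefix_len, action_len, max_logprob_tokens)
    return {
        "input_ids": input_ids,
        "sampling_params": {"max_new_tokens": 1},  # assumed: scoring only, minimal decode
        "return_logprob": True,
        "logprob_start_len": start,
    }


payload = build_payload(list(range(10)), prefix_len=4, max_logprob_tokens=3)
```

For a 1000-token prefix and a 5000-token action with the default cap of 2048, the window starts at token 3952, so the server computes logprobs for 2048 positions instead of 6000.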
+ +**CLI defaults** (see argparse in [`compute_hindsight_scores.py`](../scripts/compute_hindsight_scores.py)): + +```text +--concurrency 1 +--max-context 32768 +--max-logprob-tokens 2048 # increase (e.g. 4096) for longer actions if the server allows +--batch-size 4 +``` + +**Retries:** exponential backoff on 500 / 502 / 503 / 504 / 204 and OOM-like error strings (`_MAX_RETRIES`, `_RETRY_BASE_DELAY`). + +--- + +## Hindsight scoring formulae + +Let **`mean_logprob_t`** be the mean log-probability of the **scored** action token suffix under the hindsight-augmented prefix. + +```text +pi_hind_t = exp(mean_logprob_t / T_temp) # default T_temp = 5.0 +pi_mean = mean_t(pi_hind_t) +rho_raw_t = pi_hind_t / pi_mean +rho_t = clip(rho_raw_t, c_min, c_max) # defaults 0.8, 1.2 +``` + +**Dense rewards (default):** + +```text +Q_H_t = rho_t * gamma^(group_end(t) - t) * r_t +``` + +- **`r_t`**: dense step reward (`subtask_reward` above). +- **`group_end(t)`**: last step index in the same **subtask id** (or planning phase bucket). + +**Terminal fallback** (`--no-dense-rewards`): + +```text +Q_H_t = rho_t * gamma^(T - 1 - t) * R +``` + +**Temporal smoothing** (`--alpha`, default `0.5`): + +```text +Q_smooth_(T-1) = Q_H_(T-1) +Q_smooth_t = alpha * Q_H_t + (1 - alpha) * Q_smooth_(t+1) # backward pass +``` + +[`build_hcapo_dataset.py`](../scripts/build_hcapo_dataset.py) uses **`q_h_smoothed`** unless **`--no-smooth`**. + +--- + +## HCAPO advantage construction + +Episodes must pass **`--min-reward`** and contain **`hindsight_scores.json`**. + +### Trajectory (macro) advantage + +```text +A_grpo_i = (R_i - mean(R)) / std(R) +``` + +If **`std(R) == 0`**, the code uses **`1.0`** instead ([`compute_grpo_advantages()`](../scripts/build_hcapo_dataset.py)). 
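The hindsight formulae and the macro advantage above can be sketched in plain Python. This is a minimal illustration under stated assumptions: `gamma` is passed explicitly because no default is given here, the population standard deviation is assumed for `std(R)`, and the function names are hypothetical rather than the ones in the scripts.

```python
import math
import statistics


def hindsight_q(mean_logprobs, rewards, group_end, gamma,
                t_temp=5.0, c_min=0.8, c_max=1.2, alpha=0.5):
    """Dense-mode Q_H per step plus the backward-smoothed variant."""
    pi_hind = [math.exp(lp / t_temp) for lp in mean_logprobs]
    pi_mean = sum(pi_hind) / len(pi_hind)
    rho = [min(max(p / pi_mean, c_min), c_max) for p in pi_hind]
    q_h = [rho[t] * gamma ** (group_end[t] - t) * rewards[t]
           for t in range(len(rho))]
    # Backward pass: the last step keeps Q_H, earlier steps blend with the future.
    q_smooth = q_h[:]
    for t in range(len(q_h) - 2, -1, -1):
        q_smooth[t] = alpha * q_h[t] + (1 - alpha) * q_smooth[t + 1]
    return q_h, q_smooth


def grpo_advantages(rewards):
    """(R_i - mean) / std, falling back to std = 1.0 when all rewards match."""
    mean_r = statistics.fmean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # population std is an assumption
    return [(r - mean_r) / std_r for r in rewards]


q_h, q_smooth = hindsight_q([0.0, -50.0], [1.0, 1.0], [1, 1], gamma=0.5)
```

In the example call the two steps have very different hindsight likelihoods, so both ratios hit the clip bounds (1.2 and 0.8) before discounting and smoothing are applied.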
+ +### Hindsight (micro) advantage + +Over **all kept steps** in the batch: + +```text +mu_h = mean(q_h_smoothed_t) +sigma_h = std(q_h_smoothed_t) +A_micro_t = (q_h_smoothed_t - mu_h) / sigma_h +``` + +**Do-no-harm:** if **`A_grpo_i > 0`**, then **`A_micro_t ← max(A_micro_t, 0)`**. + +### Combined advantage and JSONL weights + +```text +A_hcapo_t = A_grpo_i + omega * A_micro_t # default omega = 1.0 +w_t_raw = max(A_hcapo_t, 0) +w_t = w_t_raw / mean(w_t_raw | w_t_raw > 0) +``` + +Rows where **all** **`w_t`** are zero are dropped. + +--- + +## Dataset format + +`datasets/hcapo_train.jsonl`: one JSON object per episode (example shape): + +```json +{ + "messages": [...], + "step_advantages": [1.23, 0.87, 1.45], + "step_message_indices": [1, 4, 7], + "_episode_id": 12, + "_reward": 0.4058, + "_grpo_advantage": 0.91, + "_num_steps": 67 +} +``` + +Example summary from a **pg-01** run (`hcapo_summary.json` after build): + +```text +total_episodes_loaded = 20 +episodes_in_dataset = 14 +total_steps = 1414 +nonzero_steps = 1391 +min_reward = 0.05 +omega = 1.0 +use_smoothed = true +``` + +(Exact counts depend on your local `trajectories/` and flags.) + +--- + +## Training loss + +**HCAPOTrainer** ([`train_hcapo.py`](train_hcapo.py)) applies **step-weighted** cross-entropy on **assistant** tokens only. Conceptually, for token position **`j`** belonging to assistant step **`t(j)`**, with **`mask_j = 1`** on supervised assistant tokens and **`0`** elsewhere: + +```text +CE_j = cross_entropy(logits_j, label_j) +weighted_loss = sum_j(w_t(j) * mask_j * CE_j) / sum_j(w_t(j) * mask_j) +``` + +Only labels with supervision (and assistant spans) contribute; **`ignore_index = -100`** drops non-target positions. Long sequences are summed in **chunks** (e.g. 256 positions) inside **`compute_loss`** to cap peak memory. 
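As a rough sketch of how the micro advantage, the do-no-harm rule, the combined weights, and the training loss fit together, the following reads the loss as a masked, weight-normalised mean of per-token cross-entropy. Several details are assumptions: the zero-`sigma_h` guard, normalising `w_t` by the mean over the whole batch rather than per episode, and the hypothetical names `step_weights` and `weighted_loss`.

```python
import statistics


def step_weights(q_smoothed_by_episode, grpo_adv, omega=1.0):
    """Per-episode lists of w_t from smoothed hindsight scores and macro advantages."""
    flat = [q for steps in q_smoothed_by_episode for q in steps]
    mu_h = statistics.fmean(flat)
    sigma_h = statistics.pstdev(flat) or 1.0  # zero-std guard is an assumption
    raw = []
    for steps, a_grpo in zip(q_smoothed_by_episode, grpo_adv):
        row = []
        for q in steps:
            a_micro = (q - mu_h) / sigma_h
            if a_grpo > 0:  # do-no-harm: never pull down steps of a good episode
                a_micro = max(a_micro, 0.0)
            row.append(max(a_grpo + omega * a_micro, 0.0))
        raw.append(row)
    positive = [w for row in raw for w in row if w > 0]
    scale = statistics.fmean(positive) if positive else 1.0
    return [[w / scale for w in row] for row in raw]


def weighted_loss(ce, weights, mask):
    """sum(w * mask * CE) / sum(w * mask); 0.0 when nothing is supervised."""
    num = sum(w * m * c for c, w, m in zip(ce, weights, mask))
    den = sum(w * m for w, m in zip(weights, mask))
    return num / den if den > 0 else 0.0


weights = step_weights([[1.0, 3.0], [2.0, 2.0]], grpo_adv=[1.0, -1.0])
loss = weighted_loss([2.0, 4.0, 9.0], [1.0, 3.0, 5.0], [1, 1, 0])
```

In the example, the second episode has a negative macro advantage and average hindsight scores, so all of its weights are zero; per the rule above, such a row would be dropped from the JSONL.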
+ +--- + +## Training adjustments (Qwen, Unsloth, HF) + +### Qwen 3.5 / 3.6 architecture and wrappers + +Many Qwen 3.x checkpoints use **`Qwen3_5ForConditionalGeneration`**: a multimodal module tree that still includes **`language_model`** + **`lm_head`** for text. With **PEFT / Unsloth**, you often get: + +```text +PeftModelForCausalLM + └── LoraModel + └── Qwen3_5ForConditionalGeneration + ├── model (Qwen3_5Model) + │ └── language_model ← text backbone for loss + └── lm_head +``` + +[`_get_backbone_and_lm_head()`](train_hcapo.py) unwraps **PeftModel → LoraModel → inner CausalLM**, then uses **`.model`** as the transformer backbone and follows **`.language_model`** when present so **`lm_head.in_features`** matches **hidden states**. + +Reported sizes (for sanity checks): + +```text +Qwen3.5-4B: hidden_size = 2560, vocab_size = 248320 +Qwen3.6-27B: hidden_size = 5120, vocab_size = 248320 +``` + +[`_remove_qwen_vision_mappings()`](train_hcapo.py) strips vision-related **`auto_map`** entries so Unsloth does not treat a text-only checkpoint as a vision pipeline. + +### Chat template and `assistant_masks` + +Transformers only fills **`assistant_masks`** when the Jinja template wraps assistant generations with: + +```jinja +{% generation %} +... +{% endgeneration %} +``` + +Qwen templates may omit this. The trainer **patches the tokenizer chat template in memory** (see [`_ensure_generation_chat_template()`](train_hcapo.py)) so **`apply_chat_template(..., return_assistant_tokens_mask=True)`** works in one pass—important for long Pi sessions. + +### Pre-tokenization vs `formatting_func` + +Unsloth’s SFT path often wants a **`formatting_func`** when there is no plain **`text`** column. We **pre-tokenize** rows to **`input_ids`** + **`assistant_masks`** + **`step_advantages`** so Unsloth can skip conversational re-formatting at train time. 
After that, **`assistant_only_loss`** is set **`False`** in **`SFTConfig`**; the **HCAPO collator** enforces assistant-only regions via masks. + +### HCAPO data collator + +[`_build_hcapo_data_collator()`](train_hcapo.py): + +1. Strips metadata columns before the base collator runs. +2. Uses **`assistant_masks`** so non-assistant positions are **`ignore_index`**. +3. Finds contiguous **assistant label spans** in **`labels`**. +4. Assigns each span the corresponding **`step_advantages`** entry. +5. Adds **`step_weights`** to the batch for **`HCAPOTrainer`**. + +If Unsloth swaps the collator during init, the trainer **re-applies** the HCAPO collator so **`step_weights`** are not dropped. + +### Chunked backbone + `lm_head` projection + +For **27B × long context**, a single **`model(**inputs)`** that returns full **`[batch, seq, vocab]`** logits can exceed **A100 80GB**. The custom **`compute_loss`** path: + +1. Runs the **text backbone** with **`use_cache=False`**. +2. Drops the large activations that are not needed for the next chunk. +3. Applies **`lm_head`** in **chunks** (default width **256** tokens). +4. Accumulates weighted CE numerator and denominator across chunks. + +Peak logits memory scales like **`O(chunk × vocab)`** instead of **`O(seq × vocab)`**. + +### Liger + +**`liger-kernel>=0.7.0`** is a project dependency. Fused kernels can still help **inside** transformer blocks during the backbone forward. The **custom** loss path does **not** call Liger’s fused CE for the final weighted loss (we need arbitrary **`step_weights`** per position). + +### Adapter vs merged weights + +Prefer saving the **LoRA adapter** (`save_merged_16bit: false` in config) to avoid multi‑tens‑of‑GB merged checkpoints. Load **base + adapter** at inference. + +### No QLoRA for the A100 Qwen 3.6 recipe + +The reference HF config keeps **`load_in_4bit: false`** for the 27B Space run so training stays on the **bf16 LoRA** path without 4-bit quant quirks on this stack. 
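The chunked `lm_head` projection described earlier in this section can be illustrated with a small pure-Python stand-in. This is not `HCAPOTrainer.compute_loss`: it uses nested lists instead of tensors, and `token_ce` and `chunked_weighted_ce` are hypothetical names. It only shows that accumulating the weighted numerator and denominator chunk by chunk gives the same loss as a single pass.

```python
import math


def token_ce(hidden_row, lm_head, label):
    """Cross-entropy of one position: project to vocab logits, then log-softmax."""
    logits = [sum(h * w for h, w in zip(hidden_row, row)) for row in lm_head]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[label]


def chunked_weighted_ce(hidden, lm_head, labels, weights,
                        chunk=256, ignore_index=-100):
    """Accumulate weighted CE numerator and denominator one chunk at a time."""
    num = den = 0.0
    for start in range(0, len(hidden), chunk):
        stop = start + chunk
        # A tensor implementation would materialise a [chunk, vocab] logits block
        # here; this stand-in walks the slice position by position instead.
        for h, y, w in zip(hidden[start:stop], labels[start:stop], weights[start:stop]):
            if y == ignore_index:
                continue  # non-target position contributes nothing
            num += w * token_ce(h, lm_head, y)
            den += w
    return num / den if den > 0 else 0.0


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lm_head = [[1.0, 0.0], [0.0, 1.0]]  # vocab of 2, hidden size 2
labels = [0, -100, 1]
weights = [1.0, 2.0, 3.0]
small_chunks = chunked_weighted_ce(hidden, lm_head, labels, weights, chunk=2)
one_chunk = chunked_weighted_ce(hidden, lm_head, labels, weights, chunk=100)
```

Because only the running numerator and denominator cross chunk boundaries, the chunk width trades speed for peak memory without changing the result.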
+ +--- + +## Configurations + +Paths are wired in [`launch_hf_space.sh`](../scripts/launch_hf_space.sh) and copied in [`Dockerfile.train`](Dockerfile.train): + +| File | Role | +| --- | --- | +| [`hcapo_config_4090_q35_4b.json`](hcapo_config_4090_q35_4b.json) | Local **4090** smoke: **`Qwen/Qwen3.5-4B`**, **`max_seq_length` 1024**, **`num_train_epochs` 1**, **`per_device_train_batch_size` 1**, **`gradient_accumulation_steps` 8**, **`warmup_steps` 5**, **`load_in_4bit` false**. | +| [`hcapo_config_a100_q36_27b.json`](hcapo_config_a100_q36_27b.json) | **A100** HF recipe: **`Qwen/Qwen3.6-27B`**, **`max_seq_length` 16384**, **`num_train_epochs` 3**, **`per_device_train_batch_size` 1**, **`gradient_accumulation_steps` 4**, **`warmup_steps` 2**, **`load_in_4bit` false**, **`save_merged_16bit` false**. | + +**Step budget:** with **`per_device_train_batch_size = 1`** and **`gradient_accumulation_steps = 4`**, Hugging Face / TRL advance the optimiser roughly **`len(train_dataloader) // 4`** times per epoch (exact rounding depends on version and **`drop_last`**). For **~14** JSONL rows that is on the order of **three** updates per epoch, so **three epochs → ~nine** global steps unless **`--max-steps`** or a larger dataset changes the schedule. If Trackio shows a different total (e.g. **18**), compare the **`max_steps`** / dataset size / launch overrides for that run. + +--- + +## HF Spaces behaviour + +### Health check (port **7860**) + +Spaces expect HTTP on **7860** within the startup window. [`Dockerfile.train`](Dockerfile.train) starts a tiny background server before training: + +```bash +uv run python -m http.server 7860 &>/dev/null & +``` + +### Container lifecycle + +Training should **not** `exec` into the trainer as **PID 1**: when the process exits, the container dies and the Space may restart. The image keeps **bash** as PID **1**, runs training, then **`sleep infinity`** so the Space stays up until you pause or delete it. 
+ +```bash +huggingface-cli space pause / +``` + +### Dependencies + +Training extras live under **`[project.optional-dependencies] training`** in [`pyproject.toml`](../pyproject.toml). The training image installs with: + +```text +uv sync --frozen --no-dev --extra training +``` + +### Naming (example) + +| Artefact | Example id | +| --- | --- | +| Dataset repo | `fswe-hcapo-pg-01-trajectories` | +| Adapter output repo | `fswe-hcapo-pg-01-qwen36-27b` | +| Trackio Space | `/fswe-hcapo-pg-01-monitor` | +| Trackio project | `fswe-hcapo-pg-01` | +| Run name | `fswe-hcapo-pg-01-qwen36-27b` | + +Set **`report_to = trackio`**, **`TRACKIO_SPACE_ID`**, **`TRACKIO_PROJECT_NAME`**, and optionally the compatibility aliases **`TRACKIO_SPACE`**, **`TRACKIO_PROJECT`** (see [`train_hcapo.py`](train_hcapo.py) argparse / env handling). + +--- + +## Typical commands + +```bash +uv run python scripts/build_hcapo_dataset.py \ + --input-dir trajectories \ + --output-dir datasets \ + --min-reward 0.05 \ + --omega 1.0 +``` + +```bash +./scripts/launch_hf_space.sh --upload-dataset +./scripts/launch_hf_space.sh --max-steps 1 +./scripts/launch_hf_space.sh --with-dataset-upload --max-steps 1 +./scripts/launch_hf_space.sh +./scripts/launch_hf_space.sh --delete +``` + +--- + +## Troubleshooting + +### Planning-only episodes with reward **0.05** + +Backfill / rubric can assign a small **\(R\)** via **`tool_density`**, but dense **`r_t`** on steps may stay **0** until a plan and subtask scores exist—little HCAPO signal after clipping. + +### OOM on first training step + +If failure is inside **`cross_entropy`** on full logits, ensure the **chunked backbone + `lm_head`** path is active (see **`HCAPOTrainer.compute_loss`**). Fallback: lower **`max_seq_length`**. + +### `RuntimeError` … `lm_head` / hidden mismatch + +Usually means the resolved “backbone” was still a **full CausalLM** instead of **`Qwen3_5TextModel`**. Check [`_get_backbone_and_lm_head()`](train_hcapo.py) unwrapping. 
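
When debugging this mismatch, a quick pre-flight check can turn the shape error into a readable message. The sketch below is illustrative only and does not reproduce `_get_backbone_and_lm_head()`: it assumes the PEFT wrappers have already been removed, the helper names `pick_text_backbone` and `check_head_matches` are hypothetical, and `SimpleNamespace` objects stand in for the real modules and their `config.hidden_size` / `in_features` attributes.

```python
from types import SimpleNamespace


def pick_text_backbone(causal_lm):
    """Return (backbone, lm_head) for an already-unwrapped causal LM."""
    backbone = causal_lm.model
    # Multimodal trees nest the text stack one level deeper.
    backbone = getattr(backbone, "language_model", backbone)
    return backbone, causal_lm.lm_head


def check_head_matches(backbone, lm_head):
    """Fail early with a clear message instead of a matmul shape error."""
    if hasattr(backbone, "lm_head"):
        # A module that owns its own head returns logits, not hidden states.
        raise ValueError("backbone still looks like a full causal LM")
    hidden = backbone.config.hidden_size
    if lm_head.in_features != hidden:
        raise ValueError(
            f"lm_head expects {lm_head.in_features} features, "
            f"backbone emits {hidden}"
        )
    return hidden


# Stand-in objects mirroring the module tree (sizes from the Qwen3.5-4B row).
text_model = SimpleNamespace(config=SimpleNamespace(hidden_size=2560))
causal_lm = SimpleNamespace(
    model=SimpleNamespace(language_model=text_model),
    lm_head=SimpleNamespace(in_features=2560),
)
backbone, head = pick_text_backbone(causal_lm)
print(check_head_matches(backbone, head))  # 2560
```

Real PEFT and Transformers modules forward some attribute lookups through their wrappers, so attribute-name checks like these are a diagnostic aid under the stated assumptions rather than a drop-in replacement for the trainer's unwrapping.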
+ +### SGLang OOM during hindsight + +Avoid full-prompt logprob modes; keep **`/generate`** + **`logprob_start_len`** + a modest **`--max-logprob-tokens`**. + +### Space killed before training finishes + +Ensure the **7860** stub server is running and the main process is not **`exec`**’d as the only PID without a follow-up **`sleep`**. + +### Wrong Trackio project + +Verify **`REPORT_TO`**, **`TRACKIO_SPACE_ID`**, **`TRACKIO_PROJECT_NAME`**, **`RUN_NAME`**, and the **`TRACKIO_*`** aliases. + +--- + +## File map + +| Stage | Script / artefact | +| --- | --- | +| Collect | [`scripts/collect_trajectories.py`](../scripts/collect_trajectories.py) | +| Backfill reward | [`scripts/backfill_rewards.py`](../scripts/backfill_rewards.py) | +| Hindsight | [`scripts/compute_hindsight_scores.py`](../scripts/compute_hindsight_scores.py) | +| Build JSONL | [`scripts/build_hcapo_dataset.py`](../scripts/build_hcapo_dataset.py) | +| Train | [`training/train_hcapo.py`](train_hcapo.py) | +| HF Space | [`scripts/launch_hf_space.sh`](../scripts/launch_hf_space.sh), [`Dockerfile.train`](Dockerfile.train) | + +--- + +## References + +- HCAPO paper: [arXiv:2603.08754](https://arxiv.org/abs/2603.08754), [HTML + Appendix B](https://arxiv.org/html/2603.08754v1). +- Root README: [Training (offline RL)](../README.md#training-offline-rl).