# Building long-horizon SWE environments on Hugging Face: Frontier SWE × OpenEnv

**By the-thing**: we packaged and adapted 4 [FrontierSWE](https://www.frontierswe.com/) tasks as [OpenEnv](https://github.com/rycerzes/OpenEnv)-shaped services, pushed them to **Hugging Face Spaces**, and ran an **offline RL-style** training loop with public **datasets**, **Trackio** metrics, and a trainer Space.

---

## TL;DR

- **Four Dockerized environments** (notebook compression, Postgres wire adapter on SQLite, dependent type checker, libexpat → x86-64 asm) with a **shared Gym-style API** and **MCP** tools for planning and submission.
- **Custom harness adapter** built on top of OpenEnv harness work ([meta-pytorch/OpenEnv PR #389](https://github.com/meta-pytorch/OpenEnv/pull/389) and RFC005), then forked and extended in [`rycerzes/OpenEnv` on `feature/pi-harness-adapter`](https://github.com/rycerzes/OpenEnv/commits/feature/pi-harness-adapter/).
- **Composite rubric**: gates → L1 (tests / `reward.json` / regex ratios) → optional LLM layers → **episode reward** you can log and filter on for training.
- **Offline pipeline**: trajectories on the Hub → hindsight scoring (SGLang) → HCAPO-style dataset → **LoRA fine-tune** on a GPU Space, with **Trackio** curves for loss, LR, and gradient norms.

**Try it:** [frontier-swe-postgres](https://huggingface.co/spaces/rycerzes/frontier-swe-postgres) · [frontier-swe-notebook](https://huggingface.co/spaces/rycerzes/frontier-swe-notebook) · [frontier-swe-type-checker](https://huggingface.co/spaces/rycerzes/frontier-swe-type-checker) · [frontier-swe-libexpat-to-x86asm](https://huggingface.co/spaces/rycerzes/frontier-swe-libexpat-to-x86asm) · [source on GitHub](https://github.com/3xcaffeine/frontier-swe-openenv)

---

## 1. Environment innovation - why this setup is hard (and worth it)

Classic coding benchmarks often score a single patch. **Long-horizon software engineering** is different: the agent has to **plan**, **edit a real workspace**, **call tools**, and **submit** work over many steps, closer to how people ship systems than to a one-shot fix.

**What we built on top of that idea**

We did not reinvent the underlying FrontierSWE task specs; we **re-homed** them inside a **uniform environment contract**:

That includes a **custom harness adapter** layer we built on top of [meta-pytorch/OpenEnv PR #389](https://github.com/meta-pytorch/OpenEnv/pull/389) and RFC005, then maintained and updated in our fork: [`rycerzes/OpenEnv` `feature/pi-harness-adapter`](https://github.com/rycerzes/OpenEnv/tree/feature/pi-harness-adapter/).

| Piece | What it does for the agent |
| --- | --- |
| **HTTP control** | `reset` / `step` / `state` / `health` - the same shape for every task, so harnesses and demos do not fork per domain and the `openenv` spec stays the single contract to maintain. |
| **MCP tools** | `submit_plan`, `submit_subtask`, `get_status`, `advance` - forces **explicit decomposition** and **scored subtasks**, not a single anonymous blob of edits. |
| **Multi-layer rubric** | **Gates** catch broken builds or missing artifacts early; **L1** is task-native (wire compat tests, notebook round-trips, type-checker scores, assembly benchmarks); **L2/L3** optionally add LLM code and plan review when grader env vars are set; **episode reward** blends plan quality, frozen subtask scores, completion, and tool usage. |

That combination is deliberately **stressful** in a good way: the agent must **coordinate** (plan → execute → advance), **respect verifier reality** (hidden tests, anti-cheat), and **earn** a dense scalar at the end of an episode that can run on the order of **45–90+ minutes** per run, so the environment is **challenging**, **creative** in how it composes rubrics, and **meaningful** for measuring behavior beyond single-turn chat.

---

## 2. The problem, the box, and what the agent actually does

**Problem.** Training or evaluating agents on real long-horizon SWE needs a **repeatable service**: same ports, same instructions, same scoring, same tool surface, whether you run locally, in CI, or on the Hub.

**Our box.** **frontier-swe-openenv** is a small monorepo: `tasks/<task-id>/` holds instructions and verifiers (what “correct” means operationally); `frontier_swe_env/` holds the **FastAPI** server, shared rubrics, and **TaskConfig** (how to invoke those verifiers inside the image); `spaces/` holds thin **Space** definitions synced from `main` after images build.

**Agent behavior (easy to follow for a demo).**

1. Connect (WebSocket client or baseline script).
2. `reset` → read observation / phase.
3. Loop: natural language or tool use → `step` → optional MCP calls to **submit a plan**, run **L1+L2** on a **subtask**, **advance** when satisfied.
4. Episode ends with a **terminal episode reward** and subtask history you can log.
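Assuming the client exposes the async surface shown in the Space READMEs (`connect` / `reset` / `step` / `close`), the loop above can be sketched as follows; the dict-shaped action and result fields here are illustrative stand-ins, not the real `FrontierSweAction` model:

```python
import asyncio


async def run_episode(client, max_steps: int = 8):
    """Drive one reset -> step loop and collect per-step rewards.

    `client` is any object with async connect/reset/step/close; the
    action/result dict shapes below are assumptions for illustration.
    """
    await client.connect()
    try:
        await client.reset()  # observation / phase comes back here
        rewards = []
        for _ in range(max_steps):
            result = await client.step({"message": "Continue the current subtask."})
            rewards.append(result.get("reward", 0.0))
            if result.get("done"):  # terminal episode reward reached
                break
        return rewards
    finally:
        await client.close()
```

In a real run the MCP calls (`submit_plan`, `submit_subtask`, `advance`) interleave with those steps; the sketch keeps only the Gym-style skeleton.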

For a **concrete walkthrough without writing your own client**, the repo ships [`scripts/run_baseline.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/run_baseline.py): point it at `http://localhost:8000` with a task container running, and you get a full **reset → step** episode over the wire, good for recordings and “here is one turn of the loop” explanations.

---

## 3. Observable training progress - rewards, curves

Long episodes make **online** RL on the live env impractical at scale, so we invested in **offline** learning: **collect once**, **score offline**, **fine-tune**, **log everything**.

**Public artifacts (HF-native story)**

| Artifact | Link | Role in the demo |
| --- | --- | --- |
| Raw trajectories (pg-01, Qwen 3.6 27B) | [`rycerzes/fswe-pg-01-traj-q36-27b`](https://huggingface.co/datasets/rycerzes/fswe-pg-01-traj-q36-27b) | Shows **what** we logged per episode (`result.json`, sessions, logs, hindsight when present). |
| HCAPO training JSONL | [`rycerzes/fswe-hcapo-pg-01-trajectories`](https://huggingface.co/datasets/rycerzes/fswe-hcapo-pg-01-trajectories) | **Step-level advantages** paired with messages for supervised fine-tuning. |
| Trackio dashboard | [`rycerzes/trackio`](https://huggingface.co/spaces/rycerzes/trackio) | **Observable** loss, epoch, learning rate, gradient norm, global step. |

On a **3-epoch / ~18-optimizer-step** reference run (Space-backed trainer), the root README documents what we see in Trackio: **loss** trending down on the order of **~25%** over the plotted window (smoothed), **epoch** progressing toward **~2.7**, **LR** warmup-then-decay, **gradient norms** staying in a moderate band, i.e. a **sanity fine-tune** where optimization looks stable, not a mystery box.

We also ship a **static dashboard figure** in-repo for slides and blog embeds: [`assets/training-trackio-dashboard.png`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/assets/training-trackio-dashboard.png).

**Before / after.** The cleanest **before/after** we surface in tooling today is **training loss and optimization metrics** on the HCAPO dataset, plus **episode-level rewards inside collected trajectories** for analysis. A live **A/B rollout score** on the full Docker env after LoRA is the natural next chapter for the demo, and the pipeline is set up so you can **regenerate trajectories** with the adapted policy and compare distributions. For hackathon judging, the **curves + public datasets + reproducible launch script** are the evidence chain we stand behind *right now*.

---

## 4. Reward logic and training pipeline - coherent signal end to end

**Episode reward (macro).** The scalar \(R\) matches [`EpisodeRubric`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/frontier_swe_env/rubrics/episode_rubric.py): weighted **plan score**, mean **frozen subtask** scores, **completion**, and **tool density**, clipped into **[0, 1]** for filtering (e.g. `--min-reward 0.05` in the dataset builder).

**L1 (micro, task-specific).** Each task implements its own verifier output: **regex ratio** on test totals (Postgres), **`reward_json`** fields (notebook), or **`reward_json_score`** with anchors (type checker, libexpat). Same server code paths; different physics.
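For the regex-ratio mode, a minimal sketch (the summary-line pattern here is assumed, not the exact one the Postgres verifier emits):

```python
import re


def regex_ratio(log_text: str) -> float:
    """Extract a pass ratio from a verifier's test-runner summary line.

    The pattern is illustrative; the real task matches its own
    runner's output format.
    """
    m = re.search(r"(\d+)\s+passed\s*/\s*(\d+)\s+total", log_text)
    if not m:
        return 0.0  # no summary line means no credit
    passed, total = int(m.group(1)), int(m.group(2))
    return passed / total if total else 0.0
```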

**Training path (why it should move policy behavior).**

1. [`collect_trajectories.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/collect_trajectories.py) - rollouts into `trajectories/episode_NNN/`.
2. [`backfill_rewards.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/backfill_rewards.py) - repair missing `episode_reward` when needed.
3. [`compute_hindsight_scores.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/compute_hindsight_scores.py) - SGLang `/generate` with bounded logprob windows (memory-safe), MCP-aware **step → subtask** mapping, hindsight \(Q^H\) and smoothing.
4. [`build_hcapo_dataset.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/build_hcapo_dataset.py) - GRPO-style macro advantages + normalized hindsight micro advantages → **JSONL** with **per-step weights**.
5. [`train_hcapo.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/training/train_hcapo.py) + [`launch_hf_space.sh`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/launch_hf_space.sh) - **weighted CE on assistant tokens** (chunked forward for large models), Trackio reporting.
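Steps 3–4 can be sketched as follows; the normalization blend is illustrative, and `build_hcapo_dataset.py` defines the real formulas:

```python
from statistics import mean, pstdev


def macro_advantages(episode_rewards):
    """GRPO-style macro advantage: center each episode reward on the
    group mean and scale by the group standard deviation."""
    mu, sigma = mean(episode_rewards), pstdev(episode_rewards)
    return [(r - mu) / (sigma or 1.0) for r in episode_rewards]


def step_weights(macro_adv, hindsight_scores):
    """Per-step weight: macro advantage modulated by min-max normalized
    hindsight Q^H. The blend here is a guess at the shape, not the
    exact formula the dataset builder uses."""
    lo, hi = min(hindsight_scores), max(hindsight_scores)
    span = (hi - lo) or 1.0
    return [macro_adv * ((q - lo) / span) for q in hindsight_scores]
```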

The design is coherent: environment reward defines **which episodes matter**; hindsight defines **which tokens inside those episodes** get gradient; and the trainer respects **assistant masks** and **step weights**, so the update is not “one scalar smeared across the whole transcript.” Details and equations live in [`training/README.md`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/training/README.md).
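A tensor-free sketch of that masked, step-weighted loss (shapes are illustrative; `train_hcapo.py` works on token tensors with an optimizer):

```python
import math


def weighted_ce(token_logprobs, assistant_mask, step_weights):
    """Weighted cross-entropy over assistant tokens only.

    token_logprobs: log p(token) under the model at each position;
    assistant_mask: 1 for assistant tokens, 0 for prompt/tool tokens;
    step_weights:   advantage-derived weight broadcast from steps to tokens.
    """
    num = sum(-lp * m * w
              for lp, m, w in zip(token_logprobs, assistant_mask, step_weights))
    den = sum(m * w for m, w in zip(assistant_mask, step_weights)) or 1.0
    return num / den  # masked, weight-normalized NLL
```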

---

## Where to go next

- **Run a Space** from the TL;DR links and narrate **one** subtask submission end to end.
- **Open Trackio** to the named run and zoom the **loss / LR** panel while you talk through the pipeline slide.
- **Clone the repo**, `uv sync`, and use **`./scripts/launch_hf_space.sh`** when you want the full HF training path on your own account.

# Frontier SWE — libexpat to x86-64 Assembly

OpenEnv-shaped FastAPI service hosting the libexpat-to-x86asm task.
OpenEnv-shaped **FastAPI** service for the **libexpat-to-x86asm** task: reimplement **libexpat 2.6.4** in **x86-64 assembly**, producing `/app/asm-port/libexpat.so` with the **expat C ABI**. The verifier compares against reference C libexpat, runs upstream tests and benchmarks, and writes `/logs/verifier/reward.json` (correctness and performance blend; hard fail to `0.0` on anti-cheat or missing `.so`).

- Source repo: <https://github.com/3xcaffeine/frontier-swe-openenv>
- Container image: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest`
- Health: `/health`
- MCP JSON-RPC: `/mcp`
## The task in depth

Deployed automatically from `main` via the `sync-hf-spaces` workflow.
The agent’s deliverable is a **shared library** built from **`.s` / `.asm`** sources under **`/app/asm-port/`**, exporting symbols such as **`XML_ParserCreate`** so the upstream **expat** test suite can link against it. There is **no C compiler** in the agent environment; the verifier may compile reference C code for comparison. Scoring combines **weighted test pass rates** with **benchmark timing ratios** (reference time vs agent time) into a single **`score`** in **`reward.json`**, with explicit anti-cheat checks (no `dlopen` of system libexpat, no smuggled C core files, etc.). The server treats that file in **`reward_json_score`** mode with anchors **`(0.0, 1.0)`**.
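A minimal sketch of `reward_json_score` anchoring, assuming a `score`/`status` payload shape (any field name beyond those mentioned in this README is a guess):

```python
def score_from_reward_json(payload: dict, anchors=(0.0, 1.0)) -> float:
    """Map the verifier's reward.json score onto [0, 1] using anchors.

    A hard fail (anti-cheat trip, missing .so) collapses the score to 0.0;
    otherwise the raw score is linearly rescaled between the anchors.
    """
    if payload.get("status") == "fail":
        return 0.0
    lo, hi = anchors
    raw = float(payload.get("score", lo))
    return max(0.0, min(1.0, (raw - lo) / (hi - lo)))
```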

## How this maps to the monorepo

- **`tasks/libexpat-to-x86asm/`** — Instructions, encrypted or staged toolchain bundles, **`tests/`** with **`test.sh`**, **`compute_reward.py`**, and benchmark XML generators.
- **`frontier_swe_env/tasks/libexpat_to_x86asm.py`** — **`TaskConfig`**: workspace **`/app/asm-port`**, gate script, verifier command, JSON path and anchors, CPU/memory hints, and judge context strings.
- **`spaces/libexpat-to-x86asm/`** — This Space and manifest.

See [**Task assets and runtime configuration**](https://github.com/3xcaffeine/frontier-swe-openenv#task-assets-and-runtime-configuration) in the root README.

## Features

- **Assembly port workspace**: `/app/asm-port` with staged toolchain and bundles (see gate checks in manifest).
- **Structured L1**: Normalised score from `reward.json`; gates for writable workspace, headers, `nasm` / `as` / `ld`, and staged artifacts.
- **LLM rubric layers**: L2 code review and L3 plan review when grader env vars are set.
- **MCP tools**: `submit_plan`, `submit_subtask`, `get_status`, `advance`.

## HTTP API

| Endpoint | Notes |
| --- | --- |
| `GET /health` | Liveness. |
| `POST /reset`, `POST /step`, `GET /state` | OpenEnv Gym-style control. |
| `POST /mcp` | OpenEnv JSON-RPC MCP. |
| `/tools/mcp` | FastMCP Streamable HTTP. |

## Quick start (Docker)

```bash
docker run --rm -p 8000:8000 \
ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest
```

This task is CPU- and memory-sensitive; the manifest requests **4 CPUs** and **8192 MiB** where the platform allows.

Optional grader configuration for LLM rubric layers:

```bash
docker run --rm -p 8000:8000 \
-e FSWE_GRADER_MODEL=... \
-e FSWE_GRADER_API_URL=... \
-e FSWE_GRADER_API_KEY=... \
ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest
```

## Python client (host)

```python
import asyncio
from frontier_swe_env.client import FrontierSweEnv
from frontier_swe_env.models import FrontierSweAction


async def main():
    client = FrontierSweEnv(base_url="http://localhost:8000")
    await client.connect()
    try:
        await client.reset()
        await client.step(FrontierSweAction(message="Continue the assembly port."))
    finally:
        await client.close()


asyncio.run(main())
```

## Task manifest

[`openenv.yaml`](openenv.yaml) — episode timeout, L1 timeout, reward field anchors, rubric layers, metrics. Task sources: `tasks/libexpat-to-x86asm/`.
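As a rough illustration of the manifest's shape (every field name and value below is assumed for the sketch, not copied from the real file):

```yaml
# illustrative only — see the actual openenv.yaml in this Space for real keys
episode_timeout_s: 5400
l1_timeout_s: 1800
reward:
  field: score
  anchors: [0.0, 1.0]
rubric_layers: [gates, l1, l2, l3]
metrics: [tests_passed, benchmark_ratio]
```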

## Deployment

- **Image**: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest`
- **Source**: [3xcaffeine/frontier-swe-openenv](https://github.com/3xcaffeine/frontier-swe-openenv)
- **Sync**: HF Space updated from `main` after successful GHCR build.

Benchmark context: [FrontierSWE — libexpat to x86-64 assembly](https://www.frontierswe.com/libexpat-to-x86asm).

# Frontier SWE — Notebook Compression

OpenEnv-shaped FastAPI service hosting the notebook-compression task.
OpenEnv-shaped **FastAPI** service for the **notebook-compression** task: build a fit / compress / decompress pipeline for Jupyter notebooks inside a Linux workspace, with multi-layer rubric scoring and a structured `reward.json` written by the verifier.

- Source repo: <https://github.com/3xcaffeine/frontier-swe-openenv>
- Container image: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest`
- Health: `/health`
- MCP JSON-RPC: `/mcp`
## The task in depth

Deployed automatically from `main` via the `sync-hf-spaces` workflow.
The agent needs to ship an executable **`/app/run`** with three subcommands: **`fit`** (train or build artifacts from a **visible** corpus only), **`compress`**, and **`decompress`**. At scoring time the agent does not see the hidden corpus: the verifier checks **byte-for-byte** recovery of every notebook file. Compression quality is summarised as a geometric mean of size ratios; hard failures (round-trip mismatch, crashes, invalid `reward.json` status) collapse the L1 signal to zero. That logic lives in the repo under [`tasks/notebook-compression/tests/`](https://github.com/3xcaffeine/frontier-swe-openenv/tree/main/tasks/notebook-compression/tests) (shell driver plus [`compute_reward.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/tasks/notebook-compression/tests/compute_reward.py)), which writes **`/logs/verifier/reward.json`** for the server to read.
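A sketch of that scoring shape, with the hard-fail collapse and the geometric mean (how `compute_reward.py` maps the mean onto a final score is not reproduced here):

```python
import math


def l1_signal(size_pairs, round_trips_ok: bool) -> float:
    """size_pairs: (compressed_bytes, original_bytes) per hidden notebook.

    Any byte-level round-trip mismatch collapses the signal to zero;
    otherwise return the geometric mean of compressed/original ratios
    (lower means better compression).
    """
    if not round_trips_ok or not size_pairs:
        return 0.0
    ratios = [c / o for c, o in size_pairs]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```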

## How this maps to the monorepo

- **`tasks/notebook-compression/`** — Authoritative instructions, verifier, and reward computation; copied into the image (for example **`/opt/verifier/test.sh`** and data mounts).
- **`frontier_swe_env/tasks/notebook_compression.py`** — Registers **`TaskConfig`** with `l1_score_mode="reward_json"`, the container test command, long L1 timeouts, gate path, and prose for L2/L3 judges. The running server selects it when `FSWE_TASK_NAME` is `notebook` or `notebook-compression` (see [`__init__.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/frontier_swe_env/tasks/__init__.py)).
- **`spaces/notebook/`** — This Space: thin Dockerfile, this README, and **`openenv.yaml`** describing the same episode for Hugging Face and external tooling.
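To make the wiring concrete, here is a hypothetical stand-in for that registration; the field names are assumptions based on the bullets above, not the real `TaskConfig` class:

```python
from dataclasses import dataclass


@dataclass
class TaskConfigSketch:
    """Hypothetical stand-in for frontier_swe_env's TaskConfig."""
    task_name: str
    workspace: str
    l1_score_mode: str   # e.g. "reward_json" or "reward_json_score"
    test_command: str    # verifier entry point inside the image
    l1_timeout_s: int    # long L1 timeout for the hidden-corpus run
    gate_script: str     # shell gate checks run before L1


# illustrative values only — paths and timeout are guesses
notebook = TaskConfigSketch(
    task_name="notebook-compression",
    workspace="/app",
    l1_score_mode="reward_json",
    test_command="/opt/verifier/test.sh",
    l1_timeout_s=3600,
    gate_script="/opt/verifier/gate.sh",
)
```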

For the full picture of how task directories and Python configs interact, see the root README section [**Task assets and runtime configuration**](https://github.com/3xcaffeine/frontier-swe-openenv#task-assets-and-runtime-configuration).

## Features

- **Long-horizon SWE**: Plan subtasks, edit code under the configured workspace, submit for scoring.
- **Composite rubric**: Shell gate checks → structured L1 from `/logs/verifier/reward.json` → optional LLM code review (L2) and plan review (L3) → weighted episode reward.
- **MCP tools**: `submit_plan`, `submit_subtask`, `get_status`, `advance` (same contract as other Frontier SWE Spaces).
- **Dual MCP transports**: OpenEnv `POST /mcp` and Streamable HTTP `/tools/mcp` for adapters.

## HTTP API

| Endpoint | Notes |
| --- | --- |
| `GET /health` | Liveness for orchestration and HF health checks. |
| `POST /reset`, `POST /step`, `GET /state` | OpenEnv Gym-style control. |
| `POST /mcp` | OpenEnv JSON-RPC MCP. |
| `/tools/mcp` | FastMCP Streamable HTTP (POST + GET/SSE). |

## Quick start (Docker)

```bash
docker run --rm -p 8000:8000 \
ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest
```

Optional grader configuration for LLM rubric layers:

```bash
docker run --rm -p 8000:8000 \
-e FSWE_GRADER_MODEL=... \
-e FSWE_GRADER_API_URL=... \
-e FSWE_GRADER_API_KEY=... \
ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest
```

## Python client (host)

From the [source repository](https://github.com/3xcaffeine/frontier-swe-openenv), with dependencies installed:

```python
import asyncio
from frontier_swe_env.client import FrontierSweEnv
from frontier_swe_env.models import FrontierSweAction


async def main():
    client = FrontierSweEnv(base_url="http://localhost:8000")
    await client.connect()
    try:
        await client.reset()
        await client.step(FrontierSweAction(message="Continue the task."))
    finally:
        await client.close()


asyncio.run(main())
```

## Task manifest

OpenEnv metadata for judges and tooling: [`openenv.yaml`](openenv.yaml) in this Space (mirrors `spaces/notebook/openenv.yaml` in the GitHub repo). Task sources: `tasks/notebook-compression/`.

## Deployment

- **Image**: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest`
- **Source**: [3xcaffeine/frontier-swe-openenv](https://github.com/3xcaffeine/frontier-swe-openenv)
- **Sync**: Pushed from `main` by the repository’s HF Spaces sync workflow after GHCR builds succeed.

Benchmark context: [FrontierSWE — Notebook compression](https://www.frontierswe.com/notebook-compression).