# Building long-horizon SWE environments on Hugging Face: Frontier SWE × OpenEnv

**By the-thing**: we packaged and adapted 4 [FrontierSWE](https://www.frontierswe.com/) tasks as [OpenEnv](https://github.com/rycerzes/OpenEnv)-shaped services, pushed them to **Hugging Face Spaces**, and ran an **offline RL-style** training loop with public **datasets**, **Trackio** metrics, and a trainer Space.

---

## TL;DR

- **Four Dockerized environments** (notebook compression, Postgres wire adapter on SQLite, dependent type checker, libexpat → x86-64 asm) with a **shared Gym-style API** and **MCP** tools for planning and submission.
- **Custom harness adapter** built on top of OpenEnv harness work ([meta-pytorch/OpenEnv PR #389](https://github.com/meta-pytorch/OpenEnv/pull/389) and RFC005), then forked and extended in [`rycerzes/OpenEnv` on `feature/pi-harness-adapter`](https://github.com/rycerzes/OpenEnv/commits/feature/pi-harness-adapter/).
- **Composite rubric**: gates → L1 (tests / `reward.json` / regex ratios) → optional LLM layers → **episode reward** you can log and filter on for training.
- **Offline pipeline**: trajectories on the Hub → hindsight scoring (SGLang) → HCAPO-style dataset → **LoRA fine-tune** on a GPU Space, with **Trackio** curves for loss, LR, and gradient norms.

**Try it:** [frontier-swe-postgres](https://huggingface.co/spaces/rycerzes/frontier-swe-postgres) · [frontier-swe-notebook](https://huggingface.co/spaces/rycerzes/frontier-swe-notebook) · [frontier-swe-type-checker](https://huggingface.co/spaces/rycerzes/frontier-swe-type-checker) · [frontier-swe-libexpat-to-x86asm](https://huggingface.co/spaces/rycerzes/frontier-swe-libexpat-to-x86asm) · [source on GitHub](https://github.com/3xcaffeine/frontier-swe-openenv)

---

## 1. Environment innovation - why this setup is hard (and worth it)

Classic coding benchmarks often score a single patch. **Long-horizon software engineering** is different: the agent has to **plan**, **edit a real workspace**, **call tools**, and **submit** work over many steps, closer to how people ship systems than to a one-shot fix.

**What we built on top of that idea**

We did not reinvent the underlying FrontierSWE task specs; we **re-homed** them inside a **uniform environment contract**:

That includes a **custom harness adapter** layer we built on top of [meta-pytorch/OpenEnv PR #389](https://github.com/meta-pytorch/OpenEnv/pull/389) and RFC005, then maintained and updated in our fork: [`rycerzes/OpenEnv` `feature/pi-harness-adapter`](https://github.com/rycerzes/OpenEnv/tree/feature/pi-harness-adapter/).

| Piece | What it does for the agent |
| --- | --- |
| **HTTP control** | `reset` / `step` / `state` / `health` - the same shape for every task, so harnesses and demos do not fork per domain and the `openenv` spec stays the single contract to maintain. |
| **MCP tools** | `submit_plan`, `submit_subtask`, `get_status`, `advance` - forces **explicit decomposition** and **scored subtasks**, not a single anonymous blob of edits. |
| **Multi-layer rubric** | **Gates** catch broken builds or missing artifacts early; **L1** is task-native (wire compat tests, notebook round-trips, type-checker scores, assembly benchmarks); **L2/L3** optionally add LLM code and plan review when grader env vars are set; **episode reward** blends plan quality, frozen subtask scores, completion, and tool usage. |

That combination is deliberately **stressful** in a good way: the agent must **coordinate** (plan → execute → advance), **respect verifier reality** (hidden tests, anti-cheat), and **earn** a dense scalar at the end of an episode that can run on the order of **45–90+ minutes** per run, so the environment is **challenging**, **creative** in how it composes rubrics, and **meaningful** for measuring behavior beyond single-turn chat.

---

## 2. The problem, the box, and what the agent actually does

**Problem.** Training or evaluating agents on real long-horizon SWE needs a **repeatable service**: same ports, same instructions, same scoring, same tool surface, whether you run locally, in CI, or on the Hub.

**Our box.** **frontier-swe-openenv** is a small monorepo: `tasks/<task-id>/` holds instructions and verifiers (what “correct” means operationally); `frontier_swe_env/` holds the **FastAPI** server, shared rubrics, and **TaskConfig** (how to invoke those verifiers inside the image); `spaces/` holds thin **Space** definitions synced from `main` after images build.

**Agent behavior (easy to follow for a demo).**

1. Connect (WebSocket client or baseline script).
2. `reset` → read observation / phase.
3. Loop: natural language or tool use → `step` → optional MCP calls to **submit a plan**, run **L1+L2** on a **subtask**, **advance** when satisfied.
4. Episode ends with a **terminal episode reward** and subtask history you can log.
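Assuming the client exposes the async surface shown in the Space READMEs (`connect` / `reset` / `step` / `close`), the loop above can be sketched as follows; the dict-shaped action and result fields here are illustrative stand-ins, not the real `FrontierSweAction` model:

```python
import asyncio


async def run_episode(client, max_steps: int = 8):
    """Drive one reset -> step loop and collect per-step rewards.

    `client` is any object with async connect/reset/step/close; the
    action/result dict shapes below are assumptions for illustration.
    """
    await client.connect()
    try:
        await client.reset()  # observation / phase comes back here
        rewards = []
        for _ in range(max_steps):
            result = await client.step({"message": "Continue the current subtask."})
            rewards.append(result.get("reward", 0.0))
            if result.get("done"):  # terminal episode reward reached
                break
        return rewards
    finally:
        await client.close()
```

In a real run the MCP calls (`submit_plan`, `submit_subtask`, `advance`) interleave with those steps; the sketch keeps only the Gym-style skeleton.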

For a **concrete walkthrough without writing your own client**, the repo ships [`scripts/run_baseline.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/run_baseline.py): point it at `http://localhost:8000` with a task container running, and you get a full **reset → step** episode over the wire, good for recordings and “here is one turn of the loop” explanations.

---

## 3. Observable training progress - rewards, curves

Long episodes make **online** RL on the live env impractical at scale, so we invested in **offline** learning: **collect once**, **score offline**, **fine-tune**, **log everything**.

**Public artifacts (HF-native story)**

| Artifact | Link | Role in the demo |
| --- | --- | --- |
| Raw trajectories (pg-01, Qwen 3.6 27B) | [`rycerzes/fswe-pg-01-traj-q36-27b`](https://huggingface.co/datasets/rycerzes/fswe-pg-01-traj-q36-27b) | Shows **what** we logged per episode (`result.json`, sessions, logs, hindsight when present). |
| HCAPO training JSONL | [`rycerzes/fswe-hcapo-pg-01-trajectories`](https://huggingface.co/datasets/rycerzes/fswe-hcapo-pg-01-trajectories) | **Step-level advantages** paired with messages for supervised fine-tuning. |
| Trackio dashboard | [`rycerzes/trackio`](https://huggingface.co/spaces/rycerzes/trackio) | **Observable** loss, epoch, learning rate, gradient norm, global step. |

On a **3-epoch / ~18-optimizer-step** reference run (Space-backed trainer), the root README documents what we see in Trackio: **loss** trending down on the order of **~25%** over the plotted window (smoothed), **epoch** progressing toward **~2.7**, **LR** warmup-then-decay, **gradient norms** staying in a moderate band, i.e. a **sanity fine-tune** where optimization looks stable, not a mystery box.

We also ship a **static dashboard figure** in-repo for slides and blog embeds: [`assets/training-trackio-dashboard.png`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/assets/training-trackio-dashboard.png).

**Before / after.** The cleanest **before/after** we surface in tooling today is **training loss and optimization metrics** on the HCAPO dataset, plus **episode-level rewards inside collected trajectories** for analysis. A live **A/B rollout score** on the full Docker env after LoRA is the natural next chapter for the demo, and the pipeline is set up so you can **regenerate trajectories** with the adapted policy and compare distributions. For hackathon judging, the **curves + public datasets + reproducible launch script** are the evidence chain we stand behind *right now*.

---

## 4. Reward logic and training pipeline - coherent signal end to end

**Episode reward (macro).** The scalar \(R\) matches [`EpisodeRubric`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/frontier_swe_env/rubrics/episode_rubric.py): weighted **plan score**, mean **frozen subtask** scores, **completion**, and **tool density**, clipped into **[0, 1]** for filtering (e.g. `--min-reward 0.05` in the dataset builder).

**L1 (micro, task-specific).** Each task implements its own verifier output: **regex ratio** on test totals (Postgres), **`reward_json`** fields (notebook), or **`reward_json_score`** with anchors (type checker, libexpat). Same server code paths; different physics.
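For the regex-ratio mode, a minimal sketch (the summary-line pattern here is assumed, not the exact one the Postgres verifier emits):

```python
import re


def regex_ratio(log_text: str) -> float:
    """Extract a pass ratio from a verifier's test-runner summary line.

    The pattern is illustrative; the real task matches its own
    runner's output format.
    """
    m = re.search(r"(\d+)\s+passed\s*/\s*(\d+)\s+total", log_text)
    if not m:
        return 0.0  # no summary line means no credit
    passed, total = int(m.group(1)), int(m.group(2))
    return passed / total if total else 0.0
```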

**Training path (why it should move policy behavior).**

1. [`collect_trajectories.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/collect_trajectories.py) - rollouts into `trajectories/episode_NNN/`.
2. [`backfill_rewards.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/backfill_rewards.py) - repair missing `episode_reward` when needed.
3. [`compute_hindsight_scores.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/compute_hindsight_scores.py) - SGLang `/generate` with bounded logprob windows (memory-safe), MCP-aware **step → subtask** mapping, hindsight \(Q^H\) and smoothing.
4. [`build_hcapo_dataset.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/build_hcapo_dataset.py) - GRPO-style macro advantages + normalized hindsight micro advantages → **JSONL** with **per-step weights**.
5. [`train_hcapo.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/training/train_hcapo.py) + [`launch_hf_space.sh`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/scripts/launch_hf_space.sh) - **weighted CE on assistant tokens** (chunked forward for large models), Trackio reporting.
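Steps 3–4 can be sketched as follows; the normalization blend is illustrative, and `build_hcapo_dataset.py` defines the real formulas:

```python
from statistics import mean, pstdev


def macro_advantages(episode_rewards):
    """GRPO-style macro advantage: center each episode reward on the
    group mean and scale by the group standard deviation."""
    mu, sigma = mean(episode_rewards), pstdev(episode_rewards)
    return [(r - mu) / (sigma or 1.0) for r in episode_rewards]


def step_weights(macro_adv, hindsight_scores):
    """Per-step weight: macro advantage modulated by min-max normalized
    hindsight Q^H. The blend here is a guess at the shape, not the
    exact formula the dataset builder uses."""
    lo, hi = min(hindsight_scores), max(hindsight_scores)
    span = (hi - lo) or 1.0
    return [macro_adv * ((q - lo) / span) for q in hindsight_scores]
```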

The design is coherent: environment reward defines **which episodes matter**; hindsight defines **which tokens inside those episodes** get gradient; and the trainer respects **assistant masks** and **step weights**, so the update is not “one scalar smeared across the whole transcript.” Details and equations live in [`training/README.md`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/training/README.md).
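A tensor-free sketch of that masked, step-weighted loss (shapes are illustrative; `train_hcapo.py` works on token tensors with an optimizer):

```python
import math


def weighted_ce(token_logprobs, assistant_mask, step_weights):
    """Weighted cross-entropy over assistant tokens only.

    token_logprobs: log p(token) under the model at each position;
    assistant_mask: 1 for assistant tokens, 0 for prompt/tool tokens;
    step_weights:   advantage-derived weight broadcast from steps to tokens.
    """
    num = sum(-lp * m * w
              for lp, m, w in zip(token_logprobs, assistant_mask, step_weights))
    den = sum(m * w for m, w in zip(assistant_mask, step_weights)) or 1.0
    return num / den  # masked, weight-normalized NLL
```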

---

## Where to go next

- **Run a Space** from the TL;DR links and narrate **one** subtask submission end to end.
- **Open Trackio** to the named run and zoom the **loss / LR** panel while you talk through the pipeline slide.
- **Clone the repo**, `uv sync`, and use **`./scripts/launch_hf_space.sh`** when you want the full HF training path on your own account.

# Frontier SWE — libexpat to x86-64 Assembly

OpenEnv-shaped FastAPI service hosting the libexpat-to-x86asm task.
OpenEnv-shaped **FastAPI** service for the **libexpat-to-x86asm** task: reimplement **libexpat 2.6.4** in **x86-64 assembly**, producing `/app/asm-port/libexpat.so` with the **expat C ABI**. The verifier compares against reference C libexpat, runs upstream tests and benchmarks, and writes `/logs/verifier/reward.json` (correctness and performance blend; hard fail to `0.0` on anti-cheat or missing `.so`).

- Source repo: <https://github.com/3xcaffeine/frontier-swe-openenv>
- Container image: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest`
- Health: `/health`
- MCP JSON-RPC: `/mcp`
## The task in depth

Deployed automatically from `main` via the `sync-hf-spaces` workflow.
The agent’s deliverable is a **shared library** built from **`.s` / `.asm`** sources under **`/app/asm-port/`**, exporting symbols such as **`XML_ParserCreate`** so the upstream **expat** test suite can link against it. There is **no C compiler** in the agent environment; the verifier may compile reference C code for comparison. Scoring combines **weighted test pass rates** with **benchmark timing ratios** (reference time vs agent time) into a single **`score`** in **`reward.json`**, with explicit anti-cheat checks (no `dlopen` of system libexpat, no smuggled C core files, etc.). The server treats that file in **`reward_json_score`** mode with anchors **`(0.0, 1.0)`**.
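A minimal sketch of `reward_json_score` anchoring, assuming a `score`/`status` payload shape (any field name beyond those mentioned in this README is a guess):

```python
def score_from_reward_json(payload: dict, anchors=(0.0, 1.0)) -> float:
    """Map the verifier's reward.json score onto [0, 1] using anchors.

    A hard fail (anti-cheat trip, missing .so) collapses the score to 0.0;
    otherwise the raw score is linearly rescaled between the anchors.
    """
    if payload.get("status") == "fail":
        return 0.0
    lo, hi = anchors
    raw = float(payload.get("score", lo))
    return max(0.0, min(1.0, (raw - lo) / (hi - lo)))
```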

## How this maps to the monorepo

- **`tasks/libexpat-to-x86asm/`** — Instructions, encrypted or staged toolchain bundles, **`tests/`** with **`test.sh`**, **`compute_reward.py`**, and benchmark XML generators.
- **`frontier_swe_env/tasks/libexpat_to_x86asm.py`** — **`TaskConfig`**: workspace **`/app/asm-port`**, gate script, verifier command, JSON path and anchors, CPU/memory hints, and judge context strings.
- **`spaces/libexpat-to-x86asm/`** — This Space and manifest.

See [**Task assets and runtime configuration**](https://github.com/3xcaffeine/frontier-swe-openenv#task-assets-and-runtime-configuration) in the root README.

## Features

- **Assembly port workspace**: `/app/asm-port` with staged toolchain and bundles (see gate checks in manifest).
- **Structured L1**: Normalised score from `reward.json`; gates for writable workspace, headers, `nasm` / `as` / `ld`, and staged artifacts.
- **LLM rubric layers**: L2 code review and L3 plan review when grader env vars are set.
- **MCP tools**: `submit_plan`, `submit_subtask`, `get_status`, `advance`.

## HTTP API

| Endpoint | Notes |
| --- | --- |
| `GET /health` | Liveness. |
| `POST /reset`, `POST /step`, `GET /state` | OpenEnv Gym-style control. |
| `POST /mcp` | OpenEnv JSON-RPC MCP. |
| `/tools/mcp` | FastMCP Streamable HTTP. |

## Quick start (Docker)

```bash
docker run --rm -p 8000:8000 \
ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest
```

This task is CPU- and memory-sensitive; the manifest requests **4 CPUs** and **8192 MiB** where the platform allows.

Optional grader configuration for LLM rubric layers:

```bash
docker run --rm -p 8000:8000 \
-e FSWE_GRADER_MODEL=... \
-e FSWE_GRADER_API_URL=... \
-e FSWE_GRADER_API_KEY=... \
ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest
```

## Python client (host)

```python
import asyncio
from frontier_swe_env.client import FrontierSweEnv
from frontier_swe_env.models import FrontierSweAction


async def main():
    client = FrontierSweEnv(base_url="http://localhost:8000")
    await client.connect()
    try:
        await client.reset()
        await client.step(FrontierSweAction(message="Continue the assembly port."))
    finally:
        await client.close()


asyncio.run(main())
```

## Task manifest

[`openenv.yaml`](openenv.yaml) — episode timeout, L1 timeout, reward field anchors, rubric layers, metrics. Task sources: `tasks/libexpat-to-x86asm/`.
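As a rough illustration of the manifest's shape (every field name and value below is assumed for the sketch, not copied from the real file):

```yaml
# illustrative only — see the actual openenv.yaml in this Space for real keys
episode_timeout_s: 5400
l1_timeout_s: 1800
reward:
  field: score
  anchors: [0.0, 1.0]
rubric_layers: [gates, l1, l2, l3]
metrics: [tests_passed, benchmark_ratio]
```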

## Deployment

- **Image**: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-libexpat-to-x86asm:latest`
- **Source**: [3xcaffeine/frontier-swe-openenv](https://github.com/3xcaffeine/frontier-swe-openenv)
- **Sync**: HF Space updated from `main` after successful GHCR build.

Benchmark context: [FrontierSWE — libexpat to x86-64 assembly](https://www.frontierswe.com/libexpat-to-x86asm).

# Frontier SWE — Notebook Compression

OpenEnv-shaped FastAPI service hosting the notebook-compression task.
OpenEnv-shaped **FastAPI** service for the **notebook-compression** task: build a fit / compress / decompress pipeline for Jupyter notebooks inside a Linux workspace, with multi-layer rubric scoring and a structured `reward.json` written by the verifier.

- Source repo: <https://github.com/3xcaffeine/frontier-swe-openenv>
- Container image: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest`
- Health: `/health`
- MCP JSON-RPC: `/mcp`
## The task in depth

Deployed automatically from `main` via the `sync-hf-spaces` workflow.
The agent needs to ship an executable **`/app/run`** with three subcommands: **`fit`** (train or build artifacts from a **visible** corpus only), **`compress`**, and **`decompress`**. At scoring time the agent does not see the hidden corpus: the verifier checks **byte-for-byte** recovery of every notebook file. Compression quality is summarised as a geometric mean of size ratios; hard failures (round-trip mismatch, crashes, invalid `reward.json` status) collapse the L1 signal to zero. That logic lives in the repo under [`tasks/notebook-compression/tests/`](https://github.com/3xcaffeine/frontier-swe-openenv/tree/main/tasks/notebook-compression/tests) (shell driver plus [`compute_reward.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/tasks/notebook-compression/tests/compute_reward.py)), which writes **`/logs/verifier/reward.json`** for the server to read.
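A sketch of that scoring shape, with the hard-fail collapse and the geometric mean (how `compute_reward.py` maps the mean onto a final score is not reproduced here):

```python
import math


def l1_signal(size_pairs, round_trips_ok: bool) -> float:
    """size_pairs: (compressed_bytes, original_bytes) per hidden notebook.

    Any byte-level round-trip mismatch collapses the signal to zero;
    otherwise return the geometric mean of compressed/original ratios
    (lower means better compression).
    """
    if not round_trips_ok or not size_pairs:
        return 0.0
    ratios = [c / o for c, o in size_pairs]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```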

## How this maps to the monorepo

- **`tasks/notebook-compression/`** — Authoritative instructions, verifier, and reward computation; copied into the image (for example **`/opt/verifier/test.sh`** and data mounts).
- **`frontier_swe_env/tasks/notebook_compression.py`** — Registers **`TaskConfig`** with `l1_score_mode="reward_json"`, the container test command, long L1 timeouts, gate path, and prose for L2/L3 judges. The running server selects it when `FSWE_TASK_NAME` is `notebook` or `notebook-compression` (see [`__init__.py`](https://github.com/3xcaffeine/frontier-swe-openenv/blob/main/frontier_swe_env/tasks/__init__.py)).
- **`spaces/notebook/`** — This Space: thin Dockerfile, this README, and **`openenv.yaml`** describing the same episode for Hugging Face and external tooling.
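To make the wiring concrete, here is a hypothetical stand-in for that registration; the field names are assumptions based on the bullets above, not the real `TaskConfig` class:

```python
from dataclasses import dataclass


@dataclass
class TaskConfigSketch:
    """Hypothetical stand-in for frontier_swe_env's TaskConfig."""
    task_name: str
    workspace: str
    l1_score_mode: str   # e.g. "reward_json" or "reward_json_score"
    test_command: str    # verifier entry point inside the image
    l1_timeout_s: int    # long L1 timeout for the hidden-corpus run
    gate_script: str     # shell gate checks run before L1


# illustrative values only — paths and timeout are guesses
notebook = TaskConfigSketch(
    task_name="notebook-compression",
    workspace="/app",
    l1_score_mode="reward_json",
    test_command="/opt/verifier/test.sh",
    l1_timeout_s=3600,
    gate_script="/opt/verifier/gate.sh",
)
```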

For the full picture of how task directories and Python configs interact, see the root README section [**Task assets and runtime configuration**](https://github.com/3xcaffeine/frontier-swe-openenv#task-assets-and-runtime-configuration).

## Features

- **Long-horizon SWE**: Plan subtasks, edit code under the configured workspace, submit for scoring.
- **Composite rubric**: Shell gate checks → structured L1 from `/logs/verifier/reward.json` → optional LLM code review (L2) and plan review (L3) → weighted episode reward.
- **MCP tools**: `submit_plan`, `submit_subtask`, `get_status`, `advance` (same contract as other Frontier SWE Spaces).
- **Dual MCP transports**: OpenEnv `POST /mcp` and Streamable HTTP `/tools/mcp` for adapters.

## HTTP API

| Endpoint | Notes |
| --- | --- |
| `GET /health` | Liveness for orchestration and HF health checks. |
| `POST /reset`, `POST /step`, `GET /state` | OpenEnv Gym-style control. |
| `POST /mcp` | OpenEnv JSON-RPC MCP. |
| `/tools/mcp` | FastMCP Streamable HTTP (POST + GET/SSE). |

## Quick start (Docker)

```bash
docker run --rm -p 8000:8000 \
ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest
```

Optional grader configuration for LLM rubric layers:

```bash
docker run --rm -p 8000:8000 \
-e FSWE_GRADER_MODEL=... \
-e FSWE_GRADER_API_URL=... \
-e FSWE_GRADER_API_KEY=... \
ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest
```

## Python client (host)

From the [source repository](https://github.com/3xcaffeine/frontier-swe-openenv), with dependencies installed:

```python
import asyncio
from frontier_swe_env.client import FrontierSweEnv
from frontier_swe_env.models import FrontierSweAction


async def main():
    client = FrontierSweEnv(base_url="http://localhost:8000")
    await client.connect()
    try:
        await client.reset()
        await client.step(FrontierSweAction(message="Continue the task."))
    finally:
        await client.close()


asyncio.run(main())
```

## Task manifest

OpenEnv metadata for judges and tooling: [`openenv.yaml`](openenv.yaml) in this Space (mirrors `spaces/notebook/openenv.yaml` in the GitHub repo). Task sources: `tasks/notebook-compression/`.

## Deployment

- **Image**: `ghcr.io/3xcaffeine/frontier-swe-openenv/frontier-swe-notebook:latest`
- **Source**: [3xcaffeine/frontier-swe-openenv](https://github.com/3xcaffeine/frontier-swe-openenv)
- **Sync**: Pushed from `main` by the repository’s HF Spaces sync workflow after GHCR builds succeed.

Benchmark context: [FrontierSWE — Notebook compression](https://www.frontierswe.com/notebook-compression).