Evaluation harness for LLM-based design verification agents. Generates scored trajectories against broken RTL, decomposes reward across five components, emits DPO-ready preference pairs.
Given a buggy RTL module and a known root cause, the harness drives an agent through a 5-step debug trajectory, scores the trajectory deterministically, and persists the trace as JSONL. Trajectories with chosen/rejected pairs become DPO training data.
The simulator boundary is an adapter. Mock for fast iteration, Icarus for free metal, Cocotb+pyuvm for Python-native UVM. Questa and VCS slot in behind the same interface.
Adapter at the simulator boundary. The DV simulator landscape is fragmented (Icarus, Verilator, Questa, VCS, Xcelium). Hardcoding any one couples the harness to a vendor and breaks portability across customer environments. Same trajectory runs against any backend.
Discriminated unions on bug family. Cases are typed by family (FIFO, FSM, arbiter, AXI-Lite). Each subclass enforces family-specific fields at ingestion — FIFO cases require pointer width, FSM cases require state enum and encoding, arbiters require sticky semantics, AXI-Lite requires channel and violation type. Pydantic v2 routes by the family field. Invalid cases fail at load, not at runtime.
Reward decomposition over scalar. A scalar reward hides what the agent did right or wrong. The harness emits five components — root cause, evidence quality, tool use correctness, fix plausibility, no-hallucination — plus a per-step PRM mean folded into the total. Decomposed rewards are diagnostic; scalar rewards are debug-hostile.
Categorical penalties for bright-line violations only. Modifying forbidden targets (scoreboards, monitors, testbenches) triggers a fixed scalar penalty. Fuzzy gaming detection is not handled in the reward function — it belongs in the trajectory audit layer where the agent can't optimize against it.
Policy-enforced execution. Simulator runs accept a SimulationPolicy with watchdog timeouts, per-process memory limits where supported, maximum retained log bytes, allowed write roots, and protected verification-asset tokens. The mock, Icarus, and Cocotb adapters return a structured SimulationReport in raw_artifacts, with stdout parsing marked as fallback evidence.
JSONL trace persistence. Append-only, grep-able, replayable. No ORM ceremony for what is fundamentally a log.
For each case the agent executes:
- Baseline simulation on broken RTL — capture the failure signature
- Log analysis — filter for UVM_ERROR, ASSERTION FAILED, FATAL
- RTL inspection — scan for configured bug signatures
- Fix proposal
- Re-run — measure coverage delta and final reward
R_total = w_rc·R_root_cause
+ w_eq·R_evidence
+ w_pr·R_prm_mean
+ w_fp·R_fix_plausibility
+ w_tu·R_tool_use
− Σ penalties
Weights sum to 1.0, asserted at module load. Penalties fire on protocol-level violations. PRM mean injects per-step process reward so trajectory-level scoring is sensitive to reasoning quality, not just final outcome.
200 cases across four bug families, generated from hand-authored blueprints against the discriminated union schemas. Each case validates end-to-end through the harness before inclusion. Families cover the four primitives of digital design — storage, protocol, sequential, concurrent.
| Family | Cases | Tests |
|---|---|---|
| FIFO buffers | 50 | Pointer arithmetic, full/empty flag races, overflow/underflow |
| AXI-Lite | 50 | Handshake (valid/ready ordering), address phase, response codes |
| FSM controllers | 50 | Transitions, stuck states, encoding width, default-case latches |
| Round-robin | 50 | Fairness, sticky grants, last-granted rotation |
Each case generates one DPO preference pair (chosen fix vs rejected fix). 200 pairs is the floor for QLoRA + DPO on a 7B base — enough to measurably shift behavior without overfitting to a single bug class.
| Layer | Choice | Why |
|---|---|---|
| Orchestrator | Python 3.12 + FastAPI | Async, boring, fast to ship |
| Packaging | uv | Fast resolves, lockfile reproducibility |
| Schemas | Pydantic v2 | Discriminated unions, strict validation |
| Sim (free) | Icarus Verilog | Real metal, no license |
| Sim (Python) | Cocotb + pyuvm | Pythonic UVM, integrates directly |
| Preference learning | PyTorch (DPO) | Offline, no reward model to train, no rollouts |
sudo apt update && sudo apt install iverilog -y
curl -LsSf https://astral.sh/uv/install.sh | sh
cd backend
uv sync
uv run smoke_test.py
uv run pytest
uvx ruff check .
uv run python scripts/run_demo_trajectories.pyExpected:
AXI valid drops before ready... R_Total: 0.99 ok
FSM stuck in IDLE... R_Total: 0.99 ok
UART FIFO overflow write... R_Total: 0.98 ok
Suite complete. Results saved to smoke_test_results.json
To generate live LLM trajectories (requires OPENAI_API_KEY):
cd backend
OPENAI_API_KEY=sk-... PYTHONPATH=. uv run python scripts/run_demo_trajectories.pyTraces are saved to backend/traces/demo/*.openai.trajectory.json.
- Adapter boundary with Mock active, Icarus implemented, and Cocotb/pyuvm scaffolded behind the same simulator interface
- Discriminated union schemas for FIFO / FSM / arbiter / AXI-Lite cases
- Family-level design pattern schemas for canonical FIFO / FSM / arbiter structure
- Reward engine: 5-component decomposition + PRM mean injection, weight invariant asserted at load, regex word-boundary substring matching, clamped at zero
- Safety layer: workspace diff audit, path-scope audit, absolute-path/traversal/symlink-escape rejection, tripwire audit, structured simulation reports, bounded retained logs, simulator watchdog policy, and process memory-limit hooks
- R₂ holdout telemetry and R₁/R₂ gap tracking for threshold-hugging detection
- Unbiased pass@k estimator for multi-sample evaluation reporting
- Supabase persistence verified against
eval_runs, with JSONL fallback for local/offline runs - DPO primitives: preference-pair generation and beta-regularized DPO loss
- GitHub Actions CI for locked install, pytest, and ruff
- Reward, schema, storage, workspace-audit, metrics, and safety test suite
The harness supports a live OpenAI backend alongside the deterministic baseline:
from app.services.agent_runner import run_agent_on_case
trajectory = run_agent_on_case(
case,
agent_backend="openai",
model="gpt-4o",
simulator_name="mock",
)The DockerAgentBackend routes the same call through a hardened per-trajectory container:
trajectory = run_agent_on_case(
case,
agent_backend="docker",
model="dv-eval-harness:trajectory",
simulator_name="mock",
)Container hardening posture: --read-only, --cap-drop=ALL, --security-opt=no-new-privileges, --pids-limit=100, --tmpfs /tmp:size=64M, --tmpfs /workspace:size=128M,exec, --memory={limit}m, --cpus=1.0. The API key is injected as an env var reference — never written to the filesystem or included in trajectory JSON.
Every LLM call is logged with endpoint, model, token counts, and wall-clock duration in submission.metadata. The audit trail is deterministic: same case + same model produces the same metadata structure.
Bring-your-own-model: LLMAgent is the base class. Swap the provider by subclassing and overriding step(). TrajectoryAgent (OpenAI) and GeneratorAgent (case synthesis) are the two v1 implementations.
Current (v1): blueprint-based deterministic generator. Each bug family has a typed blueprint (FIFOBlueprint, FSMBlueprint, ArbiterBlueprint, AXIBlueprint) that parameterizes bug injection — pointer width, state encoding, grant policy, handshake violation type. The generator produces validated DVCase instances that pass full schema checks before inclusion.
v2 planned: mutation-rule based generation from source RTL (CVA6, RTLLM corpus). A rule describes a structural transformation (e.g. blocking-to-nonblocking assignment, missing reset branch, off-by-one in pointer arithmetic). An AST-aware mutator applies the rule to real RTL and produces a case with a ground-truth root cause. An LLM verifies the testbench catches the injected bug before the case is admitted. This eliminates hand-authored blueprints as the scale ceiling.
- Network egress lockdown — Docker network policy to block outbound connections from inside the trajectory container; model calls route through a pinned sidecar proxy
- AST-aware semantic mutator — tree-sitter or slang parser; mutations preserve syntactic validity and target semantically meaningful constructs (assignments, resets, port directions, state transitions)
- LLM-assisted mutation generation — use a generator model to propose mutation rules from RTL, filter by testbench catch rate; replaces hand-authored blueprints
- On-prem / customer-deployed model adapter —
LLMAgentsubclass targeting a local inference endpoint; allowlist extended to private HTTPS hosts - Multi-model comparative trajectory evaluation — run N backends on the same case, compare
r_totaldistributions; feed divergence into DPO pair selection
- Validate Cocotb + pyuvm with executable Python testbenches and make it selectable from the API/CLI
- Wire family-level design pattern schemas into structural conformance scoring beyond substring checks
- Replace stdout fallback parsing with simulator-native XML/JSON/UCDB ingestion where available
- Expand trajectory audit layer for deeper forensic gaming detection (CoT-action coherence, fix-before-evidence, hallucinated citations, conditional-independence violations)
- QLoRA + DPO fine-tune on Mistral 7B / Llama 3 8B against harness-generated preference data
- Questa and VCS adapters
- Next.js dashboard for Supabase-backed trajectory leaderboard
Built as a focused demonstration of the eval harness layer for DV agents. Architecture decisions are deliberate; coverage is intentionally narrow (four bug families) to ship a working end-to-end loop before scaling cases. The roadmap items are not vaporware — each one names a specific failure mode in the current implementation that the upgrade addresses.
Anthony Eugene Lewallen
End-to-End AI Systems Engineer · Model Internals → MLOps + Agentic Systems
From the Metal to the Agent Level
B.S. Mathematics Operations Research, Summa Cum Laude — American Public University
MAS-CS (Software Systems) + MSE-AI — University of Pennsylvania