dv-eval-harness

Evaluation harness for LLM-based design verification agents. Generates scored trajectories against broken RTL, decomposes reward across five components, and emits DPO-ready preference pairs.

What it does

Given a buggy RTL module and a known root cause, the harness drives an agent through a 5-step debug trajectory, scores the trajectory deterministically, and persists the trace as JSONL. Trajectories with chosen/rejected pairs become DPO training data.

The simulator boundary is an adapter: Mock for fast iteration, Icarus for free metal, Cocotb + pyuvm for Python-native UVM. Questa and VCS are planned to slot in behind the same interface.

Why the design looks like this

Adapter at the simulator boundary. The DV simulator landscape is fragmented (Icarus, Verilator, Questa, VCS, Xcelium). Hardcoding any one of them couples the harness to a vendor and breaks portability across customer environments. With an adapter, the same trajectory runs against any backend.
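A minimal sketch of what an adapter boundary like this can look like. The names `SimulatorAdapter`, `SimResult`, `MockSimulator`, and `run_baseline` are hypothetical and chosen for illustration; the harness's real interface may differ.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SimResult:
    """Hypothetical result record; field names are illustrative only."""
    passed: bool
    log: str
    raw_artifacts: dict = field(default_factory=dict)


class SimulatorAdapter(Protocol):
    """Hypothetical interface that every backend would implement."""
    name: str

    def run(self, rtl_source: str, testbench: str) -> SimResult: ...


class MockSimulator:
    """Stand-in backend: reports a failure when the RTL contains a marker string."""
    name = "mock"

    def __init__(self, bug_marker: str) -> None:
        self.bug_marker = bug_marker

    def run(self, rtl_source: str, testbench: str) -> SimResult:
        failed = self.bug_marker in rtl_source
        log = "UVM_ERROR: scoreboard mismatch" if failed else "TEST PASSED"
        return SimResult(passed=not failed, log=log)


def run_baseline(sim: SimulatorAdapter, rtl_source: str, testbench: str) -> SimResult:
    # The caller depends only on the adapter interface, never on a vendor tool.
    return sim.run(rtl_source, testbench)


broken = run_baseline(MockSimulator("wr_ptr + 2"), "wr_ptr <= wr_ptr + 2;", "tb_fifo")
fixed = run_baseline(MockSimulator("wr_ptr + 2"), "wr_ptr <= wr_ptr + 1;", "tb_fifo")
```

An Icarus or Questa backend would be another class with the same `run` method, so trajectory code stays unchanged when the backend is swapped.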

Discriminated unions on bug family. Cases are typed by family (FIFO, FSM, arbiter, AXI-Lite). Each subclass enforces family-specific fields at ingestion: FIFO cases require pointer width, FSM cases require state enum and encoding, arbiter cases require sticky semantics, and AXI-Lite cases require channel and violation type. Pydantic v2 routes by the family field. Invalid cases fail at load, not midway through a trajectory.
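A sketch of the discriminated-union pattern in Pydantic v2, showing two of the four families. The field names (`case_id`, `pointer_width`, `states`, `encoding`) and the literal family values are assumptions for illustration, not the harness's actual schema.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class FIFOCase(BaseModel):
    family: Literal["fifo"]
    case_id: str
    pointer_width: int = Field(gt=0)  # assumed family-specific field


class FSMCase(BaseModel):
    family: Literal["fsm"]
    case_id: str
    states: list[str]  # assumed representation of the state enum
    encoding: Literal["binary", "one_hot", "gray"]


# Pydantic picks the model by looking at the "family" field.
DVCase = Annotated[Union[FIFOCase, FSMCase], Field(discriminator="family")]
case_adapter = TypeAdapter(DVCase)

fifo = case_adapter.validate_python(
    {"family": "fifo", "case_id": "fifo_001", "pointer_width": 4}
)

# An FSM case without its required fields is rejected at load time.
try:
    case_adapter.validate_python({"family": "fsm", "case_id": "fsm_001"})
    load_error = None
except ValidationError as exc:
    load_error = exc
```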

Reward decomposition over scalar. A scalar reward hides what the agent did right or wrong. The harness emits five components — root cause, evidence quality, tool use correctness, fix plausibility, no-hallucination — plus a per-step PRM mean folded into the total. Decomposed rewards are diagnostic; scalar rewards are debug-hostile.

Categorical penalties for bright-line violations only. Modifying forbidden targets (scoreboards, monitors, testbenches) triggers a fixed scalar penalty. Fuzzy gaming detection is not handled in the reward function — it belongs in the trajectory audit layer where the agent can't optimize against it.

Policy-enforced execution. Simulator runs accept a SimulationPolicy with watchdog timeouts, per-process memory limits where supported, maximum retained log bytes, allowed write roots, and protected verification-asset tokens. The mock, Icarus, and Cocotb adapters return a structured SimulationReport in raw_artifacts, with stdout parsing marked as fallback evidence.
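A rough sketch of how such a policy object might be modeled. Only the name `SimulationPolicy` comes from this README; every field name, default value, and helper function below is an assumption, as is the choice to keep the tail of an oversized log.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class SimulationPolicy:
    timeout_s: float = 30.0                      # watchdog timeout (assumed default)
    max_memory_mb: Optional[int] = 512           # None where the platform has no limit hook
    max_log_bytes: int = 64_000                  # maximum retained log bytes
    allowed_write_roots: tuple[Path, ...] = ()
    protected_tokens: tuple[str, ...] = ("scoreboard", "monitor", "testbench")


def is_write_allowed(policy: SimulationPolicy, target: str) -> bool:
    # Resolve first so "../" traversal and symlinks cannot escape a root.
    resolved = Path(target).resolve()
    return any(resolved.is_relative_to(root.resolve()) for root in policy.allowed_write_roots)


def touches_protected_asset(policy: SimulationPolicy, target: str) -> bool:
    lowered = str(target).lower()
    return any(token in lowered for token in policy.protected_tokens)


def retain_log(policy: SimulationPolicy, log: str) -> str:
    # Keep only the last max_log_bytes bytes of the log.
    data = log.encode("utf-8")
    start = max(0, len(data) - policy.max_log_bytes)
    return data[start:].decode("utf-8", errors="ignore")


policy = SimulationPolicy(allowed_write_roots=(Path("/tmp/ws"),))
```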

JSONL trace persistence. Append-only, grep-able, replayable. No ORM ceremony for what is fundamentally a log.
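Append-only JSONL needs very little code. The sketch below uses only the standard library; the record fields and the file name `trajectories.jsonl` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def append_trace(path: Path, record: dict) -> None:
    # One JSON object per line, opened in append mode so earlier lines are never rewritten.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def replay_traces(path: Path):
    # Stream records back in the order they were written.
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


trace_path = Path(tempfile.mkdtemp()) / "trajectories.jsonl"
append_trace(trace_path, {"case_id": "fifo_001", "step": 1, "action": "baseline_sim"})
append_trace(trace_path, {"case_id": "fifo_001", "step": 2, "action": "log_analysis"})
replayed = list(replay_traces(trace_path))
```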

The trajectory

For each case the agent executes:

1. Baseline simulation on broken RTL: capture the failure signature
2. Log analysis: filter for UVM_ERROR, ASSERTION FAILED, FATAL
3. RTL inspection: scan for configured bug signatures
4. Fix proposal
5. Re-run: measure coverage delta and final reward
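The log-analysis step can be pictured as a simple line filter. This is a sketch, not the harness's parser: the function name and the sample log are invented, and the pattern deliberately omits word boundaries so that `UVM_FATAL` also matches the `FATAL` alternative.

```python
import re

# Markers named in the log-analysis step of the trajectory.
FAILURE_PATTERN = re.compile(r"UVM_ERROR|ASSERTION FAILED|FATAL")


def extract_failure_lines(log: str) -> list[str]:
    # Keep only lines that contain at least one failure marker.
    return [line.strip() for line in log.splitlines() if FAILURE_PATTERN.search(line)]


sample_log = """\
UVM_INFO @ 0: reporter [RNTST] Running test fifo_overflow_test
UVM_ERROR @ 120: scoreboard [SB] expected 0x3c, got 0x00
ASSERTION FAILED: fifo_full implies !wr_en
UVM_INFO @ 200: reporter [DONE] simulation complete
"""
failures = extract_failure_lines(sample_log)
```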

Reward

R_total = w_rc·R_root_cause
        + w_eq·R_evidence
        + w_pr·R_prm_mean
        + w_fp·R_fix_plausibility
        + w_tu·R_tool_use
        − Σ penalties

Weights sum to 1.0, asserted at module load. Penalties fire on protocol-level violations. PRM mean injects per-step process reward so trajectory-level scoring is sensitive to reasoning quality, not just final outcome.
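A small sketch of the formula above. The weight values are placeholders picked only so that they sum to 1.0, and the dictionary keys are hypothetical; the clamp at zero follows the reward-engine note under "What's working".

```python
# Placeholder weights; the real values live in the harness.
WEIGHTS = {
    "root_cause": 0.35,
    "evidence": 0.20,
    "prm_mean": 0.15,
    "fix_plausibility": 0.20,
    "tool_use": 0.10,
}

# Invariant checked once, at module load.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "reward weights must sum to 1.0"


def total_reward(components: dict[str, float], penalties: list[float]) -> float:
    # Weighted sum of the components, minus penalties, clamped at zero.
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return max(0.0, weighted - sum(penalties))


perfect = {name: 1.0 for name in WEIGHTS}
partial = {
    "root_cause": 1.0,
    "evidence": 0.5,
    "prm_mean": 0.8,
    "fix_plausibility": 1.0,
    "tool_use": 1.0,
}
```

Because the components are kept separate until the final sum, a low total can be traced back to the term that caused it.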

Dataset

200 cases across four bug families, generated from hand-authored blueprints against the discriminated union schemas. Each case validates end-to-end through the harness before inclusion. The families map to four basic concerns in digital design: storage (FIFO), protocol (AXI-Lite), sequential control (FSM), and concurrent arbitration (round-robin arbiter).

Family	Cases	Behaviors tested
FIFO buffers	50	Pointer arithmetic, full/empty flag races, overflow/underflow
AXI-Lite	50	Handshake (valid/ready ordering), address phase, response codes
FSM controllers	50	Transitions, stuck states, encoding width, default-case latches
Round-robin arbiters	50	Fairness, sticky grants, last-granted rotation

Each case generates one DPO preference pair (chosen fix vs rejected fix). 200 pairs is the working floor this project targets for QLoRA + DPO on a 7B base: intended to be enough to measurably shift behavior without overfitting to a single bug class.
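To make the preference-pair idea concrete, here is a scalar, standard-library sketch of the per-pair DPO loss, -log σ(β·margin). It is not the harness's PyTorch implementation; the `PreferencePair` fields, the example strings, and the default `beta` are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # e.g. the case description plus failure evidence
    chosen: str    # the preferred fix
    rejected: str  # the dispreferred fix


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    # How much more the policy prefers chosen over rejected, relative to the reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    z = beta * margin
    # -log(sigmoid(z)) written as softplus(-z) for numerical stability.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


pair = PreferencePair(
    prompt="FIFO overflow on write when full",
    chosen="gate wr_en with !full",
    rejected="widen the write pointer",
)
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-3.0, -8.0, -5.0, -5.0)
```

When the policy matches the reference the loss is log 2; it falls as the policy shifts probability toward the chosen fix.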

Stack

Layer	Choice	Why
Orchestrator	Python 3.12 + FastAPI	Async, boring, fast to ship
Packaging	uv	Fast resolves, lockfile reproducibility
Schemas	Pydantic v2	Discriminated unions, strict validation
Sim (free)	Icarus Verilog	Real metal, no license
Sim (Python)	Cocotb + pyuvm	Pythonic UVM, integrates directly
Preference learning	PyTorch (DPO)	Offline, no reward model to train, no rollouts

Install

sudo apt update && sudo apt install iverilog -y
curl -LsSf https://astral.sh/uv/install.sh | sh
cd backend
uv sync

Run

uv run smoke_test.py
uv run pytest
uvx ruff check .
uv run python scripts/run_demo_trajectories.py

Expected output from smoke_test.py:

AXI valid drops before ready...     R_Total: 0.99  ok
FSM stuck in IDLE...                R_Total: 0.99  ok
UART FIFO overflow write...         R_Total: 0.98  ok

Suite complete. Results saved to smoke_test_results.json

Generating Demo Traces

To generate live LLM trajectories (requires OPENAI_API_KEY):

cd backend
OPENAI_API_KEY=sk-... PYTHONPATH=. uv run python scripts/run_demo_trajectories.py

Traces are saved to backend/traces/demo/*.openai.trajectory.json.

What's working

Adapter boundary with Mock active, Icarus implemented, and Cocotb/pyuvm scaffolded behind the same simulator interface
Discriminated union schemas for FIFO / FSM / arbiter / AXI-Lite cases
Family-level design pattern schemas for canonical FIFO / FSM / arbiter structure
Reward engine: 5-component decomposition + PRM mean injection, weight invariant asserted at load, regex word-boundary substring matching, reward clamped at zero
Safety layer: workspace diff audit, path-scope audit, absolute-path/traversal/symlink-escape rejection, tripwire audit, structured simulation reports, bounded retained logs, simulator watchdog policy, and process memory-limit hooks
R₂ holdout telemetry and R₁/R₂ gap tracking for threshold-hugging detection
Unbiased pass@k estimator for multi-sample evaluation reporting
Supabase persistence verified against eval_runs, with JSONL fallback for local/offline runs
DPO primitives: preference-pair generation and beta-regularized DPO loss
GitHub Actions CI for locked install, pytest, and ruff
Test suites covering reward, schema, storage, workspace audit, metrics, and safety
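For reference, the unbiased pass@k estimator mentioned in the list above is commonly written as 1 − C(n−c, k) / C(n, k), where n is the number of samples and c the number that passed. A minimal sketch follows; the function name and argument checks are illustrative, not taken from the harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct, given c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


estimate = pass_at_k(n=10, c=3, k=1)
```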

Real-LLM Integration (v1)

The harness supports a live OpenAI backend alongside the deterministic baseline:

from app.services.agent_runner import run_agent_on_case

trajectory = run_agent_on_case(
    case,
    agent_backend="openai",
    model="gpt-4o",
    simulator_name="mock",
)

The DockerAgentBackend routes the same call through a hardened per-trajectory container:

trajectory = run_agent_on_case(
    case,
    agent_backend="docker",
    model="dv-eval-harness:trajectory",
    simulator_name="mock",
)

Container hardening posture: --read-only, --cap-drop=ALL, --security-opt=no-new-privileges, --pids-limit=100, --tmpfs /tmp:size=64M, --tmpfs /workspace:size=128M,exec, --memory={limit}m, --cpus=1.0. The API key is injected as an env var reference — never written to the filesystem or included in trajectory JSON.
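One way to picture this posture is as the argument list a backend would hand to `docker run`. The sketch below only assembles that list; the function name is hypothetical, `--rm` is an addition not listed above, and passing `--env NAME` without a value (so Docker copies the variable from the host environment) is one interpretation of "env var reference".

```python
def build_docker_argv(image: str, memory_mb: int, api_key_env: str = "OPENAI_API_KEY") -> list[str]:
    return [
        "docker", "run", "--rm",              # --rm is an assumption, not from the README
        "--read-only",
        "--cap-drop=ALL",
        "--security-opt=no-new-privileges",
        "--pids-limit=100",
        "--tmpfs", "/tmp:size=64M",
        "--tmpfs", "/workspace:size=128M,exec",
        f"--memory={memory_mb}m",
        "--cpus=1.0",
        "--env", api_key_env,                 # name only; the key value never appears in argv
        image,
    ]


argv = build_docker_argv("dv-eval-harness:trajectory", memory_mb=512)
```

Keeping the key out of the argument list also keeps it out of process listings and any logged command line.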

Every LLM call is logged with endpoint, model, token counts, and wall-clock duration in submission.metadata. The audit trail is deterministic: same case + same model produces the same metadata structure.

Bring-your-own-model: LLMAgent is the base class. Swap the provider by subclassing and overriding step(). TrajectoryAgent (OpenAI) and GeneratorAgent (case synthesis) are the two v1 implementations.
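A sketch of the subclassing pattern. The `LLMAgent` class below is a stand-in written for this example, since the real base class and the exact signature of `step()` are not shown here; `LocalEndpointAgent`, `call_log`, and the endpoint URL are likewise hypothetical.

```python
import time
from abc import ABC, abstractmethod
from typing import Callable


class LLMAgent(ABC):
    """Stand-in for the harness base class; the real signature may differ."""

    def __init__(self, model: str) -> None:
        self.model = model
        self.call_log: list[dict] = []

    @abstractmethod
    def step(self, observation: str) -> str:
        """Return the next action for the given observation."""


class LocalEndpointAgent(LLMAgent):
    """Hypothetical provider that delegates to an injected completion function."""

    def __init__(self, model: str, endpoint: str, complete: Callable[[str], str]) -> None:
        super().__init__(model)
        self.endpoint = endpoint
        self._complete = complete

    def step(self, observation: str) -> str:
        started = time.perf_counter()
        reply = self._complete(observation)
        # Record per-call metadata: endpoint, model, and wall-clock duration.
        self.call_log.append(
            {
                "endpoint": self.endpoint,
                "model": self.model,
                "duration_s": time.perf_counter() - started,
            }
        )
        return reply


agent = LocalEndpointAgent(
    "local-7b",
    "https://llm.internal.example/v1",
    lambda prompt: f"inspect: {prompt}",
)
action = agent.step("UVM_ERROR @ 120: scoreboard mismatch")
```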

Programmatic Case Generation

Current (v1): blueprint-based deterministic generator. Each bug family has a typed blueprint (FIFOBlueprint, FSMBlueprint, ArbiterBlueprint, AXIBlueprint) that parameterizes bug injection — pointer width, state encoding, grant policy, handshake violation type. The generator produces validated DVCase instances that pass full schema checks before inclusion.
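A toy illustration of a deterministic blueprint. Only the name `FIFOBlueprint` appears in this README; its fields, the `generate` method, the bug labels, and the output dictionary shape are assumptions made for the sketch.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FIFOBlueprint:
    pointer_widths: tuple[int, ...]
    bug_kinds: tuple[str, ...]

    def generate(self) -> list[dict]:
        # Enumerate the parameter grid in a fixed order so repeated runs give identical cases.
        cases = []
        for index, (width, bug) in enumerate(product(self.pointer_widths, self.bug_kinds), start=1):
            cases.append(
                {
                    "family": "fifo",
                    "case_id": f"fifo_{index:03d}",
                    "pointer_width": width,
                    "root_cause": bug,
                }
            )
        return cases


blueprint = FIFOBlueprint(pointer_widths=(3, 4), bug_kinds=("full_flag_race", "ptr_off_by_one"))
cases = blueprint.generate()
```

In the harness, each generated record would then go through schema validation before being admitted to the dataset.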

v2 planned: mutation-rule based generation from source RTL (CVA6, RTLLM corpus). A rule describes a structural transformation (e.g. blocking-to-nonblocking assignment, missing reset branch, off-by-one in pointer arithmetic). An AST-aware mutator applies the rule to real RTL and produces a case with a ground-truth root cause. An LLM verifies the testbench catches the injected bug before the case is admitted. This eliminates hand-authored blueprints as the scale ceiling.

v2 Roadmap

Network egress lockdown — Docker network policy to block outbound connections from inside the trajectory container; model calls route through a pinned sidecar proxy
AST-aware semantic mutator — tree-sitter or slang parser; mutations preserve syntactic validity and target semantically meaningful constructs (assignments, resets, port directions, state transitions)
LLM-assisted mutation generation — use a generator model to propose mutation rules from RTL, filter by testbench catch rate; replaces hand-authored blueprints
On-prem / customer-deployed model adapter — LLMAgent subclass targeting a local inference endpoint; allowlist extended to private HTTPS hosts
Multi-model comparative trajectory evaluation — run N backends on the same case, compare r_total distributions; feed divergence into DPO pair selection

What's next

Validate Cocotb + pyuvm with executable Python testbenches and make it selectable from the API/CLI
Wire family-level design pattern schemas into structural conformance scoring beyond substring checks
Replace stdout fallback parsing with simulator-native XML/JSON/UCDB ingestion where available
Expand trajectory audit layer for deeper forensic gaming detection (CoT-action coherence, fix-before-evidence, hallucinated citations, conditional-independence violations)
QLoRA + DPO fine-tune on Mistral 7B / Llama 3 8B against harness-generated preference data
Questa and VCS adapters
Next.js dashboard for Supabase-backed trajectory leaderboard

Status

Built as a focused demonstration of the eval harness layer for DV agents. Architecture decisions are deliberate; coverage is intentionally narrow (four bug families) to ship a working end-to-end loop before scaling cases. The roadmap items are not vaporware — each one names a specific failure mode in the current implementation that the upgrade addresses.

Anthony Eugene Lewallen
End-to-End AI Systems Engineer · Model Internals → MLOps + Agentic Systems
From the Metal to the Agent Level

B.S. Mathematics Operations Research, Summa Cum Laude — American Public University
MAS-CS (Software Systems) + MSE-AI — University of Pennsylvania
