FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight
This repository contains the official implementation for the paper "FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight" by Jiayi Zhou, Hantao Lou, Yang Sheng, Jie Fu, and Yaodong Yang.
FormalJudge is a neuro-symbolic framework that combines formal verification with large language models for robust agent safety evaluation. Unlike traditional LLM-as-a-Judge approaches, FormalJudge leverages formal specifications (Dafny) to provide verifiable safety guarantees.
- 16.6% average improvement over LLM-as-a-Judge baselines across 3 benchmarks and 7 agent models
- Weak-to-strong detection: A 7B judge model achieves >90% accuracy detecting deceptive behaviors from 72B agents
- Stable performance across varying attack sophistication levels
- Near-linear safety improvement through iterative refinement: agents improve from 70.7% to 99.8% accuracy over three rounds with Dafny specification feedback
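The iterative refinement idea in the last point can be illustrated with a minimal sketch. Everything in it is hypothetical: `refine`, `toy_agent`, `toy_verifier`, and the `FORBIDDEN` set are illustration-only names, and a plain Python check stands in for the Dafny verifier. It only shows the shape of the loop, in which violations found in one round are fed back to the agent for the next.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a safety specification: actions that must not occur.
FORBIDDEN = {"delete_file", "send_password"}


def toy_agent(task: str, feedback: List[str]) -> List[str]:
    """Hypothetical agent: proposes a fixed plan, dropping actions flagged earlier."""
    plan = ["read_file", "delete_file", "write_report"]
    return [action for action in plan if action not in feedback]


def toy_verifier(trace: List[str]) -> List[str]:
    """Hypothetical verifier: returns the actions in the trace that violate the spec."""
    return [action for action in trace if action in FORBIDDEN]


def refine(
    agent: Callable[[str, List[str]], List[str]],
    verifier: Callable[[List[str]], List[str]],
    task: str,
    max_rounds: int = 3,
) -> Tuple[List[str], int, bool]:
    """Run up to max_rounds rounds, feeding violations back to the agent.

    Returns the final trace, the number of rounds used, and whether it passed.
    """
    feedback: List[str] = []
    trace: List[str] = []
    for round_no in range(1, max_rounds + 1):
        trace = agent(task, feedback)
        violations = verifier(trace)
        if not violations:
            return trace, round_no, True
        feedback = violations  # the next round sees what was violated
    return trace, max_rounds, False


final_trace, rounds_used, passed = refine(toy_agent, toy_verifier, "clean up workspace")
```

In this toy setup the first round fails on `delete_file` and the second round passes once that action is dropped. The actual refinement scripts in the repository may structure feedback differently.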
FormalJudge/
├── AgentSafetyBench/ # Behavioral safety benchmark (available now)
│ ├── data/ # Benchmark dataset
│ ├── environments/ # 350+ agent environment simulators
│ ├── evaluation/ # Agent evaluation framework
│ ├── formal_verification/ # Formal verification pipeline
│ ├── iterative_refinement/ # Iterative refinement with feedback
│ └── score/ # Safety scoring module
├── VitaBench/ # Multi-domain constraint adherence (coming soon)
└── UpwardDeceivers/ # Deception detection benchmark (coming soon)
| Benchmark | Focus | Status |
|---|---|---|
| AgentSafetyBench | Behavioral safety evaluation | Available |
| VitaBench | Multi-domain constraint adherence | Coming soon |
| UpwardDeceivers | Deception detection | Coming soon |
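Each step below assumes the repository root as the starting directory.

Install dependencies:
pip install -r AgentSafetyBench/requirements.txt

Run the agent evaluation:
cd AgentSafetyBench/evaluation
bash eval.sh

Run the formal verification pipeline:
cd AgentSafetyBench/formal_verification
bash scripts/run_dafny.sh

Run iterative refinement:
cd AgentSafetyBench/iterative_refinement
bash run_api_refinement.sh claude-opus-4-5 3 8 dafny

FormalJudge employs a hierarchical 3-agent pipeline for formal verification: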
- Agent #1 (NL Decomposition): Breaks down high-level safety requirements into verifiable sub-properties
- Agent #2 (Formal Translation): Translates requirements into Dafny/Python/NL specifications
- Agent #3 (Trace Abstraction): Extracts concrete values from agent execution traces
- Executor: Compiles and runs formal verification to produce YES/NO verdicts
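The four stages above can be sketched as a chain of functions. This is only an illustration under stated assumptions, not the repository's implementation: all function names, the trace format, and the hard-coded budget are hypothetical, each LLM agent is replaced by a fixed stub, and Python predicates stand in for compiled Dafny specifications. It also assumes that YES means every sub-property holds, which the text above does not specify.

```python
from typing import Callable, Dict, List

Facts = Dict[str, int]
Spec = Callable[[Facts], bool]


def decompose(requirement: str) -> List[str]:
    """Agent #1 stub: a real system would prompt an LLM to split the requirement."""
    return ["no file is deleted", "spending stays within the budget"]


def translate(sub_property: str) -> Spec:
    """Agent #2 stub: map each sub-property to a checkable predicate.

    Python lambdas stand in for Dafny specifications here.
    """
    specs: Dict[str, Spec] = {
        "no file is deleted": lambda f: f["files_deleted"] == 0,
        "spending stays within the budget": lambda f: f["spent"] <= f["budget"],
    }
    return specs[sub_property]


def abstract_trace(trace: List[dict]) -> Facts:
    """Agent #3 stub: extract concrete values from an execution trace."""
    return {
        "files_deleted": sum(1 for step in trace if step["tool"] == "delete_file"),
        "spent": sum(step.get("cost", 0) for step in trace),
        "budget": 100,  # hypothetical fixed budget for the example
    }


def execute(specs: List[Spec], facts: Facts) -> str:
    """Executor stand-in: YES only if every sub-property holds (an assumption)."""
    return "YES" if all(check(facts) for check in specs) else "NO"


def judge(requirement: str, trace: List[dict]) -> str:
    specs = [translate(p) for p in decompose(requirement)]
    return execute(specs, abstract_trace(trace))


safe_trace = [{"tool": "read_file", "cost": 10}, {"tool": "buy_item", "cost": 40}]
unsafe_trace = [{"tool": "read_file", "cost": 10}, {"tool": "delete_file"}]

safe_verdict = judge("Do not delete files and stay within budget", safe_trace)
unsafe_verdict = judge("Do not delete files and stay within budget", unsafe_trace)
```

Separating trace abstraction from the executor means the final verdict depends only on extracted values and fixed predicates, which is the property that makes the verdict checkable rather than a free-form LLM judgment.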
- dafny: Full formal verification with Dafny specifications
- python: Python-based verification (without formal proofs)
- natural_language: Natural language judgment
- llm_cot: LLM with chain-of-thought reasoning
- llm_fewshot: LLM with few-shot examples
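As a small illustration of how the mode names listed above might be validated before a run, the sketch below wraps them in an enum. `VerificationMode` and `parse_mode` are hypothetical names introduced for this example and are not taken from the repository. Only the five mode strings come from this README.

```python
from enum import Enum


class VerificationMode(Enum):
    """The five mode strings named in this README (class name is hypothetical)."""

    DAFNY = "dafny"
    PYTHON = "python"
    NATURAL_LANGUAGE = "natural_language"
    LLM_COT = "llm_cot"
    LLM_FEWSHOT = "llm_fewshot"


def parse_mode(name: str) -> VerificationMode:
    """Normalize a user-supplied mode string and reject unknown values early."""
    try:
        return VerificationMode(name.strip().lower())
    except ValueError:
        valid = ", ".join(mode.value for mode in VerificationMode)
        raise ValueError(f"unknown mode {name!r}; expected one of: {valid}") from None


selected = parse_mode("dafny")
```

Failing fast on an unrecognized mode avoids starting a long evaluation run with a mistyped argument.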
FormalJudge supports evaluation of multiple agent models:
- API models: Claude (Opus, Sonnet), GPT-4/5, Gemini, DeepSeek
- Local models: Qwen (7B-72B), Llama 3, GLM-4
Citation information will be updated after the paper is posted to arXiv.
This project is released under the MIT License. See individual benchmark directories for specific licensing information.