Post-hoc calibration without retraining for large language models. This toolkit turns a raw prompt into:
- a bounded hallucination risk using the Expectation-level Decompression Law (EDFL), and
- a decision to ANSWER or REFUSE under a target SLA, with transparent math (nats).
- Multi-Provider Support: Works with OpenAI, Anthropic (Claude), OpenRouter, Hugging Face, and Ollama models
- No Retraining Required: Pure inference-time calibration
- Two Deployment Modes:
  - Evidence-based: prompts include evidence/context; rolling priors are built by erasing that evidence
  - Closed-book: prompts have no evidence; rolling priors are built by semantic masking of entities/numbers/titles
- Mathematically Grounded: Based on the EDFL/B2T/ISR framework from the accompanying preprint (see the reference at the end)
- Install & Setup
- Supported Model Providers
- Quick Start Examples
- Core Mathematical Framework
- Understanding System Behavior
- Two Ways to Build Rolling Priors
- API Surface
- Calibration & Validation
- Practical Considerations
- Project Layout
- Deployment Options
# Core requirement
pip install --upgrade openai
# For additional providers (optional)
pip install anthropic # For Claude models
pip install transformers torch # For local Hugging Face models
pip install ollama # For Ollama models
pip install requests # For HTTP-based backends
# For OpenAI
export OPENAI_API_KEY=sk-...
# For Anthropic (Claude)
export ANTHROPIC_API_KEY=sk-ant-...
# For Hugging Face Inference API
export HF_TOKEN=hf_...
The toolkit now supports multiple LLM providers through universal backend adapters:
- OpenAI: GPT-4o, GPT-4o-mini, and other Chat Completions models; requires `OPENAI_API_KEY`
- Anthropic: Claude 3.5 Sonnet, Claude 3 Opus, and other Claude models; requires the `anthropic` package and `ANTHROPIC_API_KEY`
- Hugging Face:
  - Local Transformers: run models locally with `transformers`
  - TGI Server: connect to Text Generation Inference servers
  - Inference API: use hosted models via the Hugging Face API
- Ollama: run any Ollama-supported model locally; supports both the Python SDK and the HTTP API
- OpenRouter: single API for 100+ models from OpenAI, Anthropic, Google, Meta, and more; automatic fallbacks and load balancing; often cheaper than direct API access due to volume aggregation; built-in rate limiting and retry logic
from hallucination_toolkit import OpenAIBackend, OpenAIItem, OpenAIPlanner
backend = OpenAIBackend(model="gpt-4o-mini")
planner = OpenAIPlanner(backend, temperature=0.3)
item = OpenAIItem(
prompt="Who won the 2019 Nobel Prize in Physics?",
n_samples=7,
m=6,
skeleton_policy="closed_book"
)
metrics = planner.run(
[item],
h_star=0.05, # Target 5% hallucination max
isr_threshold=1.0, # Standard ISR gate
margin_extra_bits=0.2, # Safety margin
B_clip=12.0, # Clipping bound
clip_mode="one-sided" # Conservative mode
)
for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"Risk bound: {m.roh_bound:.3f}")
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend
# Use Claude instead of GPT
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
planner = OpenAIPlanner(backend, temperature=0.3)
# Rest of the code remains identical
items = [OpenAIItem(prompt="What is quantum entanglement?", n_samples=7, m=6)]
metrics = planner.run(items, h_star=0.05)
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import HuggingFaceBackend
# Run Llama locally
backend = HuggingFaceBackend(
mode="transformers",
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
device_map="auto" # or "cuda" or "cpu"
)
planner = OpenAIPlanner(backend, temperature=0.3)
# Same evaluation flow
metrics = planner.run([...], h_star=0.05)
from htk_backends import HuggingFaceBackend
# Connect to a Text Generation Inference server
backend = HuggingFaceBackend(
mode="tgi",
tgi_url="http://localhost:8080"
)
planner = OpenAIPlanner(backend, temperature=0.3)
from htk_backends import HuggingFaceBackend
import os
# Use Hugging Face's hosted models
backend = HuggingFaceBackend(
mode="inference_api",
model_id="mistralai/Mistral-7B-Instruct-v0.3",
hf_token=os.environ["HF_TOKEN"]
)
planner = OpenAIPlanner(backend, temperature=0.3)
OpenRouter provides access to 100+ models through a single API, making it ideal for comparing hallucination bounds across providers:
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import OpenRouterBackend
# Access any model through OpenRouter's unified API
backend = OpenRouterBackend(
model="openrouter/auto", # Auto-selects best available model
# model="anthropic/claude-3.5-sonnet", # Or specify exact model
# api_key="...", # Uses OPENROUTER_API_KEY env var if not provided
http_referer="https://your.app", # Optional but recommended
x_title="EDFL Decision Head (prod)", # Optional app identifier
providers={"allow": ["anthropic", "google", "openai"]}, # Optional: limit providers
)
planner = OpenAIPlanner(
backend=backend,
temperature=0.5,
max_tokens_decision=8, # Tiny JSON decision head
q_floor=None, # Or set your prior floor
)
items = [OpenAIItem(
prompt="What is quantum entanglement?",
n_samples=3,
m=6,
skeleton_policy="auto"
)]
metrics = planner.run(
items,
h_star=0.05,
isr_threshold=1.0,
B_clip=12.0,
clip_mode="one-sided"
)
for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"ISR: {m.isr:.3f}, RoH bound: {m.roh_bound:.3f}")
Why OpenRouter for this toolkit?
- Test calibration across many models without managing multiple API keys
- Automatic fallbacks ensure high availability for production deployments
- Cost optimization through intelligent routing
- Perfect for A/B testing different models' hallucination characteristics
from htk_backends import OllamaBackend
# Use any Ollama model
backend = OllamaBackend(
model="llama3.1:8b-instruct",
host="http://localhost:11434" # Default Ollama port
)
planner = OpenAIPlanner(backend, temperature=0.3)
AnthropicBackend(
model="claude-3-5-sonnet-latest", # or any Claude model
api_key=None, # Uses ANTHROPIC_API_KEY env var if None
request_timeout=60.0
)
Requirements: pip install anthropic
OpenRouterBackend(
model="openrouter/auto", # Auto-routing or specific model
api_key=None, # Uses OPENROUTER_API_KEY env var
http_referer="https://your.app", # Recommended for tracking
x_title="Your App Name", # Optional identifier
providers={"allow": ["anthropic", "google"]}, # Optional filtering
)
Requirements: pip install openai (OpenRouter uses an OpenAI-compatible API)
Available models include: `anthropic/claude-3.5-sonnet`, `openai/gpt-4-turbo`, `google/gemini-pro`, `meta-llama/llama-3-70b-instruct`, `mistralai/mixtral-8x7b`; see OpenRouter models for the full list.
The Hugging Face backend supports three operational modes:
HuggingFaceBackend(
mode="transformers",
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
device_map="auto", # GPU allocation strategy
torch_dtype="float16", # Optional: precision setting
trust_remote_code=True, # For custom model code
model_kwargs={} # Additional model parameters
)
Requirements: pip install transformers torch
HuggingFaceBackend(
mode="tgi",
tgi_url="http://localhost:8080", # Your TGI server URL
model_id=None # Not needed for TGI
)
Requirements: pip install requests and a running TGI server
HuggingFaceBackend(
mode="inference_api",
model_id="mistralai/Mistral-7B-Instruct-v0.3",
hf_token="hf_..." # Your Hugging Face token
)
Requirements: pip install requests and a Hugging Face account
OllamaBackend(
model="llama3.1:8b-instruct", # Any Ollama model
host="http://localhost:11434", # Ollama server URL
request_timeout=60.0
)
Requirements: pip install ollama (optional) or pip install requests, and Ollama installed locally
Here's a complete example comparing different providers on the same prompt:
from hallucination_toolkit import OpenAIBackend, OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend, HuggingFaceBackend, OllamaBackend
# Define test prompt
prompt = "What are the main differences between quantum and classical computing?"
item = OpenAIItem(prompt=prompt, n_samples=5, m=6, skeleton_policy="closed_book")
# Test configuration
config = dict(
h_star=0.05,
isr_threshold=1.0,
margin_extra_bits=0.2,
B_clip=12.0,
clip_mode="one-sided"
)
# Compare providers
providers = {
"GPT-4o-mini": OpenAIBackend(model="gpt-4o-mini"),
"Claude-3.5": AnthropicBackend(model="claude-3-5-sonnet-latest"),
"Llama-3.1": HuggingFaceBackend(mode="transformers", model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
"Ollama": OllamaBackend(model="llama3.1:8b-instruct")
}
results = {}
for name, backend in providers.items():
    try:
        planner = OpenAIPlanner(backend, temperature=0.3)
        metrics = planner.run([item], **config)
        results[name] = metrics[0]
        print(f"{name}: {'ANSWER' if metrics[0].decision_answer else 'REFUSE'} (RoH={metrics[0].roh_bound:.3f})")
    except Exception as e:
        print(f"{name}: Error - {e}")
Let the binary event $\mathcal{A}$ denote answering (the model commits to an answer rather than refusing). For each item:
- Build an ensemble of $m$ content-weakened prompts (the rolling priors) $S_1,\dots,S_m$ by erasing evidence or semantically masking key content.
- Information budget:
$$\bar{\Delta} = \tfrac{1}{m}\sum_k \mathrm{clip}_+(\log P(y) - \log S_k(y), B)$$ (one-sided clipping; default $B=12$ nats to prevent outliers while maintaining conservative bounds).
- Prior masses: $q_k = S_k(\mathcal{A})$, with:
  - $\bar{q}=\tfrac{1}{m}\sum_k q_k$ (average prior for the EDFL bound)
  - $q_{\text{lo}}=\min_k q_k$ (worst-case prior for SLA gating)

By EDFL, the achievable reliability is bounded by the information budget: the realized probability $p = P(\mathcal{A})$ satisfies
$$\mathrm{KL}(\mathrm{Ber}(p)\,\|\,\mathrm{Ber}(\bar{q})) \le \bar{\Delta}.$$
Thus the hallucination risk (error) is bounded by
$$\overline{\mathrm{RoH}} = \min\{\, h : \mathrm{KL}(\mathrm{Ber}(1-h)\,\|\,\mathrm{Ber}(\bar{q})) \le \bar{\Delta} \,\},$$
the smallest risk consistent with the available information. For a target hallucination rate $h^*$, the decision gate is:
- Bits-to-Trust: $\mathrm{B2T} = \mathrm{KL}(\mathrm{Ber}(1-h^*)\,\|\,\mathrm{Ber}(q_{\text{lo}}))$
- Information Sufficiency Ratio: $\mathrm{ISR} = \bar{\Delta}/\mathrm{B2T}$
- ANSWER iff $\mathrm{ISR}\ge 1$ and $\bar{\Delta} \ge \mathrm{B2T} + \text{margin}$ (default margin ≈ 0.2 nats)

Why two priors? The gate uses the worst-case $q_{\text{lo}}$ for strict SLA compliance. The RoH bound uses the average $\bar{q}$ per EDFL theory. This dual approach ensures conservative safety while providing realistic risk bounds.
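For concreteness, here is a minimal, self-contained sketch of these quantities. It mirrors the formulas above rather than the toolkit's internal implementation, and the inputs (`logP_y`, `logS_y`, `q_list`) are illustrative numbers that you would normally estimate by sampling the backend:

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q)) in nats, with guards against log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def roh_bound(delta_bar: float, q_bar: float, tol: float = 1e-9) -> float:
    """Smallest h with KL(Ber(1-h) || Ber(q_bar)) <= delta_bar, by bisection."""
    if bernoulli_kl(1.0, q_bar) <= delta_bar:      # enough budget to reach p = 1
        return 0.0
    lo, hi = 0.0, 1.0 - q_bar                      # KL decreases in h on this interval
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(1.0 - mid, q_bar) > delta_bar:
            lo = mid
        else:
            hi = mid
    return hi

def decide(logP_y, logS_y, q_list, h_star=0.05, B=12.0, margin=0.2):
    """Compute Delta_bar, B2T, ISR, the RoH bound, and the ANSWER/REFUSE gate."""
    delta_bar = sum(min(max(logP_y - ls, 0.0), B) for ls in logS_y) / len(logS_y)
    q_lo, q_bar = min(q_list), sum(q_list) / len(q_list)
    b2t = bernoulli_kl(1.0 - h_star, q_lo)         # Bits-to-Trust (worst-case prior)
    isr = delta_bar / b2t if b2t > 0 else float("inf")
    answer = isr >= 1.0 and delta_bar >= b2t + margin
    return dict(delta_bar=delta_bar, b2t=b2t, isr=isr,
                roh_bound=roh_bound(delta_bar, q_bar), answer=answer)

# Toy numbers: decent lift over the skeletons but a low worst-case prior.
# The RoH bound comes out small, yet the gate still refuses (ISR < 1);
# this is exactly the behavior discussed in the next section.
print(decide(logP_y=math.log(0.9),
             logS_y=[math.log(0.3)] * 6,
             q_list=[0.3, 0.25, 0.2, 0.35, 0.3, 0.28]))
```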
The toolkit exhibits different behaviors across query types, which is mathematically consistent with the framework:
Numeric / arithmetic-style prompts
Observation: May abstain despite apparent simplicity
Explanation:
- Models often attempt answers even with masked numbers (pattern recognition)
- This yields low information lift $\bar{\Delta} \approx 0$ between the full prompt and the skeletons
- Despite a potentially low EDFL risk bound, the worst-case prior gate triggers abstention (ISR < 1)

Entity-centric factual prompts
Observation: Generally answered with confidence
Explanation:
- Masking entities/dates substantially reduces answer propensity in the skeletons
- Restoring them yields a large $\bar{\Delta}$ that clears the B2T threshold
- The system answers with a tight EDFL risk bound
This is not a bug but a feature: The framework prioritizes safety through worst-case guarantees while providing realistic average-case bounds.
Different model providers may exhibit varying behaviors:
- OpenAI/Anthropic: Generally produce clean JSON decisions with high compliance
- Hugging Face (Local): May require instruction-tuned variants for best results
- Ollama: Performance depends on the specific model; instruction-tuned models recommended
- Base Models: May need adjusted prompting or higher sampling for stable priors
- Prompt contains a field like `Evidence:` (or JSON keys)
- Skeletons erase the evidence content but preserve structure and roles; blocks are then permuted deterministically (seeded)
- Decision head: "Answer only if the provided evidence is sufficient; otherwise refuse."
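Conceptually, an evidence-erasing skeleton can be pictured as in the toy sketch below. This is illustrative only (the toolkit's actual skeleton builder is richer); it blanks the `Evidence:` bullets while keeping labels and layout, then applies a seeded permutation:

```python
import random

def evidence_skeleton(prompt: str, seed: int = 0) -> str:
    """Toy evidence-erasing skeleton: keep field labels and bullet structure,
    blank the evidence content, then permute lines with a seeded RNG."""
    out, in_evidence = [], False
    for line in prompt.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("evidence:"):
            in_evidence = True
            out.append(line)                  # keep the label
        elif in_evidence and stripped.startswith("-"):
            out.append("- […]")               # erase content, preserve structure
        else:
            in_evidence = False
            out.append(line)
    rng = random.Random(seed)                 # deterministic permutation
    rng.shuffle(out)
    return "\n".join(out)

print(evidence_skeleton("""Task: Answer strictly based on the evidence below.
Question: Who won the Nobel Prize in Physics in 2019?
Evidence:
- Nobel Prize press release (2019): James Peebles (1/2); Michel Mayor & Didier Queloz (1/2).
Constraints: If evidence is insufficient or conflicting, refuse.
"""))
```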
Example with Multiple Providers
from hallucination_toolkit import OpenAIItem, OpenAIPlanner
from htk_backends import AnthropicBackend
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
prompt = """Task: Answer strictly based on the evidence below.
Question: Who won the Nobel Prize in Physics in 2019?
Evidence:
- Nobel Prize press release (2019): James Peebles (1/2); Michel Mayor & Didier Queloz (1/2).
Constraints: If evidence is insufficient or conflicting, refuse.
"""
item = OpenAIItem(
prompt=prompt,
n_samples=5,
m=6,
fields_to_erase=["Evidence"],
skeleton_policy="auto"
)
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run([item], h_star=0.05, isr_threshold=1.0)
- Prompt has no evidence
- Skeletons apply semantic masking of:
  - Multi-word proper nouns (e.g., "James Peebles" → "[…]")
  - Years (e.g., "2019" → "[…]")
  - Numbers (e.g., "3.14" → "[…]")
  - Quoted spans (e.g., '"Nobel Prize"' → "[…]")
- Masking strengths: Progressive levels (0.25, 0.35, 0.5, 0.65, 0.8, 0.9) across skeleton ensemble
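For intuition, a masked skeleton at a given strength might be produced roughly as follows. This is a toy sketch, not the toolkit's masker; the regexes and the strength handling are simplified stand-ins:

```python
import random
import re

def mask_closed_book(prompt: str, strength: float, seed: int = 0) -> str:
    """Toy semantic masking: hide roughly a `strength` fraction of entity-like
    spans (quoted text, multi-word proper nouns, years, other numbers)."""
    patterns = [
        r"\"[^\"]+\"|'[^']+'",                    # quoted spans
        r"\b(?:[A-Z][a-z]+ ){1,3}[A-Z][a-z]+\b",  # multi-word proper nouns
        r"\b(?:19|20)\d{2}\b",                    # years
        r"\b\d+(?:\.\d+)?\b",                     # other numbers
    ]
    spans = sorted({m.span() for pat in patterns for m in re.finditer(pat, prompt)})
    kept, last_end = [], -1                       # drop overlapping matches
    for start, end in spans:
        if start >= last_end:
            kept.append((start, end))
            last_end = end
    if not kept:
        return prompt
    rng = random.Random(seed)                     # deterministic, seeded choice
    k = max(1, int(strength * len(kept)))
    for start, end in sorted(rng.sample(kept, k), reverse=True):
        prompt = prompt[:start] + "[…]" + prompt[end:]
    return prompt

# Progressive masking strengths, as used across the skeleton ensemble
for s in (0.25, 0.5, 0.9):
    print(f"{s:.2f}: {mask_closed_book('Who won the 2019 Nobel Prize in Physics?', s, seed=1)}")
```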
Example with Multiple Providers
from hallucination_toolkit import OpenAIItem, OpenAIPlanner
from htk_backends import OllamaBackend
backend = OllamaBackend(model="mixtral:8x7b-instruct")
item = OpenAIItem(
prompt="Who won the 2019 Nobel Prize in Physics?",
n_samples=7,
m=6,
skeleton_policy="closed_book"
)
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run([item], h_star=0.05)
- `OpenAIBackend(model, api_key=None)` – Original OpenAI wrapper
- `AnthropicBackend(model, api_key=None)` – Anthropic Claude adapter
- `HuggingFaceBackend(mode, model_id, ...)` – Hugging Face adapter (3 modes)
- `OllamaBackend(model, host)` – Ollama local model adapter
- `OpenAIItem(prompt, n_samples=5, m=6, fields_to_erase=None, skeleton_policy="auto")` – One evaluation item
- `OpenAIPlanner(backend, temperature=0.5, q_floor=None)` – Runs evaluation (works with any backend):
  - `run(items, h_star, isr_threshold, margin_extra_bits, B_clip=12.0, clip_mode="one-sided") -> List[ItemMetrics]`
  - `aggregate(items, metrics, alpha=0.05, h_star, ...) -> AggregateReport`
- `make_sla_certificate(report, model_name)` – Creates a formal SLA certificate
- `save_sla_certificate_json(cert, path)` – Exports the certificate for audit
- `generate_answer_if_allowed(backend, item, metric)` – Only emits an answer if the decision was ANSWER
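For completeness, a short sketch of gating the final generation with `generate_answer_if_allowed`. The signature follows the list above; the import location is assumed to match the other helpers (adjust to your install):

```python
from hallucination_toolkit import (
    OpenAIBackend, OpenAIItem, OpenAIPlanner, generate_answer_if_allowed
)

backend = OpenAIBackend(model="gpt-4o-mini")
planner = OpenAIPlanner(backend, temperature=0.3)
item = OpenAIItem(prompt="Who won the 2019 Nobel Prize in Physics?", n_samples=7, m=6)

metric = planner.run([item], h_star=0.05)[0]
# Emits an answer only if the gate decided ANSWER for this item
answer = generate_answer_if_allowed(backend, item, metric)
if metric.decision_answer:
    print(answer)
```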
Every `ItemMetrics` includes:
- `delta_bar`: Information budget (nats)
- `q_conservative`: Worst-case prior $q_{\text{lo}}$
- `q_avg`: Average prior $\bar{q}$
- `b2t`: Bits-to-Trust requirement
- `isr`: Information Sufficiency Ratio
- `roh_bound`: EDFL hallucination risk bound
- `decision_answer`: Boolean decision
- `rationale`: Human-readable explanation
- `meta`: Dict with `q_list`, `S_list_y`, `P_y`, `closed_book`, etc.
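These fields make a simple audit trail easy to produce. A minimal sketch, assuming `metrics` is the `List[ItemMetrics]` returned by `planner.run(...)` and the fields are exposed as attributes (as in the examples above):

```python
import json

records = [
    {
        "delta_bar": m.delta_bar,
        "q_conservative": m.q_conservative,
        "q_avg": m.q_avg,
        "b2t": m.b2t,
        "isr": m.isr,
        "roh_bound": m.roh_bound,
        "decision": "ANSWER" if m.decision_answer else "REFUSE",
        "rationale": m.rationale,
    }
    for m in metrics
]
# One JSON record per line for downstream audit tooling
with open("edfl_audit.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```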
On a labeled validation set:
- Sweep the margin parameter from 0 to 1 nats
- For each margin, compute:
  - Empirical hallucination rate among answered items
  - Wilson upper bound at 95% confidence
- Select the smallest margin where the Wilson upper bound ≤ target $h^*$ (e.g., 5%)
- Freeze the policy: $(h^*, \tau, \text{margin}, B, \text{clip\_mode}, m, r, \text{skeleton\_policy})$
The toolkit provides comprehensive metrics:
- Answer/abstention rates
- Empirical hallucination rate + Wilson bound
- Distribution of per-item EDFL RoH bounds
- Worst-case and median risk bounds
- Complete audit trail
| Provider | Best For | Considerations |
|---|---|---|
| OpenAI | Production deployment, consistent JSON | Requires API key, costs per token |
| Anthropic | High-quality reasoning, safety-critical | Requires API key, may have rate limits |
| OpenRouter | Multi-model testing, cost optimization | Single API for 100+ models, automatic fallbacks |
| HuggingFace (Local) | Full control, no API costs | Requires GPU, setup complexity |
| HuggingFace (TGI) | Team deployments, caching | Requires server setup |
| HuggingFace (API) | Quick prototyping | Rate limits, requires HF token |
| Ollama | Local experimentation | Easy setup, model quality varies |
| Provider | Latency per Item | Cost | Setup Complexity |
|---|---|---|---|
| OpenAI | 2-5 seconds | ~$0.01-0.03 | Low |
| Anthropic | 3-6 seconds | ~$0.02-0.05 | Low |
| HF Local | 1-10 seconds | Free (GPU cost) | Medium-High |
| HF TGI | 1-3 seconds | Server costs | High |
| HF API | 3-8 seconds | Free tier/paid | Low |
| Ollama | 2-15 seconds | Free (local) | Low |
Model ignores the decision format or gives unstable priors?
Solution: Use instruction-tuned model variants (e.g., `-Instruct` suffixes)

Providers disagree on the same prompt?
Expected: Models have different knowledge/calibration; the framework adapts accordingly

Requests time out on local or self-hosted backends?
Solution: Increase the `request_timeout` parameter or reduce batch size
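For example, the timeout can be raised on backends that expose `request_timeout` (shown earlier for the Ollama and Anthropic adapters):

```python
from htk_backends import OllamaBackend

# Give slow local models more headroom per request; request_timeout is the
# documented backend parameter (default 60.0 seconds).
backend = OllamaBackend(model="llama3.1:8b-instruct", request_timeout=180.0)
```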
.
├── app/ # Application entry points
│ ├── web/web_app.py # Streamlit UI
│ ├── cli/frontend.py # Interactive CLI
│ ├── examples/ # Example scripts
│ └── launcher/entry.py # Unified launcher
├── hallbayes/ # Core modules
│ ├── hallucination_toolkit.py # Main toolkit
│ ├── htk_backends.py # Universal backend adapters
│ └── build_offline_backend.sh
├── electron/ # Desktop wrapper
├── launch/ # Platform launchers
├── release/ # Packaged artifacts
├── bin/ # Offline backend binary
├── requirements.txt
├── pyproject.toml
└── README.md
from hallbayes import OpenAIPlanner, OpenAIItem, make_sla_certificate, save_sla_certificate_json
from hallbayes.htk_backends import AnthropicBackend # or any other backend
# Choose your provider
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
# Configure and run
items = [OpenAIItem(prompt="...", n_samples=7, m=6)]
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run(items, h_star=0.05)
# Generate SLA certificate
report = planner.aggregate(items, metrics)
cert = make_sla_certificate(report, model_name="Claude-3.5-Sonnet")
save_sla_certificate_json(cert, "sla.json")
streamlit run app/web/web_app.py
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend, OllamaBackend
import json
# Load prompts
with open("prompts.json") as f:
    prompts = json.load(f)
# Setup providers
providers = {
"claude": AnthropicBackend(model="claude-3-5-sonnet-latest"),
"llama": OllamaBackend(model="llama3.1:8b-instruct")
}
# Process with each provider
results = {}
for name, backend in providers.items():
    planner = OpenAIPlanner(backend, temperature=0.3)
    items = [OpenAIItem(prompt=p, n_samples=5, m=6) for p in prompts]
    metrics = planner.run(items, h_star=0.05)
    results[name] = planner.aggregate(items, metrics)
If you're already using the toolkit with OpenAI, here's how to try other providers:
# Original (OpenAI only)
from hallucination_toolkit import OpenAIBackend
backend = OpenAIBackend(model="gpt-4o-mini")
# New (Any provider) - just change these two lines:
from htk_backends import AnthropicBackend # or HuggingFaceBackend, OllamaBackend
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
# Everything else stays exactly the same!
planner = OpenAIPlanner(backend, temperature=0.3)
# ... rest of your code unchanged
Based on the Paper: Predictable Compression Failures: Why Language Models Actually Hallucinate - https://arxiv.org/abs/2509.11208
This project is licensed under the MIT License – see the LICENSE file for details.
Developed by Hassana Labs (https://hassana.io).