Post-hoc calibration without retraining for large language models. This toolkit turns a raw prompt into:
- a bounded hallucination risk using the Expectation-level Decompression Law (EDFL), and
- a decision to ANSWER or REFUSE under a target SLA, with transparent math (nats).
- Multi-Provider Support: Works with OpenAI, Anthropic (Claude), OpenRouter, Hugging Face, and Ollama models
- No Retraining Required: Pure inference-time calibration
- Two Deployment Modes:
  - Evidence-based: prompts include evidence/context; rolling priors are built by erasing that evidence
  - Closed-book: prompts have no evidence; rolling priors are built by semantic masking of entities/numbers/titles
- Mathematically Grounded: Based on the EDFL/B2T/ISR framework from the accompanying preprint (see the reference at the end)
- Install & Setup
- Supported Model Providers
- Quick Start Examples
- Core Mathematical Framework
- Understanding System Behavior
- Two Ways to Build Rolling Priors
- API Surface
- Calibration & Validation
- Practical Considerations
- Project Layout
- Deployment Options
# Core requirement
pip install --upgrade openai
# For additional providers (optional)
pip install anthropic # For Claude models
pip install transformers torch # For local Hugging Face models
pip install ollama # For Ollama models
pip install requests # For HTTP-based backends
# For OpenAI
export OPENAI_API_KEY=sk-...
# For Anthropic (Claude)
export ANTHROPIC_API_KEY=sk-ant-...
# For Hugging Face Inference API
export HF_TOKEN=hf_...
The toolkit now supports multiple LLM providers through universal backend adapters:
- OpenAI: GPT-4o, GPT-4o-mini, and other Chat Completions models; requires `OPENAI_API_KEY`
- Anthropic: Claude 3.5 Sonnet, Claude 3 Opus, and other Claude models; requires the `anthropic` package and `ANTHROPIC_API_KEY`
- Hugging Face:
  - Local Transformers: run models locally with `transformers`
  - TGI Server: connect to Text Generation Inference servers
  - Inference API: use hosted models via the Hugging Face API
- Ollama: run any Ollama-supported model locally; supports both the Python SDK and the HTTP API
- OpenRouter: single API for 100+ models from OpenAI, Anthropic, Google, Meta, and more; automatic fallbacks and load balancing; often cheaper than direct API access due to volume aggregation; built-in rate limiting and retry logic
from hallucination_toolkit import OpenAIBackend, OpenAIItem, OpenAIPlanner
backend = OpenAIBackend(model="gpt-4o-mini")
planner = OpenAIPlanner(backend, temperature=0.3)
item = OpenAIItem(
prompt="Who won the 2019 Nobel Prize in Physics?",
n_samples=7,
m=6,
skeleton_policy="closed_book"
)
metrics = planner.run(
[item],
h_star=0.05, # Target 5% hallucination max
isr_threshold=1.0, # Standard ISR gate
margin_extra_bits=0.2, # Safety margin
B_clip=12.0, # Clipping bound
clip_mode="one-sided" # Conservative mode
)
for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"Risk bound: {m.roh_bound:.3f}")
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend
# Use Claude instead of GPT
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
planner = OpenAIPlanner(backend, temperature=0.3)
# Rest of the code remains identical
items = [OpenAIItem(prompt="What is quantum entanglement?", n_samples=7, m=6)]
metrics = planner.run(items, h_star=0.05)
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import HuggingFaceBackend
# Run Llama locally
backend = HuggingFaceBackend(
mode="transformers",
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
device_map="auto" # or "cuda" or "cpu"
)
planner = OpenAIPlanner(backend, temperature=0.3)
# Same evaluation flow
metrics = planner.run([...], h_star=0.05)
from htk_backends import HuggingFaceBackend
# Connect to a Text Generation Inference server
backend = HuggingFaceBackend(
mode="tgi",
tgi_url="http://localhost:8080"
)
planner = OpenAIPlanner(backend, temperature=0.3)
from htk_backends import HuggingFaceBackend
import os
# Use Hugging Face's hosted models
backend = HuggingFaceBackend(
mode="inference_api",
model_id="mistralai/Mistral-7B-Instruct-v0.3",
hf_token=os.environ["HF_TOKEN"]
)
planner = OpenAIPlanner(backend, temperature=0.3)
OpenRouter provides access to 100+ models through a single API, making it ideal for comparing hallucination bounds across providers:
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import OpenRouterBackend
# Access any model through OpenRouter's unified API
backend = OpenRouterBackend(
model="openrouter/auto", # Auto-selects best available model
# model="anthropic/claude-3.5-sonnet", # Or specify exact model
# api_key="...", # Uses OPENROUTER_API_KEY env var if not provided
http_referer="https://your.app", # Optional but recommended
x_title="EDFL Decision Head (prod)", # Optional app identifier
providers={"allow": ["anthropic", "google", "openai"]}, # Optional: limit providers
)
planner = OpenAIPlanner(
backend=backend,
temperature=0.5,
max_tokens_decision=8, # Tiny JSON decision head
q_floor=None, # Or set your prior floor
)
items = [OpenAIItem(
prompt="What is quantum entanglement?",
n_samples=3,
m=6,
skeleton_policy="auto"
)]
metrics = planner.run(
items,
h_star=0.05,
isr_threshold=1.0,
B_clip=12.0,
clip_mode="one-sided"
)
for m in metrics:
    print(f"Decision: {'ANSWER' if m.decision_answer else 'REFUSE'}")
    print(f"ISR: {m.isr:.3f}, RoH bound: {m.roh_bound:.3f}")
Why OpenRouter for this toolkit?
- Test calibration across many models without managing multiple API keys
- Automatic fallbacks ensure high availability for production deployments
- Cost optimization through intelligent routing
- Perfect for A/B testing different models' hallucination characteristics
from htk_backends import OllamaBackend
# Use any Ollama model
backend = OllamaBackend(
model="llama3.1:8b-instruct",
host="http://localhost:11434" # Default Ollama port
)
planner = OpenAIPlanner(backend, temperature=0.3)
AnthropicBackend(
model="claude-3-5-sonnet-latest", # or any Claude model
api_key=None, # Uses ANTHROPIC_API_KEY env var if None
request_timeout=60.0
)
Requirements: pip install anthropic
OpenRouterBackend(
model="openrouter/auto", # Auto-routing or specific model
api_key=None, # Uses OPENROUTER_API_KEY env var
http_referer="https://your.app", # Recommended for tracking
x_title="Your App Name", # Optional identifier
providers={"allow": ["anthropic", "google"]}, # Optional filtering
)
Requirements: pip install openai (OpenRouter uses an OpenAI-compatible API)
Available models include: `anthropic/claude-3.5-sonnet`, `openai/gpt-4-turbo`, `google/gemini-pro`, `meta-llama/llama-3-70b-instruct`, `mistralai/mixtral-8x7b`; see OpenRouter models for the full list.
The Hugging Face backend supports three operational modes:
HuggingFaceBackend(
mode="transformers",
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
device_map="auto", # GPU allocation strategy
torch_dtype="float16", # Optional: precision setting
trust_remote_code=True, # For custom model code
model_kwargs={} # Additional model parameters
)
Requirements: pip install transformers torch
HuggingFaceBackend(
mode="tgi",
tgi_url="http://localhost:8080", # Your TGI server URL
model_id=None # Not needed for TGI
)
Requirements: pip install requests and a running TGI server
HuggingFaceBackend(
mode="inference_api",
model_id="mistralai/Mistral-7B-Instruct-v0.3",
hf_token="hf_..." # Your Hugging Face token
)
Requirements: pip install requests and a Hugging Face account
OllamaBackend(
model="llama3.1:8b-instruct", # Any Ollama model
host="http://localhost:11434", # Ollama server URL
request_timeout=60.0
)
Requirements: pip install ollama (optional) or pip install requests, and Ollama installed locally
Here's a complete example comparing different providers on the same prompt:
from hallucination_toolkit import OpenAIBackend, OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend, HuggingFaceBackend, OllamaBackend
# Define test prompt
prompt = "What are the main differences between quantum and classical computing?"
item = OpenAIItem(prompt=prompt, n_samples=5, m=6, skeleton_policy="closed_book")
# Test configuration
config = dict(
h_star=0.05,
isr_threshold=1.0,
margin_extra_bits=0.2,
B_clip=12.0,
clip_mode="one-sided"
)
# Compare providers
providers = {
"GPT-4o-mini": OpenAIBackend(model="gpt-4o-mini"),
"Claude-3.5": AnthropicBackend(model="claude-3-5-sonnet-latest"),
"Llama-3.1": HuggingFaceBackend(mode="transformers", model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
"Ollama": OllamaBackend(model="llama3.1:8b-instruct")
}
results = {}
for name, backend in providers.items():
    try:
        planner = OpenAIPlanner(backend, temperature=0.3)
        metrics = planner.run([item], **config)
        results[name] = metrics[0]
        print(f"{name}: {'ANSWER' if metrics[0].decision_answer else 'REFUSE'} (RoH={metrics[0].roh_bound:.3f})")
    except Exception as e:
        print(f"{name}: Error - {e}")
Let the binary event $\mathcal{A}$ denote answering (the model commits to an answer rather than refusing). For each item:
- Build an ensemble of $m$ content-weakened prompts (the rolling priors) $S_1,\dots,S_m$ by erasing evidence or semantically masking key content.
- Information budget:
$$\bar{\Delta} = \tfrac{1}{m}\sum_k \mathrm{clip}_+(\log P(y) - \log S_k(y), B)$$ (one-sided clipping; default $B=12$ nats to prevent outliers while maintaining conservative bounds).
- Prior masses: $q_k = S_k(\mathcal{A})$, with:
  - $\bar{q}=\tfrac{1}{m}\sum_k q_k$ (average prior for the EDFL bound)
  - $q_{\text{lo}}=\min_k q_k$ (worst-case prior for SLA gating)

By EDFL, the achievable reliability is bounded by the information budget: the realized probability $p = P(\mathcal{A})$ satisfies
$$\mathrm{KL}(\mathrm{Ber}(p)\,\|\,\mathrm{Ber}(\bar{q})) \le \bar{\Delta}.$$
Thus the hallucination risk (error) is bounded by
$$\overline{\mathrm{RoH}} = \min\{\, h : \mathrm{KL}(\mathrm{Ber}(1-h)\,\|\,\mathrm{Ber}(\bar{q})) \le \bar{\Delta} \,\},$$
the smallest risk consistent with the available information. For a target hallucination rate $h^*$, the decision gate is:
- Bits-to-Trust: $\mathrm{B2T} = \mathrm{KL}(\mathrm{Ber}(1-h^*)\,\|\,\mathrm{Ber}(q_{\text{lo}}))$
- Information Sufficiency Ratio: $\mathrm{ISR} = \bar{\Delta}/\mathrm{B2T}$
- ANSWER iff $\mathrm{ISR}\ge 1$ and $\bar{\Delta} \ge \mathrm{B2T} + \text{margin}$ (default margin ≈ 0.2 nats)

Why two priors? The gate uses the worst-case $q_{\text{lo}}$ for strict SLA compliance. The RoH bound uses the average $\bar{q}$ per EDFL theory. This dual approach ensures conservative safety while providing realistic risk bounds.
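For concreteness, here is a minimal, self-contained sketch of these quantities. It mirrors the formulas above rather than the toolkit's internal implementation, and the inputs (`logP_y`, `logS_y`, `q_list`) are illustrative numbers that you would normally estimate by sampling the backend:

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q)) in nats, with guards against log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def roh_bound(delta_bar: float, q_bar: float, tol: float = 1e-9) -> float:
    """Smallest h with KL(Ber(1-h) || Ber(q_bar)) <= delta_bar, by bisection."""
    if bernoulli_kl(1.0, q_bar) <= delta_bar:      # enough budget to reach p = 1
        return 0.0
    lo, hi = 0.0, 1.0 - q_bar                      # KL decreases in h on this interval
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(1.0 - mid, q_bar) > delta_bar:
            lo = mid
        else:
            hi = mid
    return hi

def decide(logP_y, logS_y, q_list, h_star=0.05, B=12.0, margin=0.2):
    """Compute Delta_bar, B2T, ISR, the RoH bound, and the ANSWER/REFUSE gate."""
    delta_bar = sum(min(max(logP_y - ls, 0.0), B) for ls in logS_y) / len(logS_y)
    q_lo, q_bar = min(q_list), sum(q_list) / len(q_list)
    b2t = bernoulli_kl(1.0 - h_star, q_lo)         # Bits-to-Trust (worst-case prior)
    isr = delta_bar / b2t if b2t > 0 else float("inf")
    answer = isr >= 1.0 and delta_bar >= b2t + margin
    return dict(delta_bar=delta_bar, b2t=b2t, isr=isr,
                roh_bound=roh_bound(delta_bar, q_bar), answer=answer)

# Toy numbers: decent lift over the skeletons but a low worst-case prior.
# The RoH bound comes out small, yet the gate still refuses (ISR < 1);
# this is exactly the behavior discussed in the next section.
print(decide(logP_y=math.log(0.9),
             logS_y=[math.log(0.3)] * 6,
             q_list=[0.3, 0.25, 0.2, 0.35, 0.3, 0.28]))
```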
The toolkit exhibits different behaviors across query types, which is mathematically consistent with the framework:
Numeric / arithmetic-style prompts
Observation: May abstain despite apparent simplicity
Explanation:
- Models often attempt answers even with masked numbers (pattern recognition)
- This yields low information lift $\bar{\Delta} \approx 0$ between the full prompt and the skeletons
- Despite a potentially low EDFL risk bound, the worst-case prior gate triggers abstention (ISR < 1)

Entity-centric factual prompts
Observation: Generally answered with confidence
Explanation:
- Masking entities/dates substantially reduces answer propensity in the skeletons
- Restoring them yields a large $\bar{\Delta}$ that clears the B2T threshold
- The system answers with a tight EDFL risk bound
This is not a bug but a feature: The framework prioritizes safety through worst-case guarantees while providing realistic average-case bounds.
Different model providers may exhibit varying behaviors:
- OpenAI/Anthropic: Generally produce clean JSON decisions with high compliance
- Hugging Face (Local): May require instruction-tuned variants for best results
- Ollama: Performance depends on the specific model; instruction-tuned models recommended
- Base Models: May need adjusted prompting or higher sampling for stable priors
- Prompt contains a field like `Evidence:` (or JSON keys)
- Skeletons erase the evidence content but preserve structure and roles; blocks are then permuted deterministically (seeded)
- Decision head: "Answer only if the provided evidence is sufficient; otherwise refuse."
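Conceptually, an evidence-erasing skeleton can be pictured as in the toy sketch below. This is illustrative only (the toolkit's actual skeleton builder is richer); it blanks the `Evidence:` bullets while keeping labels and layout, then applies a seeded permutation:

```python
import random

def evidence_skeleton(prompt: str, seed: int = 0) -> str:
    """Toy evidence-erasing skeleton: keep field labels and bullet structure,
    blank the evidence content, then permute lines with a seeded RNG."""
    out, in_evidence = [], False
    for line in prompt.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("evidence:"):
            in_evidence = True
            out.append(line)                  # keep the label
        elif in_evidence and stripped.startswith("-"):
            out.append("- […]")               # erase content, preserve structure
        else:
            in_evidence = False
            out.append(line)
    rng = random.Random(seed)                 # deterministic permutation
    rng.shuffle(out)
    return "\n".join(out)

print(evidence_skeleton("""Task: Answer strictly based on the evidence below.
Question: Who won the Nobel Prize in Physics in 2019?
Evidence:
- Nobel Prize press release (2019): James Peebles (1/2); Michel Mayor & Didier Queloz (1/2).
Constraints: If evidence is insufficient or conflicting, refuse.
"""))
```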
Example with Multiple Providers
from hallucination_toolkit import OpenAIItem, OpenAIPlanner
from htk_backends import AnthropicBackend
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
prompt = """Task: Answer strictly based on the evidence below.
Question: Who won the Nobel Prize in Physics in 2019?
Evidence:
- Nobel Prize press release (2019): James Peebles (1/2); Michel Mayor & Didier Queloz (1/2).
Constraints: If evidence is insufficient or conflicting, refuse.
"""
item = OpenAIItem(
prompt=prompt,
n_samples=5,
m=6,
fields_to_erase=["Evidence"],
skeleton_policy="auto"
)
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run([item], h_star=0.05, isr_threshold=1.0)
- Prompt has no evidence
- Skeletons apply semantic masking of:
  - Multi-word proper nouns (e.g., "James Peebles" → "[…]")
  - Years (e.g., "2019" → "[…]")
  - Numbers (e.g., "3.14" → "[…]")
  - Quoted spans (e.g., '"Nobel Prize"' → "[…]")
- Masking strengths: Progressive levels (0.25, 0.35, 0.5, 0.65, 0.8, 0.9) across skeleton ensemble
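For intuition, a masked skeleton at a given strength might be produced roughly as follows. This is a toy sketch, not the toolkit's masker; the regexes and the strength handling are simplified stand-ins:

```python
import random
import re

def mask_closed_book(prompt: str, strength: float, seed: int = 0) -> str:
    """Toy semantic masking: hide roughly a `strength` fraction of entity-like
    spans (quoted text, multi-word proper nouns, years, other numbers)."""
    patterns = [
        r"\"[^\"]+\"|'[^']+'",                    # quoted spans
        r"\b(?:[A-Z][a-z]+ ){1,3}[A-Z][a-z]+\b",  # multi-word proper nouns
        r"\b(?:19|20)\d{2}\b",                    # years
        r"\b\d+(?:\.\d+)?\b",                     # other numbers
    ]
    spans = sorted({m.span() for pat in patterns for m in re.finditer(pat, prompt)})
    kept, last_end = [], -1                       # drop overlapping matches
    for start, end in spans:
        if start >= last_end:
            kept.append((start, end))
            last_end = end
    if not kept:
        return prompt
    rng = random.Random(seed)                     # deterministic, seeded choice
    k = max(1, int(strength * len(kept)))
    for start, end in sorted(rng.sample(kept, k), reverse=True):
        prompt = prompt[:start] + "[…]" + prompt[end:]
    return prompt

# Progressive masking strengths, as used across the skeleton ensemble
for s in (0.25, 0.5, 0.9):
    print(f"{s:.2f}: {mask_closed_book('Who won the 2019 Nobel Prize in Physics?', s, seed=1)}")
```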
Example with Multiple Providers
from hallucination_toolkit import OpenAIItem, OpenAIPlanner
from htk_backends import OllamaBackend
backend = OllamaBackend(model="mixtral:8x7b-instruct")
item = OpenAIItem(
prompt="Who won the 2019 Nobel Prize in Physics?",
n_samples=7,
m=6,
skeleton_policy="closed_book"
)
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run([item], h_star=0.05)
- `OpenAIBackend(model, api_key=None)` – Original OpenAI wrapper
- `AnthropicBackend(model, api_key=None)` – Anthropic Claude adapter
- `HuggingFaceBackend(mode, model_id, ...)` – Hugging Face adapter (3 modes)
- `OllamaBackend(model, host)` – Ollama local model adapter
- `OpenAIItem(prompt, n_samples=5, m=6, fields_to_erase=None, skeleton_policy="auto")` – One evaluation item
- `OpenAIPlanner(backend, temperature=0.5, q_floor=None)` – Runs evaluation (works with any backend):
  - `run(items, h_star, isr_threshold, margin_extra_bits, B_clip=12.0, clip_mode="one-sided") -> List[ItemMetrics]`
  - `aggregate(items, metrics, alpha=0.05, h_star, ...) -> AggregateReport`
- `make_sla_certificate(report, model_name)` – Creates a formal SLA certificate
- `save_sla_certificate_json(cert, path)` – Exports the certificate for audit
- `generate_answer_if_allowed(backend, item, metric)` – Only emits an answer if the decision was ANSWER
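For completeness, a short sketch of gating the final generation with `generate_answer_if_allowed`. The signature follows the list above; the import location is assumed to match the other helpers (adjust to your install):

```python
from hallucination_toolkit import (
    OpenAIBackend, OpenAIItem, OpenAIPlanner, generate_answer_if_allowed
)

backend = OpenAIBackend(model="gpt-4o-mini")
planner = OpenAIPlanner(backend, temperature=0.3)
item = OpenAIItem(prompt="Who won the 2019 Nobel Prize in Physics?", n_samples=7, m=6)

metric = planner.run([item], h_star=0.05)[0]
# Emits an answer only if the gate decided ANSWER for this item
answer = generate_answer_if_allowed(backend, item, metric)
if metric.decision_answer:
    print(answer)
```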
Every `ItemMetrics` includes:
- `delta_bar`: Information budget (nats)
- `q_conservative`: Worst-case prior $q_{\text{lo}}$
- `q_avg`: Average prior $\bar{q}$
- `b2t`: Bits-to-Trust requirement
- `isr`: Information Sufficiency Ratio
- `roh_bound`: EDFL hallucination risk bound
- `decision_answer`: Boolean decision
- `rationale`: Human-readable explanation
- `meta`: Dict with `q_list`, `S_list_y`, `P_y`, `closed_book`, etc.
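These fields make a simple audit trail easy to produce. A minimal sketch, assuming `metrics` is the `List[ItemMetrics]` returned by `planner.run(...)` and the fields are exposed as attributes (as in the examples above):

```python
import json

records = [
    {
        "delta_bar": m.delta_bar,
        "q_conservative": m.q_conservative,
        "q_avg": m.q_avg,
        "b2t": m.b2t,
        "isr": m.isr,
        "roh_bound": m.roh_bound,
        "decision": "ANSWER" if m.decision_answer else "REFUSE",
        "rationale": m.rationale,
    }
    for m in metrics
]
# One JSON record per line for downstream audit tooling
with open("edfl_audit.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```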
On a labeled validation set:
- Sweep the margin parameter from 0 to 1 nats
- For each margin, compute:
  - Empirical hallucination rate among answered items
  - Wilson upper bound at 95% confidence
- Select the smallest margin where the Wilson upper bound ≤ target $h^*$ (e.g., 5%)
- Freeze the policy: $(h^*, \tau, \text{margin}, B, \text{clip\_mode}, m, r, \text{skeleton\_policy})$
The toolkit provides comprehensive metrics:
- Answer/abstention rates
- Empirical hallucination rate + Wilson bound
- Distribution of per-item EDFL RoH bounds
- Worst-case and median risk bounds
- Complete audit trail
| Provider | Best For | Considerations |
|---|---|---|
| OpenAI | Production deployment, consistent JSON | Requires API key, costs per token |
| Anthropic | High-quality reasoning, safety-critical | Requires API key, may have rate limits |
| OpenRouter | Multi-model testing, cost optimization | Single API for 100+ models, automatic fallbacks |
| HuggingFace (Local) | Full control, no API costs | Requires GPU, setup complexity |
| HuggingFace (TGI) | Team deployments, caching | Requires server setup |
| HuggingFace (API) | Quick prototyping | Rate limits, requires HF token |
| Ollama | Local experimentation | Easy setup, model quality varies |
| Provider | Latency per Item | Cost | Setup Complexity |
|---|---|---|---|
| OpenAI | 2-5 seconds | ~$0.01-0.03 | Low |
| Anthropic | 3-6 seconds | ~$0.02-0.05 | Low |
| HF Local | 1-10 seconds | Free (GPU cost) | Medium-High |
| HF TGI | 1-3 seconds | Server costs | High |
| HF API | 3-8 seconds | Free tier/paid | Low |
| Ollama | 2-15 seconds | Free (local) | Low |
Model ignores the decision format or gives unstable priors?
Solution: Use instruction-tuned model variants (e.g., `-Instruct` suffixes)

Providers disagree on the same prompt?
Expected: Models have different knowledge/calibration; the framework adapts accordingly

Requests time out on local or self-hosted backends?
Solution: Increase the `request_timeout` parameter or reduce batch size
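For example, the timeout can be raised on backends that expose `request_timeout` (shown earlier for the Ollama and Anthropic adapters):

```python
from htk_backends import OllamaBackend

# Give slow local models more headroom per request; request_timeout is the
# documented backend parameter (default 60.0 seconds).
backend = OllamaBackend(model="llama3.1:8b-instruct", request_timeout=180.0)
```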
.
├── app/ # Application entry points
│ ├── web/web_app.py # Streamlit UI
│ ├── cli/frontend.py # Interactive CLI
│ ├── examples/ # Example scripts
│ └── launcher/entry.py # Unified launcher
├── hallbayes/ # Core modules
│ ├── hallucination_toolkit.py # Main toolkit
│ ├── htk_backends.py # Universal backend adapters
│ └── build_offline_backend.sh
├── electron/ # Desktop wrapper
├── launch/ # Platform launchers
├── release/ # Packaged artifacts
├── bin/ # Offline backend binary
├── requirements.txt
├── pyproject.toml
└── README.md
from hallbayes import OpenAIPlanner, OpenAIItem, make_sla_certificate, save_sla_certificate_json
from hallbayes.htk_backends import AnthropicBackend # or any other backend
# Choose your provider
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
# Configure and run
items = [OpenAIItem(prompt="...", n_samples=7, m=6)]
planner = OpenAIPlanner(backend, temperature=0.3)
metrics = planner.run(items, h_star=0.05)
# Generate SLA certificate
report = planner.aggregate(items, metrics)
cert = make_sla_certificate(report, model_name="Claude-3.5-Sonnet")
save_sla_certificate_json(cert, "sla.json")
streamlit run app/web/web_app.py
from hallucination_toolkit import OpenAIPlanner, OpenAIItem
from htk_backends import AnthropicBackend, OllamaBackend
import json
# Load prompts
with open("prompts.json") as f:
    prompts = json.load(f)
# Setup providers
providers = {
"claude": AnthropicBackend(model="claude-3-5-sonnet-latest"),
"llama": OllamaBackend(model="llama3.1:8b-instruct")
}
# Process with each provider
results = {}
for name, backend in providers.items():
    planner = OpenAIPlanner(backend, temperature=0.3)
    items = [OpenAIItem(prompt=p, n_samples=5, m=6) for p in prompts]
    metrics = planner.run(items, h_star=0.05)
    results[name] = planner.aggregate(items, metrics)
If you're already using the toolkit with OpenAI, here's how to try other providers:
# Original (OpenAI only)
from hallucination_toolkit import OpenAIBackend
backend = OpenAIBackend(model="gpt-4o-mini")
# New (Any provider) - just change these two lines:
from htk_backends import AnthropicBackend # or HuggingFaceBackend, OllamaBackend
backend = AnthropicBackend(model="claude-3-5-sonnet-latest")
# Everything else stays exactly the same!
planner = OpenAIPlanner(backend, temperature=0.3)
# ... rest of your code unchanged
Based on the Paper: Predictable Compression Failures: Why Language Models Actually Hallucinate - https://arxiv.org/abs/2509.11208
This project is licensed under the MIT License – see the LICENSE file for details.
Developed by Hassana Labs (https://hassana.io).