Logprobs Reasoning Loop with Weights & Biases Weave, an observability tool

Uncertainty-Aware Generation with OpenAI's Responses API

This project demonstrates a novel approach to improving AI model reasoning by leveraging token-level uncertainty metrics (logprobs) to create self-correcting generation loops. We compare this uncertainty-aware approach against traditional reasoning models to test whether explicit uncertainty handling can match or exceed the performance of dedicated reasoning architectures.

Core Concept

Standard inference pipelines sample one token at each step and discard the rest of the probability distribution. This project explores whether that discarded information, specifically logprobs and top-k alternatives, can be used to produce more reliable and accurate responses without requiring specialized reasoning models.

Key Innovation

We implement an uncertainty-aware generation loop that:

Generates an initial response while recording token-level logprobs, from which perplexity is computed
Automatically identifies regions of high uncertainty using logprobs
Triggers a refinement pass when uncertainty exceeds a threshold
Provides the model with explicit information about uncertain tokens and their alternatives
Produces a refined, more accurate final response
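The five steps above can be sketched as a small control loop. This is a minimal illustration, not the project's implementation: `generate` is a hypothetical callable standing in for the API call, and the refinement prompt is invented for the example.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical interface: a generator returns (text, per-token logprobs).
Generation = Tuple[str, List[float]]


def uncertainty_loop(
    question: str,
    generate: Callable[[str], Generation],
    threshold: float = 1.4,
) -> dict:
    """Generate once, then refine only if perplexity exceeds the threshold."""
    draft, logprobs = generate(question)
    # Perplexity is exp of the negative mean logprob; 1.0 means fully confident.
    ppl = math.exp(-sum(logprobs) / len(logprobs)) if logprobs else 1.0
    if ppl <= threshold:
        return {"answer": draft, "refined": False, "perplexity": ppl}

    # Illustrative refinement prompt (an assumption, not the project's wording).
    prompt = (
        f"Question: {question}\n"
        f"Draft answer: {draft}\n"
        f"The draft had perplexity {ppl:.2f}, above the threshold {threshold}. "
        "Revise the parts you were unsure about."
    )
    final, _ = generate(prompt)
    return {"answer": final, "refined": True, "perplexity": ppl}
```

Passing the generator in as a parameter keeps the loop testable without network access; a real caller would wrap the API request in that callable.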

What We're Testing

Hypothesis

Uncertainty metrics (logprobs) and top-k alternatives contain valuable reasoning signals that current transformer frameworks underutilize.

Comparison

Non-reasoning models with uncertainty loops (e.g., gpt-4.1-mini with our framework)
Native reasoning models (e.g., o4-mini). Note: these do not expose logprobs, so uncertainty analysis is not available for them

Metrics Tracked

Token-level perplexity
Average log probabilities
Response accuracy
Token usage and costs
Generation time

Technical Implementation

The project uses:

OpenAI Responses API with include=["message.output_text.logprobs"]
Weave by Weights & Biases for comprehensive experiment tracking and visualization
Perplexity-based thresholds for triggering refinement
Top-k alternatives for informing the model about uncertainty regions
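To illustrate how top-k alternatives might be turned into text the model can read, here is a hedged sketch of a report builder. The token record shape (`token`, `logprob`, `top`) and the function name are assumptions made for this example and may not match the project's internal format.

```python
import math
from typing import Any, Dict, List

# Assumed record shape for this sketch:
# {"token": str, "logprob": float, "top": [{"token": str, "logprob": float}, ...]}
TokenInfo = Dict[str, Any]


def build_uncertainty_report(
    tokens: List[TokenInfo], min_confidence: float = 0.5, context: int = 2
) -> str:
    """Describe each token whose probability is below min_confidence."""
    lines = []
    for i, tok in enumerate(tokens):
        prob = math.exp(tok["logprob"])
        if prob >= min_confidence:
            continue
        # Neighbouring tokens show where in the answer the uncertainty sits.
        before = "".join(t["token"] for t in tokens[max(0, i - context):i])
        after = "".join(t["token"] for t in tokens[i + 1:i + 1 + context])
        alts = ", ".join(
            f"{alt['token']!r} ({math.exp(alt['logprob']):.0%})" for alt in tok["top"]
        ) or "none recorded"
        lines.append(
            f"{before}[{tok['token']}]{after}: chosen at {prob:.0%}; alternatives: {alts}"
        )
    return "\n".join(lines) if lines else "No low-confidence tokens."
```

The resulting string can be appended to a refinement prompt so the model sees which positions were uncertain and what it nearly wrote instead.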

Why Weave?

Weave is essential for this project because it provides:

Persistent experiment tracking - Every run, metric, and decision is logged and queryable
Hierarchical operation tracing - See exactly how the uncertainty loop makes decisions
Production-ready observability - Transform research experiments into deployable products
Free tier available - Get started without any cost commitment

Get your free Weave API key at: https://wandb.ai/authorize

Weave enables us to:

Track every token's uncertainty metrics across experiments
Compare refinement decisions and their impacts
Build a dataset of uncertainty patterns for future research
Create reproducible experiments with full lineage tracking
Visualize the relationship between uncertainty and answer quality

Core Components

import weave

@weave.op()
def answer_difficult_question_with_uncertainty(
    question: str,
    model: str = "gpt-4.1-mini",
    top_k: int = 5,
    threshold: float = 1.4,
    temperature: float = 0.2,
):
    # Outline only; the body is elided here.
    # 1. Initial generation with logprobs
    # 2. Calculate multiple uncertainty metrics:
    #      - Perplexity from average logprobs
    #      - Maximum entropy across tokens
    #      - Count of low-confidence tokens
    # 3. Multi-metric refinement trigger
    # 4. Conditional refinement with detailed uncertainty report
    # 5. Return structured metrics and the final answer
    ...

Enhanced Uncertainty Detection

Our implementation now uses multiple complementary metrics:

Perplexity: exp(-mean(log_probabilities)) - Overall uncertainty measure
Token-level Entropy: -sum(p * log(p)) across top-k alternatives
Confidence Distribution: Count of tokens below confidence thresholds
Contextual Analysis: Shows uncertain tokens with surrounding context
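The first three metrics can be computed directly from logprobs. The sketch below is one possible reading of the formulas above; the choice of natural logarithms and the renormalization of top-k probabilities (which do not sum to 1 on their own) are assumptions, and the function names are illustrative.

```python
import math
from typing import List


def perplexity(logprobs: List[float]) -> float:
    """exp(-mean(logprobs)) over the chosen tokens of a response."""
    if not logprobs:
        return 1.0
    return math.exp(-sum(logprobs) / len(logprobs))


def topk_entropy(top_logprobs: List[float]) -> float:
    """Shannon entropy in nats over one position's top-k alternatives.

    Assumption: probabilities are renormalized over the top-k set, because
    the truncated distribution does not sum to 1.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def count_low_confidence(logprobs: List[float], min_confidence: float = 0.5) -> int:
    """Number of chosen tokens whose probability is below min_confidence."""
    return sum(1 for lp in logprobs if math.exp(lp) < min_confidence)
```

Under these assumptions, entropy over five alternatives is at most ln 5 ≈ 1.61 nats, so a threshold of 1.5 would fire only for nearly uniform distributions; the project may use a different base or normalization.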

Getting Started

Prerequisites

This project includes a vendored copy of polyfile-weave with fixes for Python 3.9+ compatibility.

Setting up Virtual Environment (Required)

# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Install dependencies (includes local polyfile-weave)
pip install -r requirements.txt

# Set up environment variables
cp env.example .env
# Edit .env with your API keys

Setting up Weave Tracking (Recommended)

Weave provides essential observability for understanding how the uncertainty loop works:

Get your free API key: Visit https://wandb.ai/authorize

Add to your .env file:

WANDB_API_KEY=your-api-key-here
WEAVE_PROJECT=weave-intro-notebook  # or your custom project name

View your experiments: After running, visit the URL printed in console to explore:
- Token-by-token uncertainty metrics
- Refinement decision rationale
- Cost and performance comparisons
- Full conversation traces with hierarchical operations

The free tier includes:

Unlimited public projects
100GB of storage
Full access to Weave features
No credit card required

Note:

The vendored polyfile-weave package is included to fix compatibility issues with reserved keywords in the upstream package.
The script includes a runtime patch for Weave to enable gql 4.0+ compatibility (see our PR for the permanent fix).

Running Locally (Python Script)

# Option 1: Use .env file (recommended)
# Edit .env with your OPENAI_API_KEY
python wb-logprobs.py

# Option 2: Export environment variable
export OPENAI_API_KEY="sk-your-key-here"
python wb-logprobs.py

# Option 3: Pass a custom question
python wb-logprobs.py "Explain the halting problem and its implications"

Troubleshooting

Weave Initialization Error: If you encounter a TypeError when initializing Weave:

# Option 1: Install compatible gql version
pip install gql==3.4.1

# Option 2: Simply run the notebook - it will automatically handle the error
# The notebook includes fallback handling and can run without W&B tracking

Reasoning Model Compatibility: The code automatically handles differences between reasoning models (such as o1 and o4-mini) and standard models:

Reasoning models don't support temperature or logprobs parameters
The code detects model type and adjusts API calls accordingly
Reasoning models won't have uncertainty metrics or refinement loops (no logprobs available)
Both model types will run successfully for comparison purposes
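One way to implement this kind of branching is to build the request arguments per model type. This is a sketch under assumptions: the prefix-based detection is a hypothetical heuristic, and `top_logprobs` is assumed to be the request parameter that controls the number of alternatives; only the `include` value comes from this README.

```python
from typing import Any, Dict

# Hypothetical heuristic: treat o-series model names as reasoning models.
REASONING_PREFIXES = ("o1", "o3", "o4")


def is_reasoning_model(model: str) -> bool:
    return model.startswith(REASONING_PREFIXES)


def build_request_kwargs(
    model: str, question: str, top_k: int = 5, temperature: float = 0.2
) -> Dict[str, Any]:
    """Arguments intended for client.responses.create(**kwargs)."""
    kwargs: Dict[str, Any] = {"model": model, "input": question}
    if not is_reasoning_model(model):
        # Only standard models accept sampling and logprob options.
        kwargs["temperature"] = temperature
        kwargs["top_logprobs"] = top_k  # assumed parameter name
        kwargs["include"] = ["message.output_text.logprobs"]
    return kwargs
```

Keeping the branching in one pure function means both model types go through the same call site, and the difference in arguments is easy to inspect in a trace.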

The notebook is designed to run even if Weave initialization fails, so you can proceed with the uncertainty experiments regardless of tracking setup.
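A fallback of this kind could look like the sketch below. It is not the notebook's actual code: `init_tracking` is a hypothetical helper, and the `init` parameter exists only so the behaviour can be exercised without a W&B account.

```python
from typing import Callable, Optional


def init_tracking(project: str, init: Optional[Callable[[str], object]] = None) -> bool:
    """Try to start Weave tracking; return False instead of raising on failure."""
    try:
        if init is None:
            # Imported lazily so a missing or broken install is handled too.
            import weave
            init = weave.init
        init(project)
        return True
    except Exception as exc:
        # For example, the TypeError caused by an incompatible gql version.
        print(f"Weave tracking disabled ({type(exc).__name__}: {exc})")
        return False
```

The caller can store the returned flag and skip tracking-specific output when it is False, so the uncertainty experiments still run.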

Running the Notebook

jupyter notebook wb-logprobs.ipynb

Results & Insights

Performance Benchmarks

Our testing so far shows:

Cost Efficiency

gpt-4.1-mini with uncertainty loop: 30-43% of o4-mini reasoning model cost
Average cost per complex question: $0.0007-$0.0011 vs $0.0019-$0.0058

Quality Metrics

Testing on controversial and complex questions (AGI predictions, ethical implications, cryptocurrency debates):

Comparable answer quality to reasoning models
Improved confidence calibration through explicit uncertainty handling
Reduced hallucination via targeted refinement

Refinement Triggers

Our multi-metric approach catches uncertainty that single metrics miss:

Perplexity threshold (>1.4)
Maximum entropy (>1.5)
High uncertainty token count (≥3 tokens <50% confidence)

API Performance Analysis

Observed latency characteristics:

Simple questions: 2-6 seconds (faster than reasoning models)
Complex technical questions: 54-67 seconds (API limitation, not our code)
Latency did not track model capability consistently: gpt-4.1 was the slowest at 99s, while gpt-4o (61s) was faster than gpt-4.1-mini (67s)

Key Findings

2.75x cost reduction compared to reasoning models while maintaining quality
Selective refinement - triggers only when an uncertainty metric crosses its threshold, not on every response
Rich uncertainty analysis provides context about specific uncertain tokens and alternatives
Hierarchical logging via Weave enables deep analysis of the decision process

Future Roadmap

Phase 1: Extended Uncertainty Metrics

Integrate pre-softmax hidden states
Incorporate raw logits analysis
Develop multi-layer uncertainty aggregation

Phase 2: Full Inference Framework

Build a production-ready inference server
Implement streaming with real-time uncertainty monitoring
Create adaptive thresholds based on task complexity

Phase 3: Model-Agnostic Implementation

Extend beyond OpenAI to open-source models
Support for local inference with uncertainty extraction
Develop uncertainty-aware fine-tuning methods

Phase 4: Advanced Applications

Multi-turn conversation uncertainty tracking
Uncertainty-guided retrieval augmentation
Collaborative uncertainty resolution across model ensembles

Key Insights

Why This Matters

Current transformer architectures make discrete token selections, discarding the rich probability distributions that could inform better reasoning. By capturing and utilizing this uncertainty information, we can:

Reduce hallucinations by identifying when models are uncertain
Improve accuracy through targeted refinement
Lower costs compared to dedicated reasoning models
Provide transparency about model confidence

The Power of Observable AI with Weave

This project demonstrates how Weave transforms experimental AI research into production-ready systems:

For Researchers:

Every experiment is automatically versioned and comparable
Uncertainty patterns become queryable datasets
Collaborate with full experiment reproducibility
Build on previous results without losing context

For Product Builders:

Monitor uncertainty metrics in production
Set alerts for high-uncertainty responses
A/B test different uncertainty thresholds
Track cost-performance tradeoffs in real-time

Data Persistence Benefits:

All logprobs and uncertainty metrics are stored permanently
Build training datasets from real uncertainty patterns
Analyze long-term trends in model confidence
Create uncertainty benchmarks for new models

The Transformer Framework Gap

The standard transformer inference pipeline:

Discards logprobs after token selection
Ignores uncertainty signals during generation
Lacks self-correction mechanisms
Provides no confidence metrics to downstream systems

Our approach addresses these limitations by treating uncertainty as a first-class citizen in the generation process.

Technical Details

For a comprehensive technical deep-dive including:

Mathematical formulas and derivations
Complete implementation details
API response processing
Example uncertainty reports
Performance analysis

See TECHNICAL.md

Quick Overview

Perplexity: exp(-mean(log_probabilities)) - Overall uncertainty measure

Entropy: -sum(p * log(p)) - Token-level uncertainty quantification

Decision Logic: Refinement triggers if:

Perplexity > 1.4 OR
Max entropy > 1.5 OR
3+ tokens with <50% confidence
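The OR logic above can be written as a small function that also records which condition fired. The dataclass and function names are illustrative; the thresholds are the ones listed in this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UncertaintyMetrics:
    perplexity: float
    max_entropy: float
    low_confidence_tokens: int  # tokens with <50% confidence


def refinement_reasons(
    m: UncertaintyMetrics,
    perplexity_threshold: float = 1.4,
    entropy_threshold: float = 1.5,
    min_low_confidence_tokens: int = 3,
) -> List[str]:
    """Return the triggers that fired; an empty list means no refinement."""
    reasons = []
    if m.perplexity > perplexity_threshold:
        reasons.append(f"perplexity {m.perplexity:.2f} > {perplexity_threshold}")
    if m.max_entropy > entropy_threshold:
        reasons.append(f"max entropy {m.max_entropy:.2f} > {entropy_threshold}")
    if m.low_confidence_tokens >= min_low_confidence_tokens:
        reasons.append(f"{m.low_confidence_tokens} low-confidence tokens")
    return reasons
```

Returning the reasons rather than a bare boolean makes the decision easy to log; `bool(refinement_reasons(m))` gives the trigger itself.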

Observability: Hierarchical @weave.op() tracking captures every decision and metric

Contributing

We welcome contributions! Areas of particular interest:

Alternative uncertainty metrics
Multi-model uncertainty aggregation
Visualization improvements
Benchmark datasets for uncertainty-aware generation

License

MIT License - See LICENSE file for details

Acknowledgments

OpenAI for providing logprobs access via their APIs
Weights & Biases team for the Weave framework
The broader AI research community exploring uncertainty quantification

Project Status: Active Development (Phase 1: Benchmark Validation in Progress - August 2025)

Contact: andrew@monostate.ai or open an issue for questions or collaboration opportunities

Citation: If you use this work in your research, please cite:

@software{weave_logprobs_reasoning,
  title = {Uncertainty-Aware Generation with Logprobs},
  author = {Monostate},
  year = {2025},
  url = {https://github.com/monostate/weave-logprobs-reasoning-loop}
}

Roadmap: Next Steps & Validation

Immediate Next Steps (August 2025)

We are currently working on:

Running ARC-AGI benchmarks to validate abstract reasoning capabilities
Testing on LogiQA 2.0 for logical reasoning validation
GSM8K evaluation to compare math problem-solving with o4-mini
Setting up automated benchmark pipeline with Weave tracking

Phase 1: Benchmark Validation (Q3 2025 - Current)

Reasoning Benchmarks

ARC-AGI - Abstract reasoning corpus
LogiQA 2.0 - Logical reasoning in natural language
GSM8K - Grade school math word problems
MATH - Competition mathematics
BigBench Hard - Challenging tasks from BIG-Bench
MMLU - Massive multitask language understanding
HumanEval - Code generation benchmarks

Goal: Demonstrate that uncertainty-aware loops achieve comparable or superior performance to reasoning models at 30-40% of the cost.

Phase 2: Agentic Applications (Q4 2025)

Browser Automation Tasks

WebArena - Realistic web navigation tasks
Mind2Web - Web interaction benchmarks
Custom browser automation with uncertainty-driven exploration

Tool Use & Function Calling

API integration with uncertainty-aware retries
Database query generation with confidence metrics
File system operations with safety checks based on uncertainty

Multi-Step Planning

Task decomposition with uncertainty propagation
Hierarchical planning with confidence thresholds
Rollback mechanisms triggered by high uncertainty

Phase 3: Chain-of-Thought Enhancement (Q4 2025 - Q1 2026)

Explicit Reasoning Traces

Uncertainty-guided CoT: Use logprobs to identify where reasoning needs expansion
Selective verbalization: Only elaborate on uncertain reasoning steps
Confidence-weighted chains: Weight reasoning paths by aggregate certainty

Comparison Studies

Standard CoT vs Uncertainty-aware CoT
Few-shot prompting with uncertainty examples
Zero-shot reasoning with automatic uncertainty detection

Phase 4: Advanced Techniques (Q1 2026)

Self-Consistency with Uncertainty

Multiple sampling with uncertainty aggregation
Weighted voting based on path confidence
Early stopping when uncertainty converges
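As one possible reading of confidence-weighted voting, each sampled answer could be weighted by the geometric mean of its token probabilities (the inverse of its perplexity). This is a speculative sketch of a roadmap item, not something the project implements today; the weighting scheme and names are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(samples: List[Tuple[str, List[float]]]) -> str:
    """Pick the answer with the highest total confidence weight.

    Each sample is (answer, token logprobs); its weight is exp(mean logprob).
    """
    if not samples:
        raise ValueError("weighted_vote needs at least one sample")
    scores: Dict[str, float] = defaultdict(float)
    for answer, logprobs in samples:
        weight = math.exp(sum(logprobs) / len(logprobs)) if logprobs else 0.0
        scores[answer] += weight
    return max(scores, key=lambda a: scores[a])
```

With this weighting, a single confident sample can outvote several low-confidence samples that agree with each other, which is where it differs from plain majority voting.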

Uncertainty-Aware Ensembles

Multi-model uncertainty aggregation
Cross-model confidence calibration
Selective model routing based on uncertainty profiles

Active Learning Integration

Identify high-uncertainty examples for human annotation
Build uncertainty-aware training datasets
Fine-tune models on uncertainty patterns

Phase 5: Production Systems (Q1-Q2 2026)

Real-World Deployments

Customer Support: Route uncertain queries to human agents
Content Generation: Flag potentially problematic content based on uncertainty
Medical/Legal AI: Mandatory uncertainty disclosure for high-stakes decisions
Educational Tools: Adapt explanations based on model confidence

Infrastructure Development

Streaming uncertainty detection
Real-time refinement triggers
Uncertainty-aware caching strategies
Cost optimization with dynamic thresholds

Phase 6: Research Extensions (Q2 2026 - Ongoing)

Theoretical Analysis

Information-theoretic bounds on uncertainty reduction
Optimal threshold learning algorithms
Uncertainty propagation in multi-turn conversations

Novel Architectures

Uncertainty-aware transformer variants
Built-in refinement mechanisms
Native uncertainty quantification layers

Cross-Domain Transfer

Uncertainty patterns across different domains
Domain-specific threshold calibration
Transfer learning for uncertainty detection

Validation Metrics

Performance Targets

Accuracy: Match or exceed reasoning model baselines
Cost: Maintain 30-40% cost ratio vs reasoning models
Latency: Optimize for <2x latency of single-pass generation
Reliability: <5% false positive refinement rate

Success Criteria

Benchmark Performance: Within 5% of reasoning model scores
Cost Efficiency: Consistent 2.5-3x cost reduction
User Studies: Preference for uncertainty-aware responses in blind tests
Production Metrics: Reduced error rates in deployed systems

Community Collaboration

We invite researchers and practitioners to:

Contribute benchmark results with your models and domains
Share uncertainty patterns discovered in your applications
Propose new metrics for uncertainty quantification
Build integrations with other frameworks and tools

Join our efforts to make AI systems more reliable through uncertainty awareness!
