This project demonstrates a novel approach to improving AI model reasoning by leveraging token-level uncertainty metrics (logprobs) to create self-correcting generation loops. We compare this uncertainty-aware approach against traditional reasoning models to test whether explicit uncertainty handling can match or exceed the performance of dedicated reasoning architectures.
Modern transformers typically discard valuable uncertainty information during inference. This project explores whether we can harness this discarded information—specifically logprobs and top-k alternatives—to create more reliable and accurate AI responses without requiring specialized reasoning models.
We implement an uncertainty-aware generation loop (a runnable sketch follows this list) that:
- Generates an initial response while tracking token-level uncertainty (perplexity)
- Automatically identifies regions of high uncertainty using logprobs
- Triggers a refinement pass when uncertainty exceeds a threshold
- Provides the model with explicit information about uncertain tokens and their alternatives
- Produces a refined, more accurate final response
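To make the control flow concrete, here is a minimal runnable sketch. Note that call_model is a stand-in for the real Responses API call, and the refinement prompt wording is illustrative, not the project's actual prompt:

import math
from typing import List, Tuple

def call_model(prompt: str) -> Tuple[str, List[float]]:
    # Stand-in for an OpenAI Responses API call that returns the answer
    # text plus one logprob per generated token.
    return "stub answer", [-0.1, -0.9, -1.2]

def generation_loop(question: str, threshold: float = 1.4) -> str:
    answer, logprobs = call_model(question)             # initial pass
    ppl = math.exp(-sum(logprobs) / len(logprobs))      # perplexity
    if ppl > threshold:
        # Refinement pass: hand the model an explicit uncertainty report
        note = f"Your draft had perplexity {ppl:.2f} (threshold {threshold})."
        answer, _ = call_model(f"{question}\n\n{note}\nRevise: {answer}")
    return answer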
Uncertainty metrics (logprobs) and top-k alternatives contain valuable reasoning signals that current transformer frameworks underutilize.
- Non-reasoning models with uncertainty loops (e.g., gpt-4.1-mini with our framework)
- Native reasoning models (e.g., o4-mini) - Note: These don't expose logprobs, so uncertainty analysis is not available
- Token-level perplexity
- Average log probabilities
- Response accuracy
- Token usage and costs
- Generation time
The project uses:
- OpenAI Responses API with include=["message.output_text.logprobs"] (see the example call after this list)
- Weave by Weights & Biases for comprehensive experiment tracking and visualization
- Perplexity-based thresholds for triggering refinement
- Top-k alternatives for informing the model about uncertainty regions
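For reference, a single call in this style might look like the following sketch (parameter names follow the OpenAI Python SDK at the time of writing; the question text is just an example):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request per-token logprobs plus top-k alternatives alongside the answer.
response = client.responses.create(
    model="gpt-4.1-mini",
    input="When did the Berlin Wall fall?",
    temperature=0.2,
    top_logprobs=5,
    include=["message.output_text.logprobs"],
)
print(response.output_text)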
Weave is essential for this project because it provides:
- Persistent experiment tracking - Every run, metric, and decision is logged and queryable
- Hierarchical operation tracing - See exactly how the uncertainty loop makes decisions
- Production-ready observability - Transform research experiments into deployable products
- Free tier available - Get started without any cost commitment
Get your free Weave API key at: https://wandb.ai/authorize
With a one-line initialization (shown after this list), Weave enables us to:
- Track every token's uncertainty metrics across experiments
- Compare refinement decisions and their impacts
- Build a dataset of uncertainty patterns for future research
- Create reproducible experiments with full lineage tracking
- Visualize the relationship between uncertainty and answer quality
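All of this tracking starts with a single initialization call; a minimal sketch, using the default project name from the setup section below:

import weave

# Point Weave at the project named in .env (WEAVE_PROJECT); every function
# decorated with @weave.op() is then traced under that project.
weave.init("weave-intro-notebook")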
@weave.op()
def answer_difficult_question_with_uncertainty(
    question: str,
    model: str = "gpt-4.1-mini",
    top_k: int = 5,
    threshold: float = 1.4,
    temperature: float = 0.2
):
    # 1. Initial generation with logprobs enabled
    # 2. Calculate multiple uncertainty metrics:
    #    - Perplexity from average logprobs
    #    - Maximum entropy across tokens
    #    - Count of low-confidence tokens
    # 3. Multi-metric refinement trigger
    # 4. Conditional refinement with a detailed uncertainty report
    # 5. Return structured metrics and the final answer
    ...

Our implementation now uses multiple complementary metrics (computed as sketched after the list):
- Perplexity: exp(-mean(log_probabilities)) - Overall uncertainty measure
- Token-level Entropy: -sum(p * log(p)) across top-k alternatives
- Confidence Distribution: Count of tokens below confidence thresholds
- Contextual Analysis: Shows uncertain tokens with surrounding context
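As an illustration of the non-perplexity metrics, here is how entropy over top-k alternatives and the confidence count could be computed from raw logprobs (a sketch; note that top-k probabilities need not sum to 1, so the entropy is an approximation):

import math
from typing import List

def token_entropy(top_k_logprobs: List[float]) -> float:
    # -sum(p * log(p)) over one token's top-k alternatives
    probs = [math.exp(lp) for lp in top_k_logprobs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def low_confidence_count(chosen_logprobs: List[float], floor: float = 0.5) -> int:
    # Confidence distribution: how many chosen tokens fall below 50% probability
    return sum(1 for lp in chosen_logprobs if math.exp(lp) < floor)

print(token_entropy([math.log(0.99), math.log(0.01)]))  # ~0.056, near-certain
print(token_entropy([math.log(0.5), math.log(0.5)]))    # ~0.693, a coin flip
print(low_confidence_count([-0.05, -1.2, -0.9]))        # 2 uncertain tokens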
This project includes a vendorized version of polyfile-weave with fixes for Python 3.9+ compatibility.
# Create a virtual environment
python3 -m venv venv
# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
# Install dependencies (includes local polyfile-weave)
pip install -r requirements.txt
# Set up environment variables
cp env.example .env
# Edit .env with your API keys

Weave provides essential observability for understanding how the uncertainty loop works:
- Get your free API key: Visit https://wandb.ai/authorize
- Add to your .env file:
WANDB_API_KEY=your-api-key-here
WEAVE_PROJECT=weave-intro-notebook  # or your custom project name
- View your experiments: After running, visit the URL printed in the console to explore:
- Token-by-token uncertainty metrics
- Refinement decision rationale
- Cost and performance comparisons
- Full conversation traces with hierarchical operations
The free tier includes:
- Unlimited public projects
- 100GB of storage
- Full access to Weave features
- No credit card required
Note:
- The vendorized polyfile-weave package is included to fix compatibility issues with reserved keywords in the upstream package.
- The script includes a runtime patch for Weave to enable gql 4.0+ compatibility (see our PR for the permanent fix).
# Option 1: Use .env file (recommended)
# Edit .env with your OPENAI_API_KEY
python wb-logprobs.py
# Option 2: Export environment variable
export OPENAI_API_KEY="sk-your-key-here"
python wb-logprobs.py
# Option 3: Pass a custom question
python wb-logprobs.py "Explain the halting problem and its implications"Weave Initialization Error:
If you encounter a TypeError when initializing Weave:
# Option 1: Install compatible gql version
pip install gql==3.4.1
# Option 2: Simply run the notebook - it will automatically handle the error
# The notebook includes fallback handling and can run without W&B tracking

Reasoning Model Compatibility: The code automatically handles differences between reasoning models (o1, o4) and standard models:
- Reasoning models don't support temperature or logprobs parameters
- The code detects model type and adjusts API calls accordingly (a minimal dispatch sketch follows this list)
- Reasoning models won't have uncertainty metrics or refinement loops (no logprobs available)
- Both model types will run successfully for comparison purposes
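A minimal sketch of what that dispatch could look like (the model-name prefixes are assumptions based on the models mentioned above, not an exhaustive list):

def is_reasoning_model(model: str) -> bool:
    # o-series models (o1, o4-mini, ...) expose neither temperature nor
    # logprobs, so the uncertainty loop is skipped for them.
    return model.startswith(("o1", "o4"))

def build_request_kwargs(model: str, top_k: int, temperature: float) -> dict:
    if is_reasoning_model(model):
        return {"model": model}  # bare call: no temperature, no logprobs
    return {
        "model": model,
        "temperature": temperature,
        "top_logprobs": top_k,
        "include": ["message.output_text.logprobs"],
    }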
The notebook is designed to run even if Weave initialization fails, so you can proceed with the uncertainty experiments regardless of tracking setup.
jupyter notebook wb-logprobs.ipynb

Our comprehensive testing reveals impressive results:
- gpt-4.1-mini with uncertainty loop: 30-43% of o4-mini reasoning model cost
- Average cost per complex question: $0.0007-$0.0011 vs $0.0019-$0.0058
Testing on controversial and complex questions (AGI predictions, ethical implications, cryptocurrency debates):
- Comparable answer quality to reasoning models
- Improved confidence calibration through explicit uncertainty handling
- Reduced hallucination via targeted refinement
Our multi-metric approach catches uncertainty that single metrics miss:
- Perplexity threshold (>1.4)
- Maximum entropy (>1.5)
- High uncertainty token count (≥3 tokens <50% confidence)
Discovered significant performance characteristics:
- Simple questions: 2-6 seconds (faster than reasoning models)
- Complex technical questions: 54-67 seconds (API limitation, not our code)
- More capable models tended to respond more slowly, though not strictly monotonically (gpt-4.1: 99s, gpt-4o: 61s, gpt-4.1-mini: 67s)
- 2.75x cost reduction compared to reasoning models while maintaining quality
- Intelligent refinement - only triggers when genuinely uncertain (not for all responses)
- Rich uncertainty analysis provides context about specific uncertain tokens and alternatives
- Hierarchical logging via Weave enables deep analysis of the decision process
- Integrate pre-softmax hidden states
- Incorporate raw logits analysis
- Develop multi-layer uncertainty aggregation
- Build a production-ready inference server
- Implement streaming with real-time uncertainty monitoring
- Create adaptive thresholds based on task complexity
- Extend beyond OpenAI to open-source models
- Support for local inference with uncertainty extraction
- Develop uncertainty-aware fine-tuning methods
- Multi-turn conversation uncertainty tracking
- Uncertainty-guided retrieval augmentation
- Collaborative uncertainty resolution across model ensembles
Current transformer architectures make discrete token selections, discarding the rich probability distributions that could inform better reasoning. By capturing and utilizing this uncertainty information, we can:
- Reduce hallucinations by identifying when models are uncertain
- Improve accuracy through targeted refinement
- Lower costs compared to dedicated reasoning models
- Provide transparency about model confidence
This project demonstrates how Weave transforms experimental AI research into production-ready systems:
For Researchers:
- Every experiment is automatically versioned and comparable
- Uncertainty patterns become queryable datasets
- Collaborate with full experiment reproducibility
- Build on previous results without losing context
For Product Builders:
- Monitor uncertainty metrics in production
- Set alerts for high-uncertainty responses
- A/B test different uncertainty thresholds
- Track cost-performance tradeoffs in real-time
Data Persistence Benefits:
- All logprobs and uncertainty metrics are stored permanently
- Build training datasets from real uncertainty patterns
- Analyze long-term trends in model confidence
- Create uncertainty benchmarks for new models
The standard transformer inference pipeline:
- Discards logprobs after token selection
- Ignores uncertainty signals during generation
- Lacks self-correction mechanisms
- Provides no confidence metrics to downstream systems
Our approach addresses these limitations by treating uncertainty as a first-class citizen in the generation process.
For a comprehensive technical deep-dive including:
- Mathematical formulas and derivations
- Complete implementation details
- API response processing
- Example uncertainty reports
- Performance analysis
See TECHNICAL.md
Perplexity: exp(-mean(log_probabilities)) - Overall uncertainty measure
Entropy: -sum(p * log(p)) - Token-level uncertainty quantification
Decision Logic: Refinement triggers if any of the following holds (sketched in code after this list):
- Perplexity > 1.4 OR
- Max entropy > 1.5 OR
- 3+ tokens with <50% confidence
Observability: Hierarchical @weave.op() tracking captures every decision and metric
We welcome contributions! Areas of particular interest:
- Alternative uncertainty metrics
- Multi-model uncertainty aggregation
- Visualization improvements
- Benchmark datasets for uncertainty-aware generation
- OpenAI Responses API Documentation
- Weave: LLM Application Development Framework
- Information Theory and Neural Networks
MIT License - See LICENSE file for details
- OpenAI for providing logprobs access via their APIs
- Weights & Biases team for the Weave framework
- The broader AI research community exploring uncertainty quantification
Project Status: Active Development (Phase 1: Benchmark Validation in Progress - August 2025)
Contact: andrew@monostate.ai or open an issue for questions or collaboration opportunities
Citation: If you use this work in your research, please cite:
@software{weave_logprobs_reasoning,
title = {Uncertainty-Aware Generation with Logprobs},
author = {Monostate},
year = {2025},
url = {https://github.com/monostate/weave-logprobs-reasoning-loop}
}

We are currently working on:
- Running ARC-AGI benchmarks to validate abstract reasoning capabilities
- Testing on LogiQA 2.0 for logical reasoning validation
- GSM8K evaluation to compare math problem-solving with o4-mini
- Setting up automated benchmark pipeline with Weave tracking
- ARC-AGI - Abstract reasoning corpus
- LogiQA 2.0 - Logical reasoning in natural language
- GSM8K - Grade school math word problems
- MATH - Competition mathematics
- BigBench Hard - Challenging tasks from BIG-Bench
- MMLU - Massive multitask language understanding
- HumanEval - Code generation benchmarks
Goal: Demonstrate that uncertainty-aware loops achieve comparable or superior performance to reasoning models at 30-40% of the cost.
- WebArena - Realistic web navigation tasks
- Mind2Web - Web interaction benchmarks
- Custom browser automation with uncertainty-driven exploration
- API integration with uncertainty-aware retries
- Database query generation with confidence metrics
- File system operations with safety checks based on uncertainty
- Task decomposition with uncertainty propagation
- Hierarchical planning with confidence thresholds
- Rollback mechanisms triggered by high uncertainty
- Uncertainty-guided CoT: Use logprobs to identify where reasoning needs expansion
- Selective verbalization: Only elaborate on uncertain reasoning steps
- Confidence-weighted chains: Weight reasoning paths by aggregate certainty
- Standard CoT vs Uncertainty-aware CoT
- Few-shot prompting with uncertainty examples
- Zero-shot reasoning with automatic uncertainty detection
- Multiple sampling with uncertainty aggregation
- Weighted voting based on path confidence
- Early stopping when uncertainty converges
- Multi-model uncertainty aggregation
- Cross-model confidence calibration
- Selective model routing based on uncertainty profiles
- Identify high-uncertainty examples for human annotation
- Build uncertainty-aware training datasets
- Fine-tune models on uncertainty patterns
- Customer Support: Route uncertain queries to human agents
- Content Generation: Flag potentially problematic content based on uncertainty
- Medical/Legal AI: Mandatory uncertainty disclosure for high-stakes decisions
- Educational Tools: Adapt explanations based on model confidence
- Streaming uncertainty detection
- Real-time refinement triggers
- Uncertainty-aware caching strategies
- Cost optimization with dynamic thresholds
- Information-theoretic bounds on uncertainty reduction
- Optimal threshold learning algorithms
- Uncertainty propagation in multi-turn conversations
- Uncertainty-aware transformer variants
- Built-in refinement mechanisms
- Native uncertainty quantification layers
- Uncertainty patterns across different domains
- Domain-specific threshold calibration
- Transfer learning for uncertainty detection
- Accuracy: Match or exceed reasoning model baselines
- Cost: Maintain 30-40% cost ratio vs reasoning models
- Latency: Optimize for <2x latency of single-pass generation
- Reliability: <5% false positive refinement rate
- Benchmark Performance: Within 5% of reasoning model scores
- Cost Efficiency: Consistent 2.5-3x cost reduction
- User Studies: Preference for uncertainty-aware responses in blind tests
- Production Metrics: Reduced error rates in deployed systems
We invite researchers and practitioners to:
- Contribute benchmark results with your models and domains
- Share uncertainty patterns discovered in your applications
- Propose new metrics for uncertainty quantification
- Build integrations with other frameworks and tools
Join our efforts to make AI systems more reliable through uncertainty awareness!