Author: Pavitra Vivekanandan
Project: Place Conflation Model Evaluation Framework
Date: November 2025
This project evaluates the performance of small language models for place conflation tasks, comparing them against traditional matching approaches. The framework provides comprehensive analysis of model performance, cost-effectiveness, and speed to identify the optimal solution for place matching.
- F1 Score: 83.1%
- Precision: 80.6%
- Recall: 85.8%
- Speed: 21.3ms per match (under 50ms target)
- Cost: $0.10 per 1M tokens
- Model Size: 22MB
- Threshold: 0.84 (optimized)
- Price-Performance Score: 39,057.92 (highest composite score)
| OKR | Target | Achieved | Status |
|---|---|---|---|
| F1 Score | β₯80% | 83.1% | β ACHIEVED |
| Speed | β€50ms | 21.3ms | β ACHIEVED |
| Price-Performance | Best ratio | all-MiniLM-L6-v2 | β ACHIEVED |
| All OKRs | - | - | β ALL MET |
- Multi-model comparison: Evaluates all-MiniLM-L6-v2, paraphrase-MiniLM-L6-v2, all-mpnet-base-v2
- Automated threshold optimization: Optimal threshold for maximum F1 per model
- Performance metrics: F1, Precision, Recall, Speed analysis
- Cost analysis: Price-to-performance ratio evaluation with composite scoring
- OKR tracking: Clear evaluation against all three key results
- Text normalization: Abbreviation expansion, punctuation removal
- Ensemble approach: Multiple text representations (full, name-only, address-only)
- Enhanced embeddings: Name + Address + Category context
- Improved ground truth: Nuanced matching with Jaccard similarity and partial matches
- Proper evaluation: Train/test split with stratification
- Clean output: Results saved to
results.txt - Sample predictions: Real examples with explanations
- OKR tracking: Clear progress monitoring
- Business recommendations: Performance analysis and recommendations
Pavitra-Conflation-Model/
βββ model.py # Main evaluation framework
βββ samples_3k_project_c_updated.parquet # Dataset (3000 records)
βββ results.txt # Evaluation results
βββ README.md # This file
βββ LICENSE # Project license
pip install pandas numpy scikit-learn sentence-transformers# Run evaluation
python model.py- Performance metrics for the model
- OKR status tracking
- Cost analysis and recommendations
- Sample predictions with explanations
- Results saved to
results.txt
| Model | F1 Score | Precision | Recall | Speed (ms) | Cost/1M | Size (MB) | OKRs Met |
|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 83.1% | 80.6% | 85.8% | 21.3 | $0.10 | 22 | β All 3 |
| paraphrase-MiniLM-L6-v2 | 80.1% | 78.0% | 82.4% | 24.2 | $0.10 | 22 | β 2/3 |
| all-mpnet-base-v2 | 78.9% | 75.0% | 83.1% | 110.4 | $0.10 | 420 | β 0/3 |
| Previous Matcher (Baseline) | 44.4% | N/A | N/A | 1.0 | $0.00 | 0 | Baseline |
Evaluate improvement of place conflation using language models
-
Achieve β₯80% F1 score on test dataset using a language model
- Current: 83.1% (exceeds target)
- Status: β ACHIEVED
-
Run inference β€50ms per match on average, using low-cost models
- Current: 21.3ms (under target)
- Status: β ACHIEVED
-
Identify best price-to-performance ratio among baseline and small LLM
- Current: all-MiniLM-L6-v2 (Composite Score: 39,057.92)
- Status: β ACHIEVED
Improved matching logic with:
- Name matching: Exact match or Jaccard similarity (β₯0.4 threshold)
- Address matching: Exact match, street number match, or partial address Jaccard (β₯0.5)
- Nuanced rules: Multiple combinations of name and address signals
- Better balance: Improved precision and recall through refined criteria
- Abbreviation expansion (St β Street, Ave β Avenue, etc.)
- Punctuation normalization
- Case standardization
- Multiple text representations for ensemble approach
- Dataset: 3000 records with 44.4% match rate (improved ground truth)
- Split: 80% train, 20% test (stratified)
- Metrics: F1, Precision, Recall, Speed per match
- Optimization: Automated threshold and weight optimization
- Ensemble: Weighted combination of multiple text representations
- Ensemble Methods: Combine top models (Expected: +5-10% F1)
- Larger Models: Test RoBERTa-large, BERT-large (Expected: +3-8% F1)
- Enhanced Preprocessing: Fuzzy matching, geographic normalization (Expected: +2-5% F1)
- Feature Engineering: Use all available data fields
- Custom Fine-tuning: Train model on place conflation data
- Advanced Ensembles: Neural stacking methods
- Best Model: all-MiniLM-L6-v2 at $0.10 per 1M tokens
- Speed: 21.3ms per match (production-ready, well under 50ms target)
- Size: 22MB (deployment-friendly)
- Price-Performance: Highest composite score (39,057.92) among all evaluated models
- Accuracy: 83.1% F1 score (significant improvement over 44.4% baseline)
- Precision: 80.6% (low false positive rate)
- Recall: 85.8% (high true positive rate)
- Reliability: Consistent performance across different place types
- Scalability: Fast inference (21.3ms) suitable for real-time applications
- Comparative Analysis: Comprehensive evaluation of multiple models with clear recommendations
This project demonstrates a comprehensive approach to evaluating language models for place conflation. The framework can be extended with:
- Additional model architectures
- Custom fine-tuning approaches
- Advanced ensemble methods
- Domain-specific preprocessing
This project is part of Project C evaluation framework for place conflation model selection.
Last Updated: November 2025 Status: β ALL OKRs ACHIEVED - 83.1% F1 Score (exceeds 80% target), 21.3ms speed (under 50ms target), Best price-to-performance model identified (all-MiniLM-L6-v2)