CRWN102 Project B | Jaskaran Singh, Varnit Balivada
This project addresses the challenge of place attribute conflation for Overture Maps. When multiple data sources (e.g., Meta, Microsoft, Foursquare) describe the same real-world place, key attributes like name, address, phone, and category often conflict.
The Mayhem project implements an automated pipeline to select the most accurate and consistent attribute values, creating a high-quality, unified "Golden Record." We test two primary methods: Rule-Based Heuristics and Machine Learning Models, along with a Hybrid Ensemble, comparing their efficacy, scalability, and maintainability.
project/
├── data/
│ ├── golden_dataset_200.json # Manual ground truth (200 records, human-validated)
│ ├── synthetic_golden_dataset_2k.json # Synthetic training data derived from Yelp
│ ├── project_b_samples_2k.parquet # Original Overture input samples
│ ├── processed/ # Extracted features for ML training
│ └── results/ # Final metrics, reports, and conflated output files
│
├── docs/ # Project documentation & detailed reports
│ ├── OKRs.md # Detailed OKR tracking and progress
│ ├── attribute_guidelines.md # Labeling rules and edge case definitions
│ ├── edge_cases.md # Documented edge cases and resolutions
│ └── ...
│
├── models/ # Model artifacts and evaluation reports
│ ├── ml/ # Trained ML models (.joblib) and training summaries
│ │ ├── name/, phone/, ... # Subdirectories per attribute
│ ├── rule_based/ # Evaluation results for heuristic baselines
│ │ ├── eval_most_recent/
│ │ ├── eval_confidence/
│ │ └── eval_completeness/
│ └── hybrid/ # Evaluation results for hybrid ensemble approach
│
├── notebooks/ # Jupyter notebooks
│ └── colab_pipeline_setup.ipynb # Complete pipeline for Google Colab execution
│
├── scripts/ # Core Python pipeline scripts
│ ├── run_algorithm_pipeline.py # Main orchestrator (Data Gen -> Train -> Eval -> Inference)
│ ├── generate_synthetic_dataset.py # Generates synthetic training data from Yelp
│ ├── extract_features.py # Feature engineering (similarity, formats, etc.)
│ ├── train_models.py # ML training (Gradient Boosting, Random Forest, LogReg)
│ ├── baseline_heuristics.py # Rule-based logic implementation
│ ├── evaluate_models.py # Evaluation metrics calculation
│ └── run_inference.py # Inference engine for Overture records
│
└── yelp/ # Raw Yelp dataset (tracked via Git LFS)Clone the repository and install dependencies. Note that the Yelp dataset is managed with Git LFS.
# Clone the repo
git clone https://github.com/project-terraforma/Mayhem_Attribute_Conflation.git
cd Mayhem_Attribute_Conflation
# Install dependencies
pip install -r requirements.txt
# Pull LFS data (Yelp dataset)
git lfs pullThe master script orchestrates synthetic data generation, feature extraction, model training, evaluation, and final inference for all 5 attributes.
python scripts/run_algorithm_pipeline.pyOptions:
--attributes <list>: Specific attributes to run (e.g.,name phone). Default: all.--synthetic-limit <N>: Number of synthetic records to generate (default 2000). Use0for all ~150k records.
For faster training on the full dataset, use the provided notebook:
- Open
notebooks/colab_pipeline_setup.ipynbin Google Colab. - Run all cells to execute the pipeline in the cloud.
Generate a consolidated summary table of performance and compute metrics:
python scripts/analyze_results.pyThe project evaluated Machine Learning (Gradient Boosting/Logistic Regression), Rule-Based Baselines (Most Recent, Confidence, Completeness), and a Hybrid Ensemble approach.
Performance Metrics (F1-Score on 200 Real-World Records):
| Attribute | Best Approach | F1-Score | ML F1 | Baseline F1 | Hybrid F1 |
|---|---|---|---|---|---|
| Category | ML / Hybrid | 0.8338 | 0.8338 | 0.8338 | 0.8094 |
| Address | Hybrid / Rules | 0.8338 | 0.7921 | 0.8338 | 0.8338 |
| Phone | Hybrid / Rules | 0.8554 | 0.6929 | 0.8554 | 0.8554 |
| Website | Hybrid / Rules | 0.8323 | 0.4600 | 0.8323 | 0.8323 |
| Name | Rule-Based | 0.8338 | 0.2209 | 0.8338 | 0.7667 |
Key Insights:
- Hybrid Robustness: The Hybrid approach (combining Recency, Confidence, and Completeness) consistently matched the best-performing individual rules, proving to be a safe and robust default strategy.
- Rule-Based Wins: For structured attributes (Address, Phone), simple heuristics (especially "Most Recent") proved highly effective.
- ML Value: ML demonstrated value in complex attributes like Category, matching the best performance where simpler rules often struggled to differentiate nuances.
- Efficiency: The pipeline is extremely efficient, with inference times averaging ~0.002 ms per record, well below the 100ms target.
Implemented in scripts/baseline_heuristics.py.
- Most Recent: Selects based on source freshness.
- Confidence: Uses the upstream provider's confidence score.
- Completeness: Selects the value with the most information (e.g., fields in JSON).
Implemented in scripts/baseline_heuristics.py.
- Ensemble: Combines votes from Recency (30%), Confidence (50%), and Completeness (20%) to make a robust decision.
Implemented in scripts/train_models.py and scripts/extract_features.py.
- Training: Trained on 2,000-10,000+ synthetic records generated from Yelp data to simulate "good" vs. "bad" attributes.
- Features: String similarity (Levenshtein, Jaro-Winkler), formatting checks (HTTPS, international phone), and metadata features.
- Models: Automatically selects between Logistic Regression, Random Forest, and Gradient Boosting based on validation F1.
- Data: Overture Maps Foundation, Yelp Academic Dataset.
- Project: Created for CRWN102 at UCSC.
See LICENSE for terms.