CRWN102 Project A Submission
Kate Mikhailova
Based on prior work in Mayhem_Attribute_Conflation by Jaskaran Singh and Varnit Balivada
This project studies place attribute conflation: given two candidate versions of the same place record, choose whether the better value comes from the current side or the base side.
The attributes evaluated are:
`name`, `phone`, `website`, `address`, `category`
I built on top of the existing Mayhem pipeline, which already included:
- rule-based baselines:
  - Most Recent
  - Completeness
  - Confidence
  - heuristic Hybrid
- per-attribute ML trained on synthetic data
My work focused on:
- improving synthetic Yelp-based data realism
- expanding the labeled benchmark from 200 to 400 records
- improving ML confidence quality with calibration
- tuning decision thresholds for F1 instead of using a fixed 0.5
- building and sweeping a hybrid router that selectively chooses between ML and rule-based methods
- exploring a learned router as a future direction
The main result is that pure ML improved substantially, but the best final system was still a selective hybrid, not full ML replacement.
Compared with the original Mayhem repo, I added or extended the following:
- Synthetic data realism improvements
  - more realistic Yelp-derived corruptions
  - attribute-specific perturbations
  - near-equal and both-noisy cases
  - labels based on quality rather than confidence shortcuts
- Expanded benchmark
  - added `data/golden_dataset_400.json`
  - added `data/golden_dataset_next_200_template.json`
- ML improvements
  - per-attribute threshold tuning for F1
  - calibration-aware confidence using Platt scaling
  - refreshed training summaries and predictions
- Hybrid routing
  - swept a large policy space to find the best per-attribute routing strategy
  - saved best policy and summary artifacts
- Learned router
  - implemented a meta-model that learns when to trust ML vs the baseline
  - evaluated it as an exploratory future direction
- Reporting
  - unified experiment reports and combined summaries
  - added overfitting/generalization analysis
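The attribute-specific corruption idea can be sketched as below. This is a minimal illustration, not the repo's actual generator: `corrupt_phone` and its operations are hypothetical stand-ins for the logic in `scripts/generate_synthetic_dataset.py`.

```python
import random

def corrupt_phone(phone: str, rng: random.Random) -> str:
    """Apply one attribute-specific perturbation to a phone value.

    Hypothetical sketch: real perturbations live in
    scripts/generate_synthetic_dataset.py.
    """
    ops = [
        lambda p: p.replace("-", ""),                # formatting drift
        lambda p: p[:-1] + str(rng.randint(0, 9)),   # single-digit typo
        lambda p: "",                                # missing value
    ]
    return rng.choice(ops)(phone)

print(corrupt_phone("831-555-0199", random.Random(0)))
```

Near-equal and both-noisy cases follow the same pattern: corrupt one side lightly, or both sides, and label by which side ends up closer to the clean Yelp value.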
Mayhem_Attribute_Conflation/
├── data/
│ ├── golden_dataset_200.json # Original labeled benchmark
│ ├── golden_dataset_400.json # Expanded benchmark used in later reruns
│ ├── golden_dataset_next_200_template.json # Template for collecting additional labels
│ ├── synthetic_golden_dataset_2k.json # Yelp-derived synthetic training set
│ ├── project_b_samples_2k.parquet # Original Overture sample records
│ ├── processed/
│ │ ├── features_*_synthetic.parquet # Per-attribute synthetic feature sets
│ │ └── golden_dataset_*.json # Train/validation/test splits and processed sets
│ └── results/
│ ├── experiment_reports/ # Main reports, logs, sweeps, and summaries
│ ├── ml_evaluation_200_real_*.json # ML evaluation artifacts
│ ├── ml_predictions_200_real_*.json # ML predictions on labeled real data
│ ├── predictions_baseline_*.json # Rule-based baseline predictions
│ ├── predictions_exp_step5_hybrid_router_*.json
│ ├── predictions_learned_hybrid_router_*.json
│ └── *_summary.json # Summary files for hybrid and learned router runs
│
├── docs/
│ ├── README.md # Documentation index
│ ├── AI_ANNOTATION_REPORT.md # Earlier annotation comparison report
│ ├── EVALUATION_PROTOCOL.md # Reproducible experiment protocol
│ ├── overfitting_report.md # Overfitting/generalization analysis
│ ├── kate_mayhem_presentation.md # Presentation draft used for final slides
│ └── ... # Guidelines, notes, reports, and project docs
│
├── models/
│ ├── ml/
│ │ ├── name/ # Best models and training summary for name
│ │ ├── phone/ # Best models and training summary for phone
│ │ ├── website/ # Best models and training summary for website
│ │ ├── address/ # Best models and training summary for address
│ │ └── category/ # Best models and training summary for category
│ ├── rule_based/ # Baseline evaluation outputs
│ └── hybrid/ # Hybrid evaluation outputs
│
├── notebooks/
│ └── colab_pipeline_setup.ipynb # Notebook workflow for Colab
│
├── scripts/
│ ├── run_algorithm_pipeline.py # Main pipeline orchestrator
│ ├── run_repro_eval.py # Reproducible experiment runner
│ ├── generate_synthetic_dataset.py # Yelp-based synthetic data generator
│ ├── extract_features.py # Feature engineering for attribute comparison
│ ├── train_models.py # Trains logistic regression / RF / GB per attribute
│ ├── run_inference.py # Runs the best ML model on target records
│ ├── baseline_heuristics.py # Most Recent / Completeness / Confidence / Hybrid rules
│ ├── run_hybrid_router.py # Applies a fixed hybrid router policy
│ ├── sweep_hybrid_router.py # Searches many hybrid routing policies
│ ├── run_learned_hybrid_router.py # Learned router meta-model
│ ├── evaluate_models.py # Shared evaluation utilities
│ ├── analyze_results.py # Builds comparison tables and reports
│ └── calculate_agreement.py # Agreement analysis for annotation workflows
│
└── yelp/
├── yelp_academic_dataset_business.json # Yelp business data used for synthetic generation
└── Dataset_User_Agreement.pdf # Yelp dataset license / terms

The original pipeline compared several heuristic strategies:
- Most Recent: prefer the newer/current-side value
- Completeness: prefer the side with more complete information
- Confidence: prefer the side with the higher source confidence
- Hybrid: heuristic combination of rule signals
These baselines were already strong and served as the reference point for all later work.
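A minimal sketch of the first two rules follows. The record fields and tie-breaking behavior are assumptions for illustration; the actual rules live in `scripts/baseline_heuristics.py`.

```python
from datetime import date

def most_recent(current: dict, base: dict) -> str:
    """'Most Recent' rule: prefer the side updated later (ties -> current)."""
    return "current" if current["updated"] >= base["updated"] else "base"

def completeness(current: dict, base: dict) -> str:
    """'Completeness' rule: prefer the side with more non-empty fields."""
    filled = lambda rec: sum(bool(v) for k, v in rec.items() if k != "updated")
    return "current" if filled(current) >= filled(base) else "base"

cur = {"name": "Joe's Diner", "phone": "", "updated": date(2024, 5, 1)}
old = {"name": "Joes Diner", "phone": "831-555-0199", "updated": date(2023, 1, 9)}
print(most_recent(cur, old))   # newer side wins
print(completeness(cur, old))  # side with more filled fields wins
```

Note how the two rules can disagree on the same pair, which is exactly what makes a hybrid combination useful.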
The ML system is a per-attribute tabular classification pipeline, not a deep neural network.
For each attribute:
- extract comparison features from `current` and `base`
- train three candidate models:
- logistic regression
- random forest
- gradient boosting
- select the best model by validation F1
- calibrate the probability output with Platt scaling
- tune the decision threshold for best validation F1
This produces a binary decision: `current` is better, or `base` is better.
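The per-attribute recipe above can be sketched with scikit-learn. This uses synthetic stand-in features rather than the real comparison features, and simplifies the split handling; the actual training code is in `scripts/train_models.py`.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in features; the real pipeline extracts per-attribute comparison
# features (similarity scores, completeness flags, etc.).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
    "gb": GradientBoostingClassifier(random_state=0),
}
for model in candidates.values():
    model.fit(X_tr, y_tr)

# 1) select the best model by validation F1
best_name = max(candidates, key=lambda n: f1_score(y_val, candidates[n].predict(X_val)))

# 2) Platt scaling = sigmoid calibration of the probability output
calibrated = CalibratedClassifierCV(candidates[best_name], method="sigmoid", cv=3)
calibrated.fit(X_tr, y_tr)

# 3) tune the decision threshold for validation F1 instead of a fixed 0.5
probs = calibrated.predict_proba(X_val)[:, 1]
best_t = max(np.linspace(0.1, 0.9, 81),
             key=lambda t: f1_score(y_val, (probs >= t).astype(int)))
print(best_name, round(float(best_t), 2))
```

The thresholded probability, rather than the raw 0.5 cutoff, is what later feeds the hybrid router's confidence gate.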
The hybrid router is the core system-level contribution.
Instead of trusting ML everywhere, it:
- uses a selected baseline as the safe default
- lets ML override only when the policy says confidence is strong enough
The best hybrid policy was found by sweeping a large policy space over:
- fallback baseline choice
- routing mode
- threshold values
This repository searched 541,696 policy configurations.
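Confidence-gated routing itself is simple; the sweep is over which gate to use per attribute. The sketch below is hypothetical: the `policy` dict and `route` helper illustrate the gating idea, not the repo's exact policy format (see `scripts/run_hybrid_router.py` and `scripts/sweep_hybrid_router.py`).

```python
def route(attr, ml_prob, ml_pred, baseline_pred, policy):
    """Confidence-gated routing: ML overrides the baseline only when its
    calibrated probability clears the attribute's threshold."""
    mode, threshold = policy[attr]
    if mode == "gated_ml" and max(ml_prob, 1 - ml_prob) >= threshold:
        return ml_pred
    return baseline_pred  # safe default: the selected baseline

# Hypothetical per-attribute policy in the spirit of the swept winner
policy = {"name": ("gated_ml", 0.8), "phone": ("baseline_only", None)}
print(route("name", 0.93, "current", "base", policy))   # confident ML -> "current"
print(route("name", 0.55, "current", "base", policy))   # low confidence -> "base"
print(route("phone", 0.99, "current", "base", policy))  # baseline-only -> "base"
```

The sweep then enumerates fallback baselines, routing modes, and thresholds per attribute, scoring each combination on the labeled benchmark.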
The learned router is a meta-model.
It does not directly predict `current` vs `base`.
Instead, it learns:
- should we trust ML here?
- or should we trust the baseline here?
This is an exploratory direction. In the current project it did not outperform the swept hybrid, but it remains a promising future improvement if more real labeled disagreement data is collected.
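A minimal sketch of such a meta-model is below, with made-up meta-features standing in for signals like ML confidence and ML/baseline agreement; the real implementation is in `scripts/run_learned_hybrid_router.py`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Meta-features per record (hypothetical): e.g. ML confidence, agreement
# flags. Meta-label: 1 if ML was correct on that record, else 0.
rng = np.random.default_rng(0)
meta_X = rng.random((200, 3))
meta_y = (meta_X[:, 0] > 0.5).astype(int)  # toy stand-in for "ML was right"

router = LogisticRegression().fit(meta_X, meta_y)

def decide(meta_features, ml_pred, baseline_pred, trust_threshold=0.5):
    """Use ML's answer only when the meta-model predicts ML is trustworthy."""
    p_trust = router.predict_proba([meta_features])[0, 1]
    return ml_pred if p_trust >= trust_threshold else baseline_pred

print(decide([0.9, 0.5, 0.2], "current", "base"))
```

The catch, as noted above, is that training this well needs real labeled records where ML and the baseline disagree, which is exactly the data that is scarce.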
The original repo’s starting results were dominated by rule-based baselines:
| Method | Macro F1 |
|---|---|
| Most Recent | 0.8370 |
| Heuristic Hybrid | 0.8195 |
| Starting ML | 0.4600 |
This is why my work focused on improving ML quality and then combining ML with baselines more intelligently.
Pure ML improved substantially over the course of the project:
| Stage | ML Macro F1 |
|---|---|
| Starting ML | 0.4600 |
| Early cleanup | 0.7001 |
| Gated ML phase | 0.7356 |
| Final ML refresh | 0.8323 |
Current final comparisons from the combined report:
| Method | Macro F1 |
|---|---|
| Starting ML | 0.4600 |
| Final ML | 0.8323 |
| Best baseline (Most Recent) | 0.8574 |
| Best swept hybrid | 0.8491 |
| Learned router v2 | 0.8040 |
The best swept hybrid policy was:
- `name`: confidence-gated ML over `most_recent`
- `address`: confidence-gated ML over `hybrid`
- `phone`: baseline-only using `hybrid`
- `website`: baseline-only using `hybrid`
- `category`: baseline-only using `most_recent`
This is the main project conclusion:
ML was useful, but only in specific places. The winning design was not to trust ML everywhere, but to let it help only where it had strong evidence.
I also analyzed whether the final ML setup showed signs of overfitting.
Main conclusion:
- there was no strong evidence of classic train-set overfitting
- but there was evidence of shortcut-like behavior on some attributes, especially a tendency for some methods to over-predict `current`
See `docs/overfitting_report.md`. This report includes:
- train vs validation vs real F1 comparisons
- tuned cross-validation checks
- confusion-matrix counts
- FP/FN summaries
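The confusion-matrix and FP/FN checks can be illustrated on toy predictions (1 = `current`, 0 = `base`); the real numbers are in the overfitting report.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy predictions standing in for one attribute's real-data evaluation.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro")
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# A high FP count here is the signature of over-predicting "current".
print(round(macro_f1, 3), {"TP": tp, "FP": fp, "FN": fn, "TN": tn})
```

Comparing these counts across train, validation, and real splits is how shortcut behavior was distinguished from classic overfitting.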
If a grader or reviewer only opens a few files, these are the most useful:
- `README.md` - project overview and submission summary
- `data/results/experiment_reports/current_combined_report.txt` - combined comparison table across methods
- `data/results/experiment_reports/exp_step5_hybrid_router_sweep.json` - best hybrid policy and search summary
- `data/results/exp_step5_hybrid_router_best_summary.json` - final best hybrid configuration
- `data/results/learned_hybrid_router_v2_summary.json` - learned router summary
- `docs/overfitting_report.md` - overfitting and generalization analysis
- `scripts/generate_synthetic_dataset.py` - synthetic data generation logic
- `scripts/train_models.py` - ML training, calibration, threshold tuning
- `scripts/sweep_hybrid_router.py` - hybrid policy search
- `scripts/run_learned_hybrid_router.py` - learned router implementation
```bash
git clone https://github.com/project-terraforma/Mayhem_Attribute_Conflation.git
cd Mayhem_Attribute_Conflation
pip install -r requirements.txt
```

If using Git LFS for large files:

```bash
git lfs pull
```

Then run the pipeline and experiments:

```bash
python scripts/run_algorithm_pipeline.py
python scripts/run_repro_eval.py --tag exp_name_v1 --mode full --attributes name
python scripts/analyze_results.py
python scripts/sweep_hybrid_router.py
python scripts/run_learned_hybrid_router.py
```

The clearest next steps are:
- collect more real labeled data
- improve hard edge-case coverage
- strengthen per-class evaluation and false-positive / false-negative analysis
- revisit learned routing once more real disagreement data is available
- Original repository work by Jaskaran Singh and Varnit Balivada
- Data sources:
- Overture Maps Foundation
- Yelp Academic Dataset
- Course context:
- CRWN102
See LICENSE for terms.