Classifies whether a POI is Open or Closed based on its digital footprint and recency signals from Overture Maps releases.
| Metric | Open | Closed |
|---|---|---|
| Balanced Accuracy (overall) | 89.29% | |
| AUC (overall) | 95.30% | |
| Precision | 99.2% | 75.9% |
| Recall | 79.5% | 99.0% |
Evaluated on 46,907 hold-out rows (Chicago + Miami, never seen during training). Threshold = 0.50. Ensemble: CatBoost-A × 0.7 + LightGBM-A × 0.3.
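The ensemble above is a weighted average of the two models' probabilities followed by a fixed decision threshold; a minimal sketch (weights and threshold from the table, helper names illustrative):

```python
import numpy as np

def ensemble_open_probability(p_catboost, p_lightgbm, w_cb=0.7):
    """Weighted blend from the table: CatBoost-A x 0.7 + LightGBM-A x 0.3."""
    return w_cb * np.asarray(p_catboost) + (1.0 - w_cb) * np.asarray(p_lightgbm)

def classify(p_open, threshold=0.50):
    """Label a POI 'Open' when the blended P(open) clears the threshold."""
    return np.where(np.asarray(p_open) >= threshold, "Open", "Closed")

# Two illustrative POIs scored by both models
blended = ensemble_open_probability([0.9, 0.2], [0.8, 0.4])
labels = classify(blended)  # first POI "Open", second "Closed"
```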
Top Features (CatBoost-A):
| Rank | Feature | Importance | Description |
|---|---|---|---|
| 1 | recency_spread | 29.8% | Range between oldest and newest source update timestamps |
| 2 | recency_pca | 21.8% | PCA of recency metrics (fit on training rows only) |
| 3 | zombie_score | 19.7% | Source count / avg staleness — "database purgatory" signal |
| 4 | identity_change_score | 12.8% | Sum of name, category, and address changes |
| 5 | is_brand | 2.6% | Place matches a known brand chain |
| 6 | total_digital | 2.3% | Count of distinct digital presence types |
| 7 | category_primary | 2.2% | Business category (CatBoost native encoding) |
| 8 | consecutive_present | 1.6% | Longest consecutive run of release appearances |
| 9 | has_phone | 1.6% | Phone number present in base snapshot |
| 10 | releases_seen | 1.4% | Number of releases this place appeared in before closure |
High-Quality Closed (HQC) labels + 60/40 rebalancing
- Closed label now requires a place to be present in 2 consecutive past releases and absent in the next — confirmed churners with trajectory history.
- All 142,931 HQC places kept (no cap). Open downsampled globally to 60/40.
- Previous pipeline capped closed at 3,000/pair and discarded 97% of available signal.
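The HQC rule reduces to a set operation over per-release ID sets; a minimal sketch (release variables are illustrative, not real Overture IDs):

```python
def hqc_closed_ids(releases):
    """High-Quality Closed (HQC) rule: a place counts as closed when it
    appears in two consecutive releases and is absent from the next one."""
    closed = set()
    for r0, r1, r2 in zip(releases, releases[1:], releases[2:]):
        closed |= (r0 & r1) - r2
    return closed

# Toy release ID sets, ordered oldest → newest
r_jan = {"a", "b", "c"}
r_feb = {"a", "b"}
r_mar = {"a"}

# "b" is present in Jan and Feb but gone in Mar → confirmed churner;
# "c" appears only once, so it lacks trajectory history and is excluded.
print(hqc_closed_ids([r_jan, r_feb, r_mar]))  # {'b'}
```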
Leak fixes applied on top of V7:
| Issue | Root Cause | Fix |
|---|---|---|
| PCA fitted on full dataset | recency_pca was computed before train/test split; hold-out data influenced PCA direction | PCA now fit on training rows only in step3 (after split); days_latest/days_avg passed as passthrough columns |
| Hold-out used for optimisation | Ensemble weights + threshold searched over 918 combinations against y_test, then reported as accuracy | Weights and threshold now chosen via OOF (cross_val_predict on y_train); hold-out used only for final unbiased reporting |
| Single reference date across pairs | Staleness computed against the newest release date for all rows; pair-0 places appeared ~28 days older than pair-1 with identical update dates | Recency computed per release_date_current group so each pair uses its own prediction-window endpoint |
| Digital presence used post-event values | has_website, num_socials, etc. used COALESCED (R_{i+1}) data for open places but R_i data for churned places — asymmetric measurement window | All presence features now use base_* columns (R_i) for both classes |
| LightGBM missing category feature | LGBM received numeric features only, missing the 7.7%-importance category_primary | LabelEncoder fitted on training rows; LGBM receives category_encoded |
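The PCA fix follows the standard fold-safe pattern: fit on training rows only, then apply the frozen transform to the hold-out. A sketch with synthetic stand-in data (scikit-learn API; column semantics illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # stand-in for correlated recency columns
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit PCA on training rows only; hold-out rows never influence the direction.
pca = PCA(n_components=1).fit(X_train)
train_pca = pca.transform(X_train)
test_pca = pca.transform(X_test)     # frozen transform applied to hold-out
```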
V7 leak fixes (still applied):
| Issue | Fix |
|---|---|
| Double-encoded JSON zeroed digital presence, sources, recency | CAST(AS VARCHAR) in step1 SQL |
| releases_seen=2 was a proxy for label=0 | Anchor both future churners AND an equal-size stable-open sample in pair-0 |
| COALESCE-induced staleness asymmetry | Staleness from base_sources for all places |
| Overture confidence signal | Removed; 5 confidence-derived features dropped |
V5 leak fixes (still applied):
| Issue | Fix |
|---|---|
| confidence NaN-fill → perfect closed signal | Use base_confidence only; drop delta_confidence, confidence_momentum |
| category_churn_risk computed globally from all labels | Removed; category_primary passed as CatBoost native categorical |
Drop Overture release parquets into overture_releases/ (see overture_releases/README.md for the naming convention), then run:

```shell
# Build release files from raw per-city parquets (one-time setup)
python scripts/data_processing/build_release_files.py

# Run the full pipeline: data → features → training
# Default holdout: Chicago + Miami; default balance: 60% open / 40% closed
python pipeline/run_pipeline.py
```

Trained models land in pipeline_output/models/.

Key options:

```shell
python pipeline/run_pipeline.py \
    --holdout-cities chicago miami \
    --target-open-rate 0.6 \
    --cv-folds 5
```

With 3+ releases the pipeline activates trajectory features (pre_closure_loss, social_trend, releases_seen, consecutive_present) that capture pre-closure behaviour — directly addressing the 2-release limitation where all delta features are 0 for churned places by construction.
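Of the trajectory features, consecutive_present reduces to a longest-run computation over per-release presence flags; a sketch (the function name matches the feature, the implementation is illustrative):

```python
def consecutive_present(present_flags):
    """Longest consecutive run of release appearances.
    present_flags: 1/0 per release, ordered oldest → newest."""
    best = run = 0
    for seen in present_flags:
        run = run + 1 if seen else 0
        best = max(best, run)
    return best

# Seen in releases 1 and 2, missing in 3, back in 4 → longest run is 2
print(consecutive_present([1, 1, 0, 1]))  # 2
```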
See pipeline/README.md for the full guide.
Sits on top of the classifier and researches low-confidence predictions (default threshold: 0.65) via targeted web search + LLM verdict.
| Mode | Script | Use Case |
|---|---|---|
| Sync (interactive) | scripts/agent/main.py | Approval-gated: review the research plan before execution |
| Async (high-throughput) | scripts/agent/async_main.py | 3 parallel research workers + live dashboard |
Requires GROQ_API_KEY and TAVILY_API_KEY. See docs/v6_agent_architecture.md.
```
StatusNow/
├── overture_releases/          ← Drop Overture parquet releases here
│   └── README.md
│
├── pipeline/                   ← Training pipeline (start here)
│   ├── run_pipeline.py         ← Single command to train a new model
│   ├── step1_build_training_data.py
│   ├── step2_feature_engineering.py
│   ├── step3_train.py
│   └── README.md
│
├── scripts/
│   ├── data_processing/
│   │   ├── build_release_files.py      ← Build overture_releases/ parquets
│   │   ├── fetch_overture_expanded.py  ← Fetch any city from Overture S3
│   │   ├── build_truth_expanded.py     ← Build + merge multi-city truth datasets
│   │   └── merge_cities.py
│   │
│   ├── experiments/
│   │   ├── v5_train_best.py            ← Train best model, export predictions
│   │   ├── v5_full_benchmark.py        ← Full CV + all models + ensemble search
│   │   ├── v6_enrichment_experiment.py
│   │   └── exp_predictive_labels.py    ← R2-oracle experiment (see below)
│   │
│   ├── agent/                          ← V6 AI agent layer
│   │   ├── main.py
│   │   ├── async_main.py
│   │   ├── config.py
│   │   ├── llm/interface.py
│   │   ├── ingest.py
│   │   ├── planner.py
│   │   ├── executor.py
│   │   └── schemas.py
│   │
│   ├── research/                       ← Research history (V3 → V5)
│   │   ├── README.md
│   │   ├── v5_holdout_eval.py
│   │   ├── process_data_v5.py
│   │   └── ...
│   │
│   └── archived/                       ← V1/V2 era scripts
│
└── data/
    └── combined_truth_dataset_expanded.parquet  ← V4 gold standard (123k rows, 12 cities)
```
This section chronicles our progress from the initial baseline to the current pipeline.
- Goal: Establish a baseline using "Delta Features" (comparing historical baseline vs current data).
- Method: Calculated net change in websites, socials, and phones.
- Key Insight: has_gained_social (r=+0.26) was the strongest single predictor; has_any_loss (r=-0.17) was a reliable closure signal.
- Result: 67.3% Balanced Accuracy. Knowing that something changed was good, but not enough.
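Delta features of this kind can be sketched in a few lines of pandas (column names are illustrative, not the pipeline's actual schema):

```python
import pandas as pd

# Hypothetical base/current snapshots for three places
snap = pd.DataFrame({
    "num_socials_base":    [2, 1, 0],
    "num_socials_current": [3, 0, 0],
})

# Delta feature: net change between the historical baseline and current data
snap["delta_socials"] = snap["num_socials_current"] - snap["num_socials_base"]
snap["has_gained_social"] = (snap["delta_socials"] > 0).astype(int)
snap["has_any_loss"] = (snap["delta_socials"] < 0).astype(int)
```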
- Goal: Capture nuance with Interaction Features and PCA.
- Innovation:
- Zombie Score: Identified places with many sources but stale data ("Database Purgatory").
- Category Risk: Modeled that gas stations close less often (10% churn) than boutiques (45% churn).
- PCA: Reduced redundancy between correlated recency features (98% variance explained).
- Result: 70.65% Balanced Accuracy. Temporal context ("when did it change?") proved critical.
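Taking the zombie-score description at face value (source count over average staleness), a sketch; the production feature's exact scaling may differ:

```python
def zombie_score(num_sources, avg_staleness_days, eps=1.0):
    """Ratio described in the feature table: source count over average
    staleness in days (eps avoids division by zero for fresh places).
    The real pipeline's scaling is an assumption here."""
    return num_sources / (avg_staleness_days + eps)

# Many sources but very stale updates vs. few, fresh sources
stale_many = zombie_score(num_sources=8, avg_staleness_days=400)
fresh_few = zombie_score(num_sources=2, avg_staleness_days=10)
```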
- Goal: Tackle label noise in the manually labeled dataset.
- Innovation: "Dynamic Label Refinement" using 5-fold cross-validation.
- Findings: Identified 65 samples (2.2%) where the model was >90% confident the human label was wrong.
- Result: Removing these likely errors boosted accuracy to 72.09%.
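The refinement loop can be sketched with out-of-fold probabilities: each row is scored by a model that never trained on it, and rows where the model confidently disagrees with the recorded label are flagged. A sketch on synthetic data (scikit-learn API; the real pipeline's model and thresholds may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
y_noisy = y.copy()
y_noisy[:10] = 1 - y_noisy[:10]        # plant a few wrong labels

# Out-of-fold P(class 1): each row scored by a model that never saw it
proba = cross_val_predict(LogisticRegression(), X, y_noisy, cv=5,
                          method="predict_proba")[:, 1]

# Flag rows where the model is >90% confident the recorded label is wrong
suspect = ((proba > 0.9) & (y_noisy == 0)) | ((proba < 0.1) & (y_noisy == 1))
```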
- Goal: Validate concepts on a larger, cleaner, ground-truth dataset.
- Replication Method (Script: scripts/data_processing/build_truth_dataset.py):
  - Fetch Data: Used fetch_overture_data.py to download places from Overture S3 (Jan 2026 vs Feb 2026) for the NYC BBox.
  - Define Closed: A place is considered closed if:
    - It existed in the Previous release but its ID is missing from the Current release (churned).
    - OR it exists in Current but explicitly has operating_status = 'closed'.
  - Define Open: Exists in Current and operating_status != 'closed'.
  - Balance: Downsampled to 3k Open / 3k Closed to match Season 2 distribution.
- Result: 92.87% Balanced Accuracy.
- Major Lesson: The V3 features were highly effective, but the original dataset's noise and size were holding them back.
- Warning: We discovered a massive performance gap between Brands (97% Accuracy) and Small Businesses (67% Accuracy), suggesting future work should treat them as separate problems.
- Goal: Validate if the model works beyond NYC.
- Method: Replicated the pipeline for San Francisco (SF) and created a combined dataset.
- Results:
- SF Accuracy: 91.39% (despite fewer closed samples).
- Combined Model: 85.21% Balanced Accuracy on 18,619 samples.
- Key Insight: The initial 95% result was inflated by a data leak (Confidence score). After fixing it, the model stabilized at ~85%, and uniquely, the Brand Gap disappeared (Brands vs Non-Brands now perform equally).
- Goal: Improve from 85% → 90% Balanced Accuracy.
- Leakage Discovery: processed_for_ml_testing.parquet was built with confidence = 0 for 3,000 churned NYC places (NaN-fill bug). This gave the model a near-perfect closed signal — the true leak-free baseline was 80.5%. category_churn_risk (computed globally from labels) also contributed minor leakage.
- Strategy: Scale the dataset dramatically across diverse cities using Overture S3.
- Data Expansion: Fetched 10 new US cities (Chicago, LA, Houston, Phoenix, Philadelphia, Seattle, Denver, Boston, Miami, Atlanta) → 123,082 samples from 12 cities.
- V4 Features: Extended to 95 features — added identity-change signals (name_changed, website_domain_changed, identity_change_score), richer per-channel gain/loss flags, and interaction terms.
- Results (leaky CV): CatBoost + LightGBM ensemble: 89.18%.
- Key Insight: More data >> better models. HPO added only ~0.1 pp; going from 12k → 123k added ~8.7 pp.
- Goal: Produce an honest, production-grade evaluation with all leakages fixed.
- Leakage Audit:
  - confidence NaN-fill: churned places (93.7% of closed) had confidence=null → filled with 0 → near-perfect closed signal. Fix: use base_confidence (Jan 2026 value) only; drop delta_confidence and confidence_momentum.
  - category_churn_risk computed globally from all 123k labels before CV → 0.50 correlation with target. Fix: removed; replaced with category_primary as a CatBoost native categorical feature (fold-safe internal target encoding).
  - Evaluation: all CV was on the same 12 cities. Fix: geographic hold-out — Chicago + Miami held out completely.
- Data Architecture Insight: In the 2-release dataset (Jan 2026 = base, Feb 2026 = current), churned places (closed by disappearing) have current = COALESCE(null, prev) = prev, so all delta features are 0 by construction for 93.7% of closed places. This is a structural limitation of 2-release data; a 3rd release would provide legitimate pre-closure deltas.
- Operating Status Note: operating_status = 'closed' appears in only 1–2 places per city in current Overture data. Closures are expressed as churning (disappearance between releases), not explicit status flags, so operating_status alone is not a viable closed label with current Overture data.
- Results: CB+LGBM ensemble on Chicago + Miami hold-out: 89.41% (w_CB=0.7, thresh=0.52).
- Scripts: scripts/research/process_data_v5.py, scripts/research/v5_holdout_eval.py.
- Goal: Break the 2-release structural ceiling (all delta features = 0 for churned places) and fix remaining data leaks.
- 3rd Release: Added Overture 2026-03-18.0 for all 12 cities via scripts/data_processing/build_release_files.py. With 3 releases → 2 consecutive comparison pairs → trajectory features activated.
- Leak Fixes:
  - Double-encoded JSON (to_json() on VARCHAR columns): all digital presence, sources, and recency features were silently zeroed out. Fix: CAST(AS VARCHAR) in step1 SQL.
  - Constructed releases_seen leak: only future churners were force-included in pair 0's open set, making releases_seen=2 a near-perfect proxy for label=0. Fix: also anchor a matching sample of future non-churners so releases_seen=2 occurs for both classes.
  - COALESCE-induced staleness leak: log_days was computed from the COALESCED sources column. Closed places (sources from the prior release) appeared more stale than open places (sources from the current release) by construction. Fix: compute staleness from base_sources for all places.
  - Overture confidence removed: 5 confidence-derived features dropped (external quality signal with unclear provenance).
- City column propagated: _city from release parquets flows through step1 → step2 → step3, enabling city-name holdout (default: Chicago + Miami).
- Results: CatBoost-C: CV 97.15%, hold-out 97.00%. Top features: recency_spread (19.6%), zombie_score (16.7%), recency_pca (11.6%), log_days (9.3%).
- Key Insight: The 89.41% V5 result was partially suppressed by silently zeroed features (the JSON double-encoding bug was present from the start). The true signal in Overture recency metadata is much stronger than previously measured.
- Question: Is the model learning real closure signals, or just detecting "this place is absent from the latest release?"
- Setup: Features built from R0→R1 window only (Jan→Feb). R2 (Mar) used exclusively as a label oracle — label=1 if present in R2, label=0 (HQC) if in R0+R1 but not R2. R2 data never touches the feature matrix.
- Result:
| Metric | Current Pipeline (R2 in features) | Experiment (R2 labels only) | Delta |
|---|---|---|---|
| Balanced Accuracy | 89.29% | 71.02% | −18.3 pp |
| AUC | 95.30% | 80.28% | −15.0 pp |
- Conclusion: The significant drop confirms the model is not simply memorising R2 presence. Genuine predictive signals exist in the R0→R1 feature window (name changes, digital presence shifts, source volatility, recency). The additional ~18 pp in the current pipeline comes from the multi-release feature window giving the model more temporal evidence — not from target leakage. Script: scripts/experiments/exp_predictive_labels.py.
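The oracle-label setup can be sketched as set logic over release ID sets (IDs illustrative): features come only from the R0→R1 window, while R2 supplies nothing but the label:

```python
r0 = {"a", "b", "c"}   # Jan release IDs (feature window)
r1 = {"a", "b", "c"}   # Feb release IDs (feature window)
r2 = {"a"}             # Mar release: used ONLY as a label oracle

# Candidates must have the full R0→R1 trajectory history
candidates = r0 & r1

# label=1 (open) if still present in R2; label=0 (HQC closed) otherwise.
# R2 never contributes features — only this label.
labels = {pid: int(pid in r2) for pid in candidates}
```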
- Goal: Tighten the closed label definition, fix remaining leaks found in a full audit, and improve dataset balance.
- HQC Closed Labels: Redefined closed as places present in 2 consecutive past releases and absent in the next — confirmed churners with trajectory history. This yields 142,931 high-quality closed examples vs. the old 3,000/pair cap that discarded 97% of available signal. Dataset rebalanced globally to 60/40 (357k rows total).
- Leak Fixes:
  - PCA fitted on full dataset: recency_pca was computed before the train/test split, so hold-out data influenced the PCA direction. Fix: PCA now fit on training rows only in step3 after the split; days_latest/days_avg passed as passthrough columns from step2.
  - Hold-out used for optimisation: ensemble weights and threshold were searched over 918 combinations against y_test, then reported as the hold-out accuracy — a form of test-set overfitting. Fix: weights and threshold now chosen via OOF predictions (cross_val_predict on y_train); hold-out used only for final unbiased reporting.
  - Single reference date across pairs: staleness was computed against the newest release date for all rows, making pair-0 places appear ~28 days older than pair-1 places with identical update dates. Fix: recency computed per release_date_current group.
  - Digital presence used post-event values: has_website, num_socials, etc. used COALESCED (R_{i+1}) data for open places but R_i data for churned places — an asymmetric measurement window. Fix: all presence features now use base_* columns (R_i) for both classes.
  - LightGBM missing category feature: LGBM received numeric features only, missing the 7.7%-importance category_primary. Fix: LabelEncoder fitted on training rows; LGBM receives category_encoded.