Open/Closed Place Prediction

Team

Clarice Park
Matthew Kimotsuki

Overview

This repository contains multiple lines of work around predicting whether places are open or closed from Overture-style data and related derived signals.

At a high level, the repo separates two main questions:

How well can we predict open vs. closed using the available data together with the schema-native features we engineered from it?
Can models be trained incrementally across data releases (warm-start) to match full-retrain performance, and what minimum dataset size is required for stable closed-class learning?

The current repo is organized around:

a shared v2 modeling foundation
a ceiling study track focused on whether low- and medium-cost schema-native features could already meet the production gate on the provided Project C sample data
an incremental training / benchmarking track
archived historical labeling, modeling, and exploratory work

In this repo, ceiling study means the main evaluation track for testing whether low- and medium-cost schema-native features alone are enough to reach the project's production gate on the provided Project C sample data. The low, medium, and high cost definitions for these engineered features are documented in docs/ceiling_study/feature_inventory.csv and docs/ceiling_study/feature_rationale.md. The idea was to train on part of that dataset and evaluate on a holdout split to see whether a relatively cheap ML approach was already strong enough for this problem, or whether better performance would likely require more expensive features, a larger training set, or both.

In practice, the study exposed meaningful limitations, especially on closed-place performance. One likely challenge is that the roughly 3k-row sample is fairly spread out, which may make it harder for the model to learn a strong closed signal. As a result, the study suggests that low/medium-cost schema-native features alone may not be sufficient here, but it does not fully resolve whether the main bottleneck is feature cost, dataset size, or both.
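Because the weak point is closed-place performance, it helps to look at closed-class metrics on the holdout split separately from overall accuracy. The following is a minimal sketch, not the repo's evaluation code: the function name closed_class_metrics, the label encoding (1 = closed), and the toy predictions are all assumptions for illustration, and the actual production-gate metrics and thresholds are not stated in this README.

```python
def closed_class_metrics(y_true, y_pred, closed_label=1):
    """Precision, recall, and F1 for the closed class only.

    closed_label=1 is an assumed encoding, not taken from the repo.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == closed_label and p == closed_label)
    fp = sum(1 for t, p in pairs if t != closed_label and p == closed_label)
    fn = sum(1 for t, p in pairs if t == closed_label and p != closed_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy holdout: 4 closed places, 16 open places (made-up values).
y_true = [1] * 4 + [0] * 16
# The toy model finds 1 of the 4 closed places and raises 1 false alarm.
y_pred = [1, 0, 0, 0] + [0] * 15 + [1]

metrics = closed_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, metrics)
```

In this toy case overall accuracy is 0.8 while closed-class recall is only 0.25, which is the kind of gap that per-class reporting makes visible.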

Cumulative training asks whether we can incrementally update a persisted model with new data releases (batches) and achieve performance equivalent to retraining from scratch. It is meant to reproduce a production workflow in which models continue training from saved state across data releases using library-specific warm-start semantics (scikit-learn warm_start, XGBoost xgb_model=prev_booster, LightGBM init_model=prev_model). See docs/cumulative_training/README.md and src/cumulative_training/sf_ny_data/run_incremental_benchmark_sf_ny.py for the dataset driver and persistence details.
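As an illustration of the scikit-learn variant of this workflow, here is a minimal sketch that trains a RandomForestClassifier on one synthetic release, persists it, and then continues training on a second release. It is not the repo's driver: the make_release helper, the synthetic data, the tree counts, and the use of pickle for persistence are assumptions made for the example.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)


def make_release(n_rows):
    """Synthetic stand-in for one data release (hypothetical helper)."""
    X = rng.normal(size=(n_rows, 5))
    # Label 1 plays the role of "closed" and is the minority class here.
    y = (X[:, 0] + 0.5 * rng.normal(size=n_rows) > 1.0).astype(int)
    return X, y


# Release 1: train from scratch with warm_start enabled.
model = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
X1, y1 = make_release(500)
model.fit(X1, y1)
trees_after_first = len(model.estimators_)

# Persist and restore the model, as a later run would do from saved state.
restored = pickle.loads(pickle.dumps(model))

# Release 2: raise n_estimators so fit() adds new trees and keeps the old ones.
X2, y2 = make_release(500)
restored.set_params(n_estimators=100)
restored.fit(X2, y2)
trees_after_second = len(restored.estimators_)

print(trees_after_first, trees_after_second)
```

With a warm-started random forest, the trees added in the second fit are trained only on the second release, so the result is not the same model as a full retrain on both releases combined. That difference is what a comparison against full-retrain performance has to measure.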

Incremental benchmarking tests whether the provided sample dataset is large enough to learn a stable closed-class signal. The procedure is to run models on a canonical small sample (the ~3k Project C sample), split the training pool into stratified batches, and evaluate model performance after each batch to observe whether metrics improve (learning) or remain poor (insufficient data). See src/incremental_benchmarking/run_incremental_benchmark_all_models.py, src/cumulative_training/sf_ny_data/BENCHMARK_SUMMARY.md, and docs/incremental_benchmarking/INCREMENTAL_FINDINGS.md for the experimental code, numeric summaries, and curated conclusions.
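The batch-then-evaluate loop described above can be sketched in plain Python. This is an illustrative outline under stated assumptions, not the code in run_incremental_benchmark_all_models.py: the names stratified_batches and learning_curve, the 10% closed rate, the choice of five batches, and the assumption that the training pool grows cumulatively are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_batches(labels, n_batches, seed=0):
    """Split row indices into n_batches with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    batches = [[] for _ in range(n_batches)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every batch gets its share.
        for position, idx in enumerate(indices):
            batches[position % n_batches].append(idx)
    return batches


def learning_curve(labels, n_batches, fit_and_score, seed=0):
    """Call fit_and_score on the growing training pool after each batch."""
    seen, curve = [], []
    for batch in stratified_batches(labels, n_batches, seed):
        seen.extend(batch)
        curve.append((len(seen), fit_and_score(sorted(seen))))
    return curve


# Toy pool: 3000 rows, 10% labeled closed (label 1); made-up proportions.
labels = [1] * 300 + [0] * 2700
batches = stratified_batches(labels, n_batches=5)
closed_per_batch = [sum(labels[i] for i in batch) for batch in batches]

# Stand-in scorer: counts closed rows seen so far instead of training a model.
curve = learning_curve(labels, 5, lambda rows: sum(labels[i] for i in rows))
print(closed_per_batch, curve)
```

In a real run, fit_and_score would train a model on the given rows and return a closed-class metric on a fixed holdout set. A curve that keeps rising suggests the model is still learning from more data, while a curve that stays flat and low suggests the sample is too small for a stable closed signal.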

Both cumulative training and incremental benchmarking intentionally reuse the models_v2 featurization and transform contracts so that comparisons are fair. See src/models_v2/shared_featurizer.py for the featurizer expectations.

Current Status

shared v2 modeling foundation
- reusable modeling code, featurizers, feature bundles, configs, and evaluation helpers
- main code: src/models_v2/README.md
ceiling study
- active
- purpose: test whether low/medium-cost schema-native features could reach production-level performance on the provided Project C sample split
- data scope: train/holdout evaluation on the provided Project C sample data, not a separately constructed dataset
- current confirmed diagnostic leader: RandomForest, single, v2_rf_single_no_spatial_prior
- main docs: docs/README.md
- main artifacts: artifacts/README.md
incremental training / benchmarking
- active
- purpose: evaluate incremental warm-start workflows (persisted model updates across data releases) and dataset-size sufficiency for stable closed-class learning
- findings summary: see docs/incremental_benchmarking/INCREMENTAL_FINDINGS.md and docs/cumulative_training/README.md
- current landing page: docs/incremental_benchmarking/README.md
- main areas:
  - src/incremental_benchmarking/
  - src/cumulative_training/
archive / historical
- older v1 modeling, labeling, label-coverage, and exploratory work kept for reference

Where To Start

If you want the current ceiling-study path:
- docs/README.md
- src/models_v2/README.md
- artifacts/README.md
- use this path if you want the main answer to: can low/medium-cost schema-native features on the Project C sample data reach the production gate, or would this likely require more data or more expensive features?
If you want the repo-wide status map:
- docs/WORKSTREAMS.md
If you want the incremental-training / benchmarking work:
- docs/incremental_benchmarking/README.md
- src/incremental_benchmarking/
- src/cumulative_training/
- high-level motivation: evaluate whether warm-start workflows can match full-retrain performance and whether the provided (~3k) sample is sufficient to learn a stable closed-class signal

Repository Layout

src/models_v2/
- shared v2 modeling foundation used by the active workstreams
src/incremental_benchmarking/
- current incremental benchmarking work
src/cumulative_training/
- current cumulative-training work
src/archive/models/
- older v1 modeling code kept mainly for reference
docs/
- repo navigation, protocol, rationale, and results summaries
artifacts/
- generated outputs across workstreams
data/
- train/val/test parquet splits and supporting files

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Ceiling-Study Docs

Use these for the current schema-native v2 study.

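This is the main documentation path for the workstream that tests whether the current low/medium-cost schema-native feature setup can achieve production-level performance on the provided Project C sample data. The entry points are docs/README.md, src/models_v2/README.md, and artifacts/README.md, as listed under Where To Start.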

Incremental Workstream Docs

docs/incremental_benchmarking/README.md

Cumulative Training Docs

Use these for the cumulative / incremental training experiments and reproducibility:

docs/cumulative_training/README.md
- dataset-focused run instructions and quick findings for SF/NY experiments
docs/incremental_benchmarking/INCREMENTAL_FINDINGS.md
- curated findings, reproduction notes, and practical recommendations
src/cumulative_training/sf_ny_data/BENCHMARK_SUMMARY.md
- detailed numeric summary and timing notes for the SF/NY alex-filtered experiments
src/cumulative_training/sf_ny_data/run_incremental_benchmark_sf_ny.py
- dataset-specific driver that created the SF/NY incremental results
src/incremental_benchmarking/run_incremental_benchmark_all_models.py
- generic project_c-style incremental driver (per-batch vs single-run comparisons)

Notes

The top-level README is intentionally brief and repo-level.
Track-specific details should live in the track docs rather than in this file.
