This file is read automatically by Claude Code at the start of every session. It explains what this project is, how to work in it, and what conventions to follow. Architecture decisions, pitfalls, and development history are all in this file — no separate ARCHITECTURE.md, DECISIONS.md, or PITFALLS.md exists.
Trading-Crab is a market regime classification and prediction pipeline written in Python.
The core idea: macro-economic time series (quarterly, ~1950–present) are used to label each calendar quarter with a "market regime" (e.g. Stagflation, Growth Boom, Rising-Rate Slowdown) using unsupervised clustering. Those labels then feed supervised models that (a) predict today's regime from currently-available data, (b) predict regime transitions 1–8 quarters forward, and (c) rank asset-class performance within each regime to produce portfolio recommendations.
End goal: a weekly automated report that says "current regime is X, these assets are green, hold / buy / sell."
The algorithm reference lives in legacy/unified_script.py — the original 1249-line
monolith that is ground truth for every formula, parameter choice, and pipeline order.
Do not modify any file in legacy/.
The modular pipeline in src/ and pipelines/ implements everything that script does,
organized more cleanly, with checkpointing, CLI flags, and dedicated plotting notebooks.
Reference submodules — This repo contains two Git submodules used as read-only
references. You may git pull / git submodule update to keep them current, but
never modify or push to them. Use them only to compare implementations and inform
changes to the main repo:
- gsd-scratch-work/ — GSD framework version of the project (earlier checkpoint)
- trading-crab-lib/ — separate trading-crab library repo
trading-crab/
├── CLAUDE.md ← you are here (all dev docs in one place)
├── README.md ← project overview (user-facing)
├── gsd-scratch-work/ ← READ-ONLY submodule (GSD framework version)
├── trading-crab-lib/ ← READ-ONLY submodule (trading-crab library repo)
├── ROADMAP.md ← prioritized feature backlog
├── STATE.md ← current pipeline status and known gaps
├── .env.example ← copy to .env, fill in FRED_API_KEY
├── pyproject.toml ← pip-installable package (src layout)
├── Makefile ← common dev shortcuts
│
├── config/
│ ├── settings.yaml ← ALL tuneable parameters live here
│ └── regime_labels.yaml ← manually-pinned regime names (edit after clustering)
│
├── data/ ← gitignored; created at runtime
│ ├── raw/ ← macro_raw.parquet, asset_prices.parquet
│ ├── processed/ ← features.parquet (after step 02)
│ ├── regimes/ ← cluster_labels.parquet, profiles.parquet, …
│ └── checkpoints/ ← timestamped parquet checkpoints (see CheckpointManager)
│
├── legacy/ ← reference implementation; do not modify
│ └── unified_script.py ← THE reference — all logic must be reachable here
│
├── notebooks/ ← plotting/exploration notebooks (one per pipeline stage)
│ ├── 01_ingestion.ipynb
│ ├── 02_features.ipynb ← gap-fill diagnostics, variance ranking, centered vs causal
│ ├── 03_clustering.ipynb ← PCA, GMM, DBSCAN, Spectral, gap stat, SVD comparison
│ ├── 04_regimes.ipynb ← regime stability, transition heatmaps, HMM comparison
│ ├── 05_prediction.ipynb ← CV diagnostics, model comparison, calibration, interpretability
│ ├── 06_assets.ipynb ← per-regime violin plots, Sharpe table, ETF coverage timeline
│ ├── 07_pairplot.ipynb ← triple-colored pairplots (unsupervised / grok / RF)
│ ├── 08_raw_series.ipynb ← raw series inspection
│ ├── 09_diagnostics.ipynb ← RRG scatter, rolling z-scores, quadrant rotation history
│ ├── 10_model_comparison.ipynb ← KMeans vs GMM vs HMM vs Spectral; soft probabilities
│ ├── 11_feature_selection.ipynb ← RF importance curves, dead-feature detector, what-if re-cluster
│ └── 12_divergence_momentum.ipynb ← divergence z-scores, momentum dashboard, trigger analysis
│
├── pipelines/ ← runnable pipeline steps
│ ├── 01_ingest.py
│ ├── 02_features.py
│ ├── 03_cluster.py
│ ├── 04_regime_label.py
│ ├── 05_predict.py
│ ├── 06_asset_returns.py
│ ├── 07_dashboard.py
│ ├── 08_diagnostics.py ← ratio diagnostics + RRG rotation view
│ └── 09_tactics.py ← per-asset buy_hold / swing / stand_aside
│
├── run_pipeline.py ← backward-compat shim; delegates to trading_crab.pipeline
│
├── requirements.txt ← pinned runtime dependencies (legacy; prefer pyproject.toml extras)
├── requirements-dev.txt ← runtime + dev extras (legacy; prefer pyproject.toml extras)
│
├── scripts/
│ ├── setup.sh ← automated environment setup
│ ├── jupyter_notebook_local.sh ← local notebook launcher helper
│ └── run_weekly_report.py ← weekly report automation (pipeline + archive + email)
│
├── tests/ ← pytest test suite (~769 tests)
│ ├── conftest.py ← shared fixtures (quarterly_index, raw_macro_df, etc.)
│ ├── fixtures/ ← test fixture data (currently empty)
│ ├── integration/
│ │ └── test_mini_pipeline.py ← synthetic end-to-end: steps 2-4, determinism regression
│ ├── test_pipeline_smoke.py ← trading_crab.pipeline smoke tests (build_parser, step dispatch)
│ ├── test_cli_smoke.py ← trading_crab.cli entry-point smoke tests
│ ├── test_pipelines_ingest_features.py ← pipeline steps 1-2 smoke tests
│ ├── test_models_regime.py ← regime classifier tests (bundle API)
│ ├── test_models_boosting.py ← GradientBoosting in bundle API
│ ├── test_models_interpret_tree.py ← interpretability helpers (feature ranking + reduced tree)
│ ├── test_models_behavior.py ← behavior model tests
│ ├── test_models_reporting.py ← metrics aggregation tests
│ ├── test_email_weekly.py ← email delivery + weekly report automation
│ ├── test_scripts_weekly_report.py ← weekly report script (archive, CLI, email)
│ ├── test_constraints_etf_universe.py ← ETF universe validation
│ ├── test_constraints_frequency.py ← data frequency validation
│ └── unit/ ← unit tests for src/ modules
│ ├── test_transforms.py ← engineer_all, gap-fill, derivatives, determinism
│ ├── test_clustering.py
│ ├── test_clustering_exploration.py ← GMM k-sweep, gap statistic, knee detection
│ ├── test_cluster_comparison.py ← pairwise ARI, RF feature importance
│ ├── test_gmm.py
│ ├── test_hmm.py ← GaussianHMM (requires hmmlearn)
│ ├── test_markov.py ← MarkovRegression (requires statsmodels)
│ ├── test_density.py ← DBSCAN + HDBSCAN (hdbscan optional)
│ ├── test_spectral.py
│ ├── test_checkpoints.py ← CheckpointManager + preservation checkpoints
│ ├── test_returns.py
│ ├── test_prediction_flat.py ← flat prediction API (RF, DT, predict_current)
│ ├── test_lightgbm.py ← LightGBM flat API (requires lightgbm)
│ ├── test_ingestion.py ← HTTP-mocked tests for multpl, FRED, assets
│ ├── test_macrotrends.py ← macrotrends.net scraper (mocked)
│ ├── test_diagnostics_rrg.py ← RRG analysis + rolling statistics
│ ├── test_tactics.py ← tactical asset classification
│ ├── test_config.py ← validate_config(), load_portfolio()
│ ├── test_regime.py ← regime profiling + transition matrix
│ ├── test_fred_series_config.py ← FRED settings.yaml validation
│ ├── test_yield_curve_features.py ← yield curve spread features
│ ├── test_divergence.py ← cross-asset divergence features
│ ├── test_momentum.py ← momentum + relative strength features
│ ├── test_indicators.py ← LEI proxy composite indicator
│ ├── test_evaluate_divergence.py ← divergence A/B evaluation script
│ ├── test_evaluate_momentum.py ← momentum A/B evaluation script
│ ├── test_forward_probabilities.py ← empirical forward transition matrices
│ ├── test_confusion_matrix_plot.py ← confusion matrix plotting helpers
│ ├── test_monitoring.py ← pipeline monitoring (steps 1-9)
│ ├── test_init_module.py ← env var path overrides + convenience imports
│ ├── test_reporting.py ← dashboard signals, portfolio, recommendations
│ ├── test_plotting.py ← all plot functions (steps 01–06 + diagnostics)
│ ├── test_runtime.py ← RunConfig defaults, from_args, str, logging
│ └── test_ingestion_completeness.py ← ingestion completeness report (P23)
│
├── outputs/ ← gitignored; created at runtime
│ ├── models/ ← pickled sklearn models
│ ├── plots/ ← saved figures (PNG/PDF)
│ └── reports/ ← dashboard.csv, weekly summaries
│
├── src/trading_crab/ ← app package (pip name: trading-crab)
│ ├── __init__.py ← version + package metadata
│ ├── cli.py ← CLI entry points (tradingcrab, tradingcrab-setup, tradingcrab-publish)
│ └── pipeline.py ← full pipeline orchestration (moved from run_pipeline.py)
│
└── src/trading_crab_lib/ ← library package (pip name: trading-crab-lib)
├── pyproject.toml ← independent pyproject.toml for library sdist
├── __init__.py ← defines ROOT, CONFIG_DIR, DATA_DIR, OUTPUT_DIR
├── config.py ← load() + validate_config(), load_portfolio(), setup_logging()
├── runtime.py ← RunConfig dataclass (verbose, plots, refresh flags)
├── checkpoints.py ← CheckpointManager (save/load/is_fresh/clear)
├── transforms.py ← ratios, log, select, gap-fill, derivatives, engineer_all
├── clustering.py ← reduce_pca, evaluate_kmeans, pick_best_k, fit_clusters
│ + optimize_n_components, compare_svd_pca,
│ + compute_gap_statistic, find_knee_k
├── gmm.py ← fit_gmm (returns scaler), select_gmm_k, gmm_labels, gmm_probabilities
├── hmm.py ← fit_hmm, select_hmm_k, hmm_labels, hmm_probabilities, hmm_transition_matrix
├── markov.py ← fit_markov_switching, markov_labels, markov_probabilities, compare_markov_kmeans
├── density.py ← knn_distances, fit_dbscan_sweep, fit_dbscan, fit_hdbscan_sweep, hdbscan_labels
├── spectral.py ← fit_spectral_sweep (affinity cached), spectral_labels
├── cluster_comparison.py ← compare_all_methods, pairwise_rand_index,
│ extract_rf_feature_importances, recommend_clustering_features
├── regime.py ← build_profiles, suggest_names, build_transition_matrix
├── asset_returns.py ← compute_quarterly_returns, returns_by_regime, rank_assets_by_regime
├── reporting.py ← asset_signals, print_dashboard, save_dashboard_csv, portfolio helpers
├── diagnostics.py ← RRG analysis: rolling_zscore, percentile_rank, normalize_100, compute_rrg
├── tactics.py ← tactical classification: compute_tactics_metrics, classify_tactics
├── email.py ← weekly email: load_email_config, build_weekly_email_body, send_weekly_email
├── divergence.py ← cross-asset divergence features: z-scores, triggers, derivative-space
├── momentum.py ← trailing momentum, relative strength, rolling correlation, CPI acceleration
├── indicators.py ← composite indicators: LEI proxy (UNRATE, T10Y2Y, M2SL, INDPRO, PAYEMS)
├── yield_curve_features.py ← yield curve spread features: 10Y-2Y, 10Y-3M from FRED + multpl
├── ingestion/
│ ├── __init__.py ← ingestion_completeness_report() + CompletenessReport dataclass
│ ├── multpl.py ← lxml scraper for multpl.com series
│ ├── fred.py ← FRED API fetcher with publication-lag shift
│ ├── assets.py ← yfinance ETF price fetcher (3-phase fallback)
│ ├── macrotrends.py ← macrotrends.net JSON scraper (gold, oil, silver back to 1915)
│ └── grok.py ← load external LLM-assisted quarter classifications
├── prediction/
│ ├── __init__.py ← FLAT API: train_current_regime(X,y,cfg), train_decision_tree,
│ │ train_lightgbm, train_forward_classifiers, predict_current
│ ├── classifier.py ← BUNDLE API with FoldReport + GradientBoosting + interpretability
│ │ helpers; backwards-compat layer for tests (see ADR #12 below)
│ └── gradient_boosting.py ← GradientBoostingClassifier helpers used by bundle API
├── plotting/ ← visualization package (re-exports from plotting/__init__.py)
│ ├── __init__.py ← re-exports all plot functions + CUSTOM_COLORS, REGIME_CMAP
│ ├── core.py ← _save_or_show, _regime_color, _in_jupyter, load_or_generate
│ ├── ingestion.py ← plot_raw_series_coverage, plot_raw_series_sample (step 01)
│ ├── features.py ← plot_feature_correlations, plot_gap_fill_before_after,
│ │ plot_feature_variance_ranking, plot_centered_vs_causal (step 02)
│ ├── clustering.py ← plot_elbow_curve, plot_pca_scatter, plot_scree,
│ │ plot_silhouette_samples, plot_gmm_bic_surface (step 03)
│ ├── regime.py ← plot_regime_timeline, plot_transition_matrix,
│ │ plot_soft_probabilities, plot_forward_prob_evolution (step 04)
│ ├── prediction.py ← plot_feature_importance, plot_decision_tree,
│ │ plot_calibration_curve, plot_learning_curve (step 05)
│ ├── assets.py ← plot_asset_returns_by_regime, plot_regime_asset_heatmap (step 06)
│ └── diagnostics.py ← plot_rrg_scatter, plot_divergence_timeseries (steps 08-09)
└── monitoring/ ← pipeline monitoring package (re-exports from monitoring/__init__.py)
├── __init__.py ← re-exports all monitoring functions
├── ingestion.py ← validate_date_range, count_source_columns, format_completeness_table
├── features.py ← compute_feature_quality, FeatureQualityReport
├── clustering.py ← compute_regime_stability, format_method_comparison,
│ RegimeStabilityReport
├── prediction.py ← compute_cv_fold_scores, check_regime_probabilities, CVFoldReport
└── pipeline.py ← validate_step_output, PipelineHealthSummary, format_tactics_summary
This monorepo ships two independent PyPI packages:

| Package | pip name | Contents | Consumers |
|---|---|---|---|
| src/trading_crab_lib/ | trading-crab-lib | All library code: transforms, clustering, prediction, reporting, plotting, ingestion | Other Python projects, notebooks, tests |
| src/trading_crab/ | trading-crab | CLI entry points + pipeline orchestration | End users running the pipeline |
trading-crab depends on trading-crab-lib>=0.1.2. The library has no dependency on the app.
Optional extras (library): [ingestion], [plotting], [hmm], [clustering-extras], [boosting], [all], [dev].
Development install:

```bash
# Install both packages in editable mode with all extras
pip install -e "src/trading_crab_lib/[all,dev]"
pip install -e ".[dev]"

# Or with uv (workspace-aware, installs both automatically):
uv sync
```

Run the full pipeline:

```bash
# Via CLI entry point (after pip install -e .):
tradingcrab --refresh --recompute --plots

# Or via backward-compat shim:
python run_pipeline.py --refresh --recompute --plots

# Run a subset of steps:
tradingcrab --steps 3,4,5,6,7 --plots
```

Run individual steps:

```bash
python pipelines/01_ingest.py
python pipelines/02_features.py
python pipelines/03_cluster.py
python pipelines/04_regime_label.py
python pipelines/05_predict.py
python pipelines/06_asset_returns.py
python pipelines/07_dashboard.py
python pipelines/08_diagnostics.py
python pipelines/09_tactics.py
```

| Flag | Effect |
|---|---|
| `--refresh` | Re-scrape multpl.com + re-hit FRED API (slow, ~10 min) |
| `--recompute` | Recompute features from cached raw data (skips scraping) |
| `--plots` | Generate all matplotlib figures and save to outputs/plots/ |
| `--verbose` | Set logging level to DEBUG |
| `--steps 1,3,5` | Run only the listed step numbers |
| `--no-constrained` | Skip k-means-constrained (if not installed) |
| `--market-code NAME` | Load market_code from grok, clustered, predicted, or any saved checkpoint |
| `--save-market-code` | After step 3, save balanced_cluster as market_code_clustered checkpoint |
| `--show-plots` | Call plt.show() in addition to saving (avoid in headless/CI) |
| `--weekly-report` | Archive weekly_report.md to dated copy + email_body.txt |
| `--refresh-preservation` | Rewrite *_secondary preservation checkpoints even if they exist |
| `--send-email` | Send weekly report via SMTP (requires config/email.local.yaml) |
For notebook work:

```bash
pip install -e ".[dev]"
jupyter lab notebooks/
```

Setup from scratch:

```bash
# 1. Install both packages in editable mode with all extras
pip install -e "src/trading_crab_lib/[all,dev]"
pip install -e ".[dev]"
# Or with uv (workspace-aware):
# uv sync

# 2. Optional but recommended for balanced clustering
pip install k-means-constrained

# 3. Set FRED API key (free at fred.stlouisfed.org/docs/api/api_key.html)
cp .env.example .env
# edit .env: FRED_API_KEY=your_key_here

# 4. Verify
python -c "from trading_crab_lib.config import load; print(load()['data'])"
tradingcrab --help
```

| Package | Purpose |
|---|---|
| fredapi | FRED macroeconomic data |
| lxml | Fast HTML parsing for multpl.com scraper |
| yfinance | ETF/equity price history |
| scipy | BPoly.from_derivatives for gap filling |
| scikit-learn | PCA, KMeans, RandomForest |
| k-means-constrained | Balanced-size clustering (optional) |
| matplotlib / seaborn | All visualization |
| pyarrow | Parquet checkpoint I/O |
Every pipeline step checks CheckpointManager.is_fresh(name) before recomputing.
Checkpoints are stored as parquet files under data/checkpoints/ with a manifest
tracking creation timestamp and config hash. Pass --refresh or --recompute to
force regeneration. This is the most important usability feature for day-to-day
development — re-scraping 46 URLs on every run takes ~10 minutes.
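The guard pattern every step follows can be sketched with a toy in-memory stand-in. `MiniCheckpointManager` and `expensive_step` below are illustrative only — the real `CheckpointManager` persists parquet files plus a manifest with timestamps and a config hash:

```python
class MiniCheckpointManager:
    """Toy in-memory stand-in mirroring the save/load/is_fresh/clear surface."""

    def __init__(self):
        self._store = {}

    def is_fresh(self, name):
        return name in self._store

    def save(self, name, obj):
        self._store[name] = obj

    def load(self, name):
        return self._store[name]

    def clear(self, name=None):
        if name is None:
            self._store.clear()
        else:
            self._store.pop(name, None)


calls = []

def expensive_step():
    calls.append(1)          # stands in for ~10 minutes of scraping
    return {"rows": 300}

cm = MiniCheckpointManager()
for _ in range(2):           # the second pass hits the checkpoint
    if cm.is_fresh("features"):
        out = cm.load("features")
    else:
        out = expensive_step()
        cm.save("features", out)
```

`--refresh` / `--recompute` correspond to skipping the `is_fresh` branch and recomputing unconditionally.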
All runtime behaviour is controlled by a RunConfig dataclass (not hardcoded in
modules). Construct it once in run_pipeline.py or any pipeline step, and pass it
through. Key flags mirror the legacy script:
```python
@dataclass
class RunConfig:
    verbose: bool = False
    generate_plots: bool = False
    generate_pairplot: bool = False           # seaborn pairplot (slow)
    generate_scatter_matrix: bool = False     # pandas scatter_matrix (slow)
    refresh_source_datasets: bool = False     # re-scrape multpl + FRED
    recompute_derived_datasets: bool = False  # recompute features from cached raw
    save_plots: bool = True                   # save figures to outputs/plots/
    show_plots: bool = False                  # plt.show() (use False in CI/headless)
```

GDP (fred_gdp) and GNP (fred_gnp) are shifted +1 quarter in fred.py to prevent
look-ahead bias. The raw BEA release comes ~30 days after quarter end, so at the end
of Q1 you cannot know Q1 GDP. This is set per-series in config/settings.yaml
(shift: true).
1. Cross-asset ratios (10 derived columns: div_yield2, price_gdp, credit_spread, etc.)
2. Log transforms (23 columns → `log_{col}`)
3. Narrow to `initial_features` (36 columns + market_code)
4. Bernstein polynomial gap filling (interior NaNs) + Taylor extrapolation (edges)
5. Smoothed derivatives via `np.gradient` on a day-number time axis (d1, d2, d3 per column)
6. Narrow to `clustering_features` (69 columns + market_code)

Steps 3 and 6 are controlled by the `initial_features` and `clustering_features` lists in
config/settings.yaml. Edit those lists there — not in the Python code.
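A condensed sketch of that order on toy data. The column names and values are illustrative, and `interpolate()` stands in for the Bernstein/Taylor fill — the point is that the fill happens after the log transform, so interior gaps land on the geometric mean rather than overshooting:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [100.0, np.nan, 121.0], "gdp": [10.0, 10.5, 11.0]})

df["price_gdp"] = df["price"] / df["gdp"]    # 1. cross-ratio
logged = np.log(df.clip(lower=1e-9))         # 2. log transform (legacy formula)
filled = logged.interpolate()                # 4. gap fill — in LOG space, after step 2
d1 = np.gradient(filled["price"].to_numpy()) # 5. first-derivative feature

# In log space the filled midpoint is the geometric mean: sqrt(100 * 121) = 110
```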
The legacy analysis established 5 PCA components as the working baseline.
n_pca_components: 5 in settings.yaml. Do not switch to variance-threshold
PCA without benchmarking first — it changes the cluster geometry.
fit_clusters() always returns both cluster (best-k from silhouette, capped at
k_cap) and balanced_cluster (size-constrained at balanced_k). Downstream
steps default to balanced_cluster for regime labeling because equal-size clusters
are better for per-regime statistics with limited data.
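The best-k selection behind `cluster` can be sketched as below. The data is synthetic, the sweep range is shortened, and the way the cap is applied is my reading of "capped at k_cap" — the real `fit_clusters()` sweeps `range(2, 13)` with `n_init=50` and also consults CH/DB scores:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))   # stand-in for the scaled PCA scores

K_CAP = 5
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Best silhouette k, then capped at k_cap (assumed application of the cap)
best_k = min(max(scores, key=scores.get), K_CAP)
```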
All visualization helpers live in src/trading_crab_lib/plotting.py. Notebooks import
from there — they do not define plotting logic inline. Every plot function accepts
run_cfg: RunConfig and honours save_plots / show_plots. Output filenames are
standardized as outputs/plots/{step}_{description}.png.
Five-regime color palette from the legacy script:
```python
CUSTOM_COLORS = ["#0000d0", "#d00000", "#f48c06", "#8338ec", "#50a000"]
```

Use plotting.REGIME_CMAP everywhere for consistency.
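REGIME_CMAP is presumably a listed colormap over this palette; the actual construction lives in plotting/__init__.py, but a minimal sketch of how such a colormap behaves:

```python
from matplotlib.colors import ListedColormap

CUSTOM_COLORS = ["#0000d0", "#d00000", "#f48c06", "#8338ec", "#50a000"]

# One discrete color per regime id 0-4, usable as e.g. scatter(c=labels, cmap=regime_cmap)
regime_cmap = ListedColormap(CUSTOM_COLORS, name="regimes")
```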
The prediction/ subpackage has two modules with deliberately different APIs:
- `prediction/__init__.py` — flat API (production): `train_current_regime(X, y, cfg)` returns a single fitted `RandomForestClassifier`; `train_decision_tree()` returns a `DecisionTreeClassifier`; `predict_current()` returns `{"regime": int, "probabilities": {...}}`. Used by `run_pipeline.py` and `pipelines/05_predict.py`. The `outputs/models/current_regime.pkl` file contains a plain RF.
- `prediction/classifier.py` — bundle API (backwards-compat): `train_current_regime(X, y, cv_splits=N)` returns `{"models": {"rf": ..., "dt": ...}, "cv_reports": {"rf": [FoldReport, ...], ...}, "labels": [...]}`. Used only by `tests/test_models_regime.py` and `tests/test_models_reporting.py`, which assert on per-fold CV indices and aggregate classification-report metrics. Do not use from pipeline code.
See ADR #12 below for the full rationale.
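A sketch of the flat-API contract: the function name and return shape come from this file, but the body and the synthetic data are assumptions, not the real implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def predict_current(model, x_row):
    """Hypothetical restatement of the flat predict_current() contract."""
    proba = model.predict_proba(np.asarray(x_row).reshape(1, -1))[0]
    return {
        "regime": int(model.classes_[proba.argmax()]),
        "probabilities": {int(c): float(p) for c, p in zip(model.classes_, proba)},
    }

rng = np.random.default_rng(0)
X, y = rng.normal(size=(120, 8)), rng.integers(0, 5, size=120)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
out = predict_current(rf, X[-1])
```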
Scraped via lxml cssselect from #datatable. All URLs and value_type metadata
are in config/settings.yaml under multpl.datasets. Do not hardcode URLs in Python.
Rate-limited to 2 seconds per request (RATE_LIMIT_SECONDS).
Current: GDP (shifted +1Q), GNP (shifted +1Q), BAA, AAA, CPI (CPIAUCSL), GS10, TB3MS, VIXCLS, UNRATE, M2SL, M2NS, GS2, T10Y2Y, T10Y3M.
Planned additions (see ROADMAP.md Tier 1):
- HOUST (housing starts), UMCSENT (consumer sentiment)
Requires FRED_API_KEY in .env. Free registration at fred.stlouisfed.org.
Gold spot price back to 1915, WTI crude oil back to 1946, silver, copper.
See ROADMAP.md Tier 1 item 1.5 and src/trading_crab_lib/ingestion/macrotrends.py (to be created).
Scraping approach: extract embedded JSON from <script>var rawData={...}</script> tags.
SPY, GLD, TLT, USO, QQQ, IWM, VNQ, AGG — monthly adjusted close, resampled to
quarterly. Fetched in ingestion/assets.py. No API key required.
data/grok_quarter_classifications_20260216.pickle — an external LLM-assisted
classification of quarters used as a visual reference overlay in notebooks. Not used
for model training. Loaded via ingestion/grok.py (or directly in notebooks).
All tuneable parameters are in config/settings.yaml. Key sections:
| Section | Key parameters |
|---|---|
| data | start_date, end_date, frequency |
| fred.series | per-series name + shift flag |
| multpl.datasets | list of [name, description, url, value_type] |
| features.log_columns | columns to log-transform |
| features.initial_features | columns retained before gap fill |
| features.clustering_features | final columns fed to PCA |
| features.derivative_window | rolling mean window for np.gradient smoothing |
| clustering.n_pca_components | fixed at 5 |
| clustering.n_clusters_search | upper bound for k-sweep (default 12) |
| clustering.k_cap | max k accepted from silhouette (default 5) |
| clustering.balanced_k | k for size-constrained KMeans (default 5) |
| prediction.forward_horizons_quarters | [1, 2, 4, 8] |
| prediction.cv_splits | 5 (TimeSeriesSplit folds) |
| prediction.dt_max_depth | 8 (DecisionTree depth) |
| prediction.rf_max_depth | 12 (RandomForest max depth) |
- The feature pipeline order — cross-ratios → log → select → gap-fill → deriv → select. The Bernstein gap fill must happen AFTER log transform so it interpolates in log space.
- Publication-lag shifts — GDP and GNP must always be shifted. Do not remove without explicit approval.
- The `clustering_features` list — this is analytically determined. Changes here change the clustering geometry and invalidate any manually pinned `regime_labels.yaml`.
- `n_pca_components = 5` — changing this changes which regimes you find. Benchmark first.
- Saving to `.env` or committing API keys — never. Use `.env.example` only.
- The `prediction/__init__.py` flat API — `run_pipeline.py`, `pipelines/05_predict.py`, and `pipelines/07_dashboard.py` all expect `current_regime.pkl` to be a bare `RandomForestClassifier`. Do not change to the bundle-dict API without updating all three consumers. See ADR #12 below.
- Reference submodules — no modifications, no pushes — `gsd-scratch-work/` and `trading-crab-lib/` are Git submodules for reference only. Pulling updates (`git pull` / `git submodule update`) is fine, but never modify files inside them or push to their remotes. Use them to compare implementations and inform changes to the main repo.
Cross-reference legacy/unified_script.py for ground truth on all algorithms.
All items are verified as matching in src/; see STATE.md for known gaps.
- Scraping — lxml `cssselect("#datatable tr")`, user-agent string, 2s rate limit
- FRED — per-series `shift`, quarterly resample with `.last()`
- Cross-ratios — exact 10 formulas (div_yield2, price_div, price_gdp, price_gdp2, price_gnp2, div_minus_baa, credit_spread, real_price2, real_price3, real_price_gdp2)
- Log transform — `np.log(col.clip(lower=1e-9))`
- Gap filling — `BPoly.from_derivatives` with 4 boundary conditions per side (value + d1 + d2 + d3); Taylor extrapolation for leading/trailing edges
- Derivatives — `np.gradient` on matplotlib day-number axis + centered rolling mean of window=5 before and after each gradient call
- PCA — `StandardScaler` → `PCA(n_components=5)` → re-`StandardScaler` before KMeans
- K-sweep — `range(2, 13)` with `n_init=50`, silhouette + CH + DB
- Balanced clustering — `KMeansConstrained(size_min=bucket-2, size_max=bucket+2)`
- Color palette — `["#0000d0", "#d00000", "#f48c06", "#8338ec", "#50a000"]`
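The PCA item above, sketched end-to-end on synthetic data — the shapes mirror the real ~300-quarter × 69-feature matrix, but everything else (random data, exact pipeline wiring) is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 69))   # ~300 quarters x 69 clustering features

# StandardScaler -> PCA(5) -> re-StandardScaler, then KMeans on the rescaled scores
reduce = make_pipeline(StandardScaler(), PCA(n_components=5), StandardScaler())
X5 = reduce.fit_transform(X)
labels = KMeans(n_clusters=5, n_init=50, random_state=0).fit_predict(X5)
```

The second `StandardScaler` matters: without it the first principal component dominates the KMeans distance metric.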
- ✓ Real ETF price data via yfinance (16 ETFs) instead of macro-data proxies
- ✓ `CheckpointManager` with parquet + manifest (vs. ad-hoc pickle/CSV)
- ✓ `RunConfig` dataclass for clean flag management
- ✓ All config in `settings.yaml` (vs. hardcoded Python constants)
- ✓ Full CLI in `run_pipeline.py` with `--steps`, `--refresh`, `--recompute`, etc.
- ✓ Dedicated exploration notebooks (01–08)
- ✓ Clustering investigation suite (GMM, DBSCAN, HDBSCAN, Spectral, gap statistic, SVD)
- Python 3.10+ — use `match`, `|` union types, `X | None` not `Optional[X]`
- Type hints on all public functions
- `logging` everywhere, no `print()` in library code (only in `pipelines/` and `run_pipeline.py`)
- No bare `except:` — always catch specific exception types
- All file paths via `pathlib.Path`, never string concatenation
- DataFrames: noun describing contents (`features`, `pca_df`, `clustered`, `returns`)
- Series: noun describing the single variable (`labels`, `cluster`)
- Functions: verb_noun (`fetch_all`, `apply_log_transforms`, `build_profiles`)
- Config keys: `snake_case` throughout YAML
- Stored under `data/checkpoints/{name}.parquet` (DataFrames) or `{name}.pkl` (models)
- Always prefer parquet over pickle for DataFrames (smaller, typed, readable)
- Pickle only for sklearn models (no parquet-serializable alternative)
- Never commit data files — `data/` and `outputs/` are in `.gitignore`
```bash
pytest tests/ -v
```

Tests live under tests/. Unit tests should not require network access — mock
`requests.get` for scraping tests and FRED API calls. Use fixtures from tests/conftest.py.
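The no-network rule in practice — a minimal sketch with unittest.mock. `fetch_page` is a hypothetical helper; the real tests patch `requests.get` as seen by the ingestion modules:

```python
from unittest import mock

import requests

def fetch_page(url: str) -> str:
    """Hypothetical scraper helper; real code lives in ingestion/multpl.py etc."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

with mock.patch("requests.get") as fake_get:
    fake_get.return_value = mock.Mock(text="<table id='datatable'></table>")
    html = fetch_page("https://www.multpl.com/anything")
```

Note: patch where the name is looked up — for module-level imports in the library that means e.g. `trading_crab_lib.ingestion.multpl.requests.get`, not the global `requests.get`.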
- Conventional format: `feat:`, `fix:`, `refactor:`, `docs:`, `test:`, `chore:`
- Example: `feat: add yfinance asset price ingestion (step 06)`
- Branch: always `claude/description-sessionID` — never push directly to `main`
See STATE.md for a full breakdown of what runs, what's tested, and what output
files are produced. See ROADMAP.md for prioritized feature backlog.
Summary: all 9 pipeline steps run end-to-end on real data. 556 tests collected
(10 skipped: HDBSCAN + cssselect optional). All 5 legacy alignment gaps closed.
Clustering investigation suite (GMM, DBSCAN, Spectral, gap statistic, SVD) fully
implemented. Phase 3 supervised models (RF + DT + GB + forward classifiers) implemented.
New modules: diagnostics (RRG), tactics, email/weekly report. FRED expanded from 7
to 14 series; yield curve features added. ETF universe expanded from 16 to 38.
Diagnostics and tactics integrated as pipeline steps 8-9. Weekly report flow with
--weekly-report + --send-email CLI flags. Interpretability tree in step 5.
- `regime.py` naming heuristics silently skip 4 features (`10yr_ustreas`, `fred_gs10`, `fred_tb3ms`, `div_minus_baa`) because only their derivatives are in `clustering_features`. Graceful fallback is intentional.
- ETF data starts 1993-2006; pre-1993 gold and oil regime analysis uses proxy columns only. macrotrends.net backfill would extend coverage to 1915+ for gold.
- Clustering uses KMeans, which treats each quarter independently; HMM would model temporal autocorrelation natively (Tier 2 roadmap item).
- Standalone `pipelines/*.py` scripts do not use `RunConfig` or `CheckpointManager` — they are simplified entry points without plot generation or checkpoint management. Use `run_pipeline.py --steps N` for full-featured single-step execution.
- `diagnostics.py` and `tactics.py` are not yet integrated into `run_pipeline.py` steps; they are available as library modules for notebooks and custom scripts.
- `email.py` requires `config/email.yaml` (not committed; add to the `.env.example` pattern).
```bash
# Check what checkpoints exist
ls data/checkpoints/

# Run just the clustering step with plots
python run_pipeline.py --steps 3 --plots --verbose

# Reload raw data from pickles (skip re-scraping) and recompute everything
python run_pipeline.py --recompute --plots

# Start fresh (re-scrape multpl + FRED, recompute all)
python run_pipeline.py --refresh --recompute --plots

# Launch notebooks
jupyter lab notebooks/

# Quick sanity check (no network, loads a checkpoint)
python -c "
from trading_crab_lib.checkpoints import CheckpointManager
cm = CheckpointManager()
print(cm.list())
"

# Print current dashboard (requires steps 01-06 to have run)
python pipelines/07_dashboard.py
```

The sections below document the "why" behind key design decisions so future contributors don't accidentally break invariants that look arbitrary.
Step 2 produces two separate parquet files from the same raw data.
- Centered smoothing (`causal=False`) uses both past and future neighbors in each rolling window. Superior for interpolating genuinely missing historical data and characterizing what a regime "looks like" across its full span. Used for: clustering (step 3), regime profiling (step 4).
- Causal smoothing (`causal=True`) uses only past data in every rolling window — exactly what you could compute at the end of a quarter with only information available at that moment. Used for: supervised learning (step 5), live scoring (steps 5-7).
- Critical invariant: training a supervised model on centered features and then scoring "today's" data is look-ahead bias — the model learned patterns that cannot be reproduced in real time.
Column names are identical in both files (intentional). The checkpoint manager uses
"features" vs "features_supervised" keys to distinguish them.
Rejected alternative: single file with a flag column — leads to accidental mixing of centered and causal features when steps share files.
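The centered/causal distinction in two lines of pandas (window of 5 chosen for illustration; the actual smoothing details live in transforms.py):

```python
import pandas as pd

s = pd.Series(range(10), dtype=float)

centered = s.rolling(5, center=True).mean()  # sees future values: fine for labeling history
causal = s.rolling(5).mean()                 # past-only: safe for live scoring

# At index 4 the causal window covers indices 0-4; the centered window at index 2
# already covers indices 0-4, i.e. it "knows" two future points.
```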
n_pca_components: 5 in settings.yaml. Not "keep 90% variance".
- The legacy script established 5 components as the working baseline after experimenting with scree plots on the actual 69-column feature matrix.
- Changing the number of PCA components changes the clustering geometry, which changes cluster
assignments, which invalidates any manually pinned regime names in
config/regime_labels.yaml. - Variance-threshold PCA is non-deterministic across data updates (as more data arrives the cumulative variance curve shifts). Fixed components are reproducible.
When to revisit: if the feature set changes substantially, re-run the scree plot and benchmark silhouette scores for 3, 5, 7, 10 components. Document the new choice here.
We use balanced_cluster (from KMeansConstrained) for all downstream steps, not cluster
(from standard KMeans with best-k from silhouette).
- Per-regime statistics require sufficient samples to be meaningful. Standard KMeans often produces clusters of wildly different sizes (e.g., 70% in one cluster).
- With only ~300 quarters, a cluster of 10 quarters has unreliable mean/std estimates.
- `KMeansConstrained(size_min=bucket-2, size_max=bucket+2)` ensures each regime has ~60 quarters at k=5, giving reliable statistics for all downstream computations.
Tradeoff: balanced clustering slightly distorts cluster geometry — some quarters near a boundary get assigned to a less-natural regime to meet the size constraint. Acceptable: the goal is interpretable regimes with robust statistics, not geometrically pure clusters.
Rejected alternative: hierarchical clustering — doesn't produce equal-size clusters and has no clear stopping rule for k.
Gap fill happens AFTER log transform.
- Raw series (e.g., S&P 500, GDP) are exponential-looking. Interpolating between 1000 and 2000 in linear space overshoots. In log space, the midpoint of [log(1000), log(2000)] = log(1414).
- Bernstein polynomials require 4 boundary conditions per side (value, d1, d2, d3). All three derivatives must also be computed in log space for consistency.
- Invariant: the order is always: cross-ratios → log → select → gap-fill → derivatives → select. Do not move gap fill before log transform.
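The 1000→2000 example, checked numerically — the log-space midpoint is the geometric mean:

```python
import math

lo, hi = 1000.0, 2000.0

mid_linear = (lo + hi) / 2                             # 1500.0 — overshoots for exponential series
mid_log = math.exp((math.log(lo) + math.log(hi)) / 2)  # geometric mean, ~1414.2
```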
Why Bernstein (not cubic spline)? BPoly.from_derivatives exactly matches value + first 3
derivatives at both endpoints — smooth and compatible with derivative features computed afterward.
Cubic splines minimize curvature globally; Bernstein interpolates boundary conditions locally.
For gap filling (usually 1-4 quarters), local is better.
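A minimal sketch of interior gap filling with scipy's `BPoly.from_derivatives`. The endpoint values and derivatives are toy numbers; the real code derives the four boundary conditions from the neighboring quarters:

```python
import numpy as np
from scipy.interpolate import BPoly

# Known points on either side of a 3-quarter gap, each with value + d1 + d2 + d3
x = np.array([0.0, 4.0])
left = [7.00, 0.05, 0.001, 0.0]    # value, 1st, 2nd, 3rd derivative at x=0
right = [7.25, 0.06, -0.002, 0.0]  # same at x=4

poly = BPoly.from_derivatives(x, [left, right])
filled = poly(np.array([1.0, 2.0, 3.0]))   # interpolated gap values
```

By construction the polynomial matches both the values and the first three derivatives at each endpoint, so derivative features computed afterward stay smooth across the fill.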
Use Taylor expansion (not Bernstein) for leading and trailing edge gaps.
Bernstein requires boundary conditions on both sides. For edge gaps (missing data at the start or end of the time series), one side has no neighbors. Taylor extrapolation uses value + d1 + d2 + d3 at the known edge to project outward:
f(x+h) ≈ f(x) + h·f'(x) + (h²/2)·f''(x) + (h³/6)·f'''(x)
This is mathematically consistent with the interior Bernstein approach.
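A minimal sketch of the interior gap fill using scipy's `BPoly.from_derivatives`. The values and derivatives here are illustrative, not the pipeline's actual boundary conditions; note the fill is done in log space and exponentiated back:

```python
import numpy as np
from scipy.interpolate import BPoly

# Gap between two known quarters, worked in log space (illustrative numbers).
x = [0.0, 5.0]                           # quarter offsets at the gap edges
slope = np.log(2000.0 / 1000.0) / 5.0    # consistent d1 at both edges
left = [np.log(1000.0), slope, 0.0, 0.0]     # value, d1, d2, d3
right = [np.log(2000.0), slope, 0.0, 0.0]

poly = BPoly.from_derivatives(x, [left, right])
midpoint = float(np.exp(poly(2.5)))      # back to linear space
```

With these boundary conditions the fill is geometric: the midpoint comes out near sqrt(1000·2000) ≈ 1414, not the 1500 a linear-space interpolation would give.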
Parquet for DataFrames: smaller files (columnar compression), typed (dtypes preserved), human-inspectable (duckdb/pandas/parquet-viewer), no Python version lock-in.
Pickle for sklearn models: sklearn's serialization format is pickle; no parquet-serializable alternative exists for a fitted RandomForestClassifier. Risk: pickle files are Python-version-sensitive. Mitigation: use joblib.dump, which is slightly more stable. (TODO: migrate from pickle.dump to joblib.dump in pipelines/05_predict.py)
fred_gdp and fred_gnp are shifted +1 quarter.
The BEA releases the "advance estimate" of GDP approximately 30 days after quarter end; the
"third estimate" (most revised) comes ~90 days later. At the end of Q1 you cannot know Q1 GDP.
Not shifting introduces look-ahead bias. This is set in config/settings.yaml as shift: true
per series. Invariant: all FRED series with significant revision history and a publication
lag longer than one quarter should be shifted.
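The shift itself is just a one-quarter lag of the series — a minimal sketch with illustrative values (the real shift is applied during ingestion per `settings.yaml`):

```python
import pandas as pd

# Illustrative quarterly series; real values come from FRED ingestion.
idx = pd.period_range("2024Q1", "2024Q4", freq="Q")
gdp = pd.Series([100.0, 101.0, 102.0, 103.0], index=idx, name="fred_gdp")

# shift(+1): at quarter t the feature row carries quarter t-1's GDP,
# matching what was actually publishable at the time.
gdp_shifted = gdp.shift(1)
```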
All runtime behavior is controlled by a single RunConfig object passed through the pipeline,
not by global variables or config file values.
- Avoids action-at-a-distance bugs where a deeply nested module checks a global flag set elsewhere.
- Makes the pipeline deterministic and testable: pass `RunConfig(generate_plots=False)` in tests to skip all matplotlib code without monkeypatching globals.
- The dataclass `from_args()` factory converts an argparse `Namespace` to a `RunConfig` in `run_pipeline.py` — the only place argparse is used.
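A trimmed-down, hypothetical sketch of the pattern (field names beyond `generate_plots` are illustrative — see the real RunConfig for the full set):

```python
import argparse
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    generate_plots: bool = True
    refresh: bool = False

    @classmethod
    def from_args(cls, ns: argparse.Namespace) -> "RunConfig":
        # The only place argparse types appear; everything downstream sees RunConfig.
        return cls(generate_plots=not ns.no_plots, refresh=ns.refresh)

parser = argparse.ArgumentParser()
parser.add_argument("--no-plots", action="store_true")
parser.add_argument("--refresh", action="store_true")

cfg = RunConfig.from_args(parser.parse_args(["--no-plots"]))
```

Tests construct `RunConfig(...)` directly and never touch argparse.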
Always produce both cluster and balanced_cluster, even though only balanced_cluster
is used downstream.
- `cluster` (unconstrained, best-k from silhouette) serves as a geometric reference: if `balanced_cluster` looks very different, the size constraint is distorting natural clusters.
- Having both lets you visually compare in notebooks without re-running clustering.
- The k-sweep silhouette scores that determine `best_k` are saved (data/regimes/kmeans_scores.parquet) for elbow-curve visualization.
initial_features and clustering_features lists live in config/settings.yaml, not
hardcoded in Python. These were analytically determined by examining which series have coverage
back to ~1950 and which derivatives are informative for clustering. Putting them in YAML lets
you experiment without touching Python source code. Invariant: changing clustering_features
changes clustering geometry and invalidates regime_labels.yaml. Delete the old checkpoint
and re-run steps 3-7 before committing.
Notebooks call functions from src/trading_crab_lib/plotting.py; they do not define plotting logic
inline. Reasons: reusability (same plot needed in notebook AND CLI --plots mode), testability
(plotting functions can be tested by mocking matplotlib), consistency (same palette and naming),
DRY (prevents three slightly-different versions of the same chart drifting apart). If you need
a new plot, add it to plotting.py first, then call it from the notebook.
prediction/__init__.py (flat API) and prediction/classifier.py (bundle API) coexist in the
same package but serve different consumers and must not be conflated.
Context: During a GSD-assisted development session (March 2026), an alternative
pipelines/05_predict.py was generated using a "bundle" API returning a dict
{"models": {"rf": ..., "dt": ...}, "cv_reports": {...}}. This made it easy to write tests
asserting on per-fold CV metadata. However, adopting it as the production API would have required
simultaneous changes to:
- `run_pipeline.py` (step5_predict) — imports and uses the flat API
- `pipelines/07_dashboard.py` — loads `current_regime.pkl` assuming a bare `RandomForestClassifier`
Decision: keep the flat API in prediction/__init__.py as production. Create
prediction/classifier.py as a backwards-compatible layer for tests that need to inspect
per-fold FoldReport objects or aggregate classification-report dicts across folds.
Rules that must hold:
- `run_pipeline.py` and all `pipelines/*.py` scripts import from `trading_crab_lib.prediction` (flat API).
- `tests/test_models_regime.py` and `tests/test_models_reporting.py` import from `trading_crab_lib.prediction.classifier` (bundle API).
- `outputs/models/current_regime.pkl` always contains a bare `RandomForestClassifier`.
- Do not "simplify" by merging the two modules — the bundle dict cannot be pickled as `current_regime.pkl` without breaking `07_dashboard.py`.
- If you add a new classifier, add it to the flat API first. Only add bundle-API support in `classifier.py` if a test specifically needs per-fold CV metadata for the new model type.
A collection of traps, anti-patterns, and non-obvious failures discovered during development. Read before making changes.
P1. Using centered rolling windows for supervised learning
Symptom: model accuracy looks great but real-time predictions are wrong. Cause:
rolling(window=5, center=True) uses 2 future quarters in every window — a model trained on
centered features can only be scored on centered features, which requires knowing the future.
Fix: always use features_supervised.parquet (causal=True) for steps 5-7. features.parquet
(causal=False) is for clustering steps 3-4 only. Never swap them.
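The leak is easy to see on a toy series — the centered window's mean at index 4 already includes indices 5 and 6:

```python
import pandas as pd

s = pd.Series(range(10), dtype=float)

# Centered window at index 4 averages indices 2..6 — two of them are in the future.
centered = s.rolling(window=5, center=True).mean()
# Trailing (causal) window at index 4 averages indices 0..4 only.
causal = s.rolling(window=5).mean()
```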
P2. Not applying publication-lag shifts to GDP/GNP
Symptom: model learns to use Q1 GDP to predict Q1 regime label. Fix: shift: true in
config/settings.yaml for fred_gdp and fred_gnp. Any FRED series with significant revision
history and a release lag longer than one quarter must be shifted. Check BEA release calendar.
P3. Using clustering labels as supervised training targets without alignment
Symptom: X and y have different lengths; .dropna() removes extra rows silently. Cause:
clustering runs on features.dropna() which may drop leading rows. Fix — always use index intersection:
```python
common = features.index.intersection(labels.index)
X = features.loc[common].drop(columns=["market_code"], errors="ignore").dropna(axis=1, how="any")
y = labels.loc[common]
```
Never use `iloc[:len(labels)]` — this silently misaligns if any rows were dropped.
P4. Using train_test_split (shuffled) for time-series data
Symptom: CV accuracy is 95%; production accuracy is 60%. Fix: always use
TimeSeriesSplit(n_splits=5). shuffle=False is not enough — you need TimeSeriesSplit
which ensures all training data precedes all test data in each fold.
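A minimal sketch of the ordering guarantee TimeSeriesSplit provides:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(30).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=5).split(X))

# In every fold, all training indices precede all test indices — no future leakage.
leak_free = all(train.max() < test.min() for train, test in splits)
```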
P5. Forward-looking binary classifiers: label alignment
y_future = y.shift(-h) introduces NaN at the end. Current code does
y_future = y.shift(-h).dropna() and then X_aligned = X.loc[y_future.index]. This is correct.
Do not simplify to X.iloc[:len(y_future)].
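The correct pattern in miniature, with toy data (real code works on quarterly indices, but the index-based alignment is the same):

```python
import pandas as pd

y = pd.Series([0, 1, 1, 0, 2])
X = pd.DataFrame({"f": range(5)})

h = 2
y_future = y.shift(-h).dropna()    # label at t is the regime h steps ahead; last h rows drop
X_aligned = X.loc[y_future.index]  # features follow the surviving label index
```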
P6. yfinance "self signed certificate in chain" error
assets.py sets CURL_CA_BUNDLE and SSL_CERT_FILE to certifi.where() at module load.
Do not remove those lines.
P7. multpl.com rate limiting
Never reduce RATE_LIMIT_SECONDS below 2. The --refresh flag should only be used when
genuinely needed. Use checkpoints for development iteration.
P8. X | Y union type syntax on Python < 3.10
Add from __future__ import annotations at the top of every module that uses X | Y syntax.
All src/trading_crab_lib/ files should have this.
P9. contourpy and other transitive deps failing on Python 3.10
requirements.txt uses >= minimum bounds (not exact pins) for direct dependencies only.
Never regenerate with pip-compile --generate-hashes.
P10. k-means-constrained compilation on some platforms
Use the --no-constrained flag which falls back to standard KMeans. The setup.sh script
prompts before attempting installation.
P11. Changing clustering_features invalidates regime_labels.yaml
After any change: (1) delete data/checkpoints/cluster_labels* and
data/regimes/cluster_labels.parquet, (2) re-run steps 3-4, (3) inspect new regime profiles
and update config/regime_labels.yaml, (4) commit the new YAML.
P12. end_date: "2025-09-30" in settings.yaml is hardcoded
Pipeline silently ignores data after that date. Fix: change to null and handle in
ingestion/fred.py and ingestion/multpl.py using datetime.today().
P13. Checkpoint freshness check uses wall-clock time, not data time
cm.is_fresh("macro_raw", max_age_days=7) returns True even if FRED released new data
yesterday. For production, always run with --refresh on Fridays. The weekly cron job
(Tier 3 roadmap) should always pass --refresh.
P14. Silhouette score selects k=2 when data is bimodal
Real macro data often has two dominant modes (growth vs recession) that score highest at k=2.
k_cap: 5 in settings.yaml caps the accepted k at 5. balanced_k: 5 forces 5 balanced
clusters regardless of silhouette result.
P15. PCA re-scaling before KMeans
PCA components are not unit-variance. StandardScaler must be applied AFTER PCA and BEFORE
KMeans. Invariant: features → StandardScaler → PCA(5) → StandardScaler → KMeans.
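The invariant as a sketch, using synthetic data with unequal feature scales (a sklearn Pipeline is used here for brevity; the pipeline code applies the steps individually):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) * np.linspace(1.0, 10.0, 20)  # unequal raw scales

pipe = make_pipeline(
    StandardScaler(),              # equalize raw feature scales
    PCA(n_components=5),           # components come out with unequal variance
    StandardScaler(),              # re-standardize so KMeans weights components equally
    KMeans(n_clusters=5, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)
```

Without the second scaler, the first principal component dominates the Euclidean distances KMeans minimizes.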
P16. plt.show() in headless environments
run_cfg.show_plots = False by default. Only set True via --show-plots locally.
CI/CD pipelines should never pass --show-plots.
P17. Seaborn pairplot is very slow on large feature sets
Pairplot with 69 features generates 69×69 = 4761 subplots. Disabled by default
(generate_pairplot: False in RunConfig). Enable only when specifically investigating
feature relationships.
P18. generate_recommendation() parameter order differs from legacy
Always call with keyword arguments:
`generate_recommendation(target_weights=blended, current_weights=None)`
Never rely on positional argument order for this function.
P19. blended_regime_portfolio() probabilities must sum to ~1.0
Only use prediction["probabilities"] (from the multi-class RF) as input to
blended_regime_portfolio(). Forward classifier probabilities (binary, one per regime) are
independent binary classifiers that do NOT sum to 1.0 — they are not valid blending inputs.
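A hypothetical guard illustrating the check (not a repo function) — multi-class RF probabilities pass, concatenated binary-classifier outputs do not:

```python
def assert_valid_blend_input(probabilities: dict, tol: float = 0.01) -> None:
    # Hypothetical guard: blending inputs must form a probability distribution.
    total = sum(probabilities.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total:.3f}, expected ~1.0")

# Multi-class RF output: valid blending input.
assert_valid_blend_input({"Stagflation": 0.2, "Growth Boom": 0.5, "Slowdown": 0.3})
```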
P20. Running pytest no longer corrupts the macro_raw checkpoint — FIXED
tests/test_pipelines_ingest_features.py uses monkeypatch.setattr(step, "DATA_DIR", tmp_path)
to redirect all file I/O to pytest's temporary directory. No production checkpoint files are
touched during pytest.
P21. make_behavior_labels uses strict inequalities — exactly-at-threshold is "flat"
r > up_threshold and r < down_threshold (strict). With both thresholds at 0.0:
r > 0: "up" | r < 0: "down" | r == 0: "flat"
This is intentional. Do not change to >= / <= — the test suite verifies strict behavior.
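The rule on a single value (a hypothetical scalar sketch — the real `make_behavior_labels` operates on a Series):

```python
def behavior_label(r: float, up_threshold: float = 0.0, down_threshold: float = 0.0) -> str:
    # Strict inequalities: a return exactly at a threshold falls through to "flat".
    if r > up_threshold:
        return "up"
    if r < down_threshold:
        return "down"
    return "flat"
```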
P22. SSL verification is disabled in ingestion/assets.py
Uses curl_cffi.requests.Session(verify=False) unconditionally — susceptible to MITM on price
data. Planned fix: add a RunConfig / settings flag to control SSL verification, defaulting
to secure.
P23. Partial ingestion silently produces plausible-looking outputs
Ingestion failures are caught and logged at WARNING level but the pipeline continues with
whatever data was successfully fetched. Check macro_raw.parquet column count after ingestion;
should be ~53 columns. Planned fix: add ingestion completeness report.
P24. CheckpointManager.list() silently ignores corrupt metadata files
Catches all JSON parse errors without logging which file failed. Fix: log at WARNING which file failed to parse before continuing.
P25. Committed data artifacts in data/ can create stale-data bugs
data/fred_api_datasets_snapshot_20260216.pickle, data/multpl_datasets_snapshot_20260216.pickle,
data/grok_quarter_classifications_*.pickle — if the pipeline accidentally loads these instead
of freshly-fetched data, results are silently based on Feb 2026 snapshots. Planned fix: move to
data/archives/ with explicit documentation, or move small fixtures to tests/fixtures/.
P26. FRED ingestion hard-fails when FRED_API_KEY is missing
fred.py's fetch_all() calls fredapi.Fred(api_key=...) which raises if the key is None.
Fix: copy .env.example to .env and add your free key from fred.stlouisfed.org.
P27. Pickle files are an arbitrary-code-execution risk
outputs/models/current_regime.pkl and all other pickles execute arbitrary code on load.
Never load a pickle file whose provenance you cannot verify. Planned fix: migrate sklearn model
serialization to joblib.dump / joblib.load.
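The planned round-trip in sketch form. Note joblib is still pickle underneath — it improves numpy-array handling, not the trust model, so the provenance rule above still applies:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=60, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "current_regime.pkl"
    joblib.dump(model, path)       # still pickle under the hood — verify provenance before loading
    restored = joblib.load(path)

round_trip_ok = bool((model.predict(X) == restored.predict(X)).all())
```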
A chronological log of judgment calls that don't rise to the level of a formal ADR but are important for future contributors to know about.
The GSD-generated 05_predict.py (bundle API) was reviewed and rejected. The existing
pipelines/05_predict.py (flat API) is canonical. Adopting the GSD version would have required
simultaneous changes to run_pipeline.py and pipelines/07_dashboard.py with no immediate
benefit. What WAS adopted: the monkeypatch fix in pipelines/02_features.py — changed direct
import of engineer_all to a module-level reference so monkeypatch.setattr works in tests.
src/trading_crab_lib/prediction.py was converted to a package so new test files could import from
trading_crab_lib.prediction.classifier. Split: existing flat-API content moved intact to
__init__.py; new classifier.py created with bundle API. See ADR #12.
Changed r >= up_threshold / r <= down_threshold to r > up_threshold / r < down_threshold.
With both thresholds at 0.0, a return of exactly 0.0 was incorrectly classified as "up".
Impact: extremely rare on real price data; only affects synthetic test data.
GSD wrappers adding --refresh, --verbose, --market-code CLI flags to standalone pipeline
scripts were reviewed but not applied. run_pipeline.py --steps 1,2 already provides the same
functionality. Revisit if step1_ingest() and step2_features() are significantly changed.
tests/test_pipelines_ingest_features.py redirects all file I/O to pytest's tmp_path fixture
using monkeypatch.setattr(step, "DATA_DIR", tmp_path). No production checkpoint files are
written during pytest. The --recompute workaround after test runs is no longer needed.
These scripts represented an alternative pipeline design explored via the GSD framework.
They were deleted in commit bc3bc1b as they cluttered the repo. The decisions about which
changes were adopted are documented in D1 and D4 above.
legacy/unified_script.py is the algorithm ground truth. When implementing a remaining gap,
refer to unified_script.py, not any modular legacy files, to avoid inconsistencies.
Added VIXCLS, UNRATE, M2SL, M2NS, GS2, T10Y2Y, T10Y3M to config/settings.yaml.
All new series use shift: false (no publication-lag shift needed — these are released
with minimal delay). end_date changed from hardcoded "2025-09-30" to null (P12 fix).
New src/trading_crab_lib/yield_curve_features.py module with add_yield_curve_features().
Computes 10Y-2Y and 10Y-3M spreads from multpl.com treasury columns and/or FRED columns
(GS10-GS2, T10Y2Y, T10Y3M). Hooked into engineer_all() in transforms.py after
cross-ratios step. Does not affect clustering_features list — spreads are available for
analysis but must be explicitly added to the feature lists to influence clustering.
classifier.py now supports include_gb=True on train_current_regime() and
train_forward_classifiers(). Uses GradientBoostingClassifier (sklearn, not LightGBM)
for zero-dependency convenience. The flat API in prediction/__init__.py is NOT changed —
production still uses bare RF. GB is bundle-API-only for comparative testing.
extract_top_features(model, feature_names, top_k) ranks features by importance.
train_interpretability_tree(X, y, model, top_k, max_depth) trains a shallow DT on
only the most important features for human-readable decision rules. Both are in
classifier.py (bundle API side) since they're analysis tools, not production inference.
Three new library modules created from GSD Phase 6-8 designs:
- `diagnostics.py` — Relative Rotation Graph (RRG) analysis. `compute_rrg()` classifies assets into LEADING/WEAKENING/LAGGING/IMPROVING quadrants based on relative strength and momentum vs benchmark. Also provides `rolling_zscore()`, `percentile_rank()`, `normalize_100()`.
- `tactics.py` — Tactical asset classification. `compute_tactics_metrics()` computes volatility, trend slope, and benchmark correlation. `classify_tactics()` assigns buy_hold / swing / stand_aside based on vol + trend thresholds.
- `email.py` — Weekly email delivery. `load_email_config()` reads `config/email.yaml`, `build_weekly_email_body()` composes from report files, `send_weekly_email()` sends via SMTP (TLS or SSL). Paired with `scripts/run_weekly_report.py` for full automation.
These are library modules only — not yet integrated as pipeline steps. Use from notebooks
or scripts/run_weekly_report.py.
Added 56 new tests covering previously untested modules:
- `config.load_portfolio()` (4 tests), `regime.py` (5 tests), FRED config validation (1 test)
- Flat prediction API (5 tests), GradientBoosting (2 tests), interpretability (2 tests)
- Ingestion HTTP-mocked tests: multpl (6), FRED (5), assets (4)
- Diagnostics/RRG (8 tests), tactics (7 tests), email/weekly report (15 tests)
- Yield curve features (2 tests)
Coverage gaps closed: prediction/__init__.py, config.load_portfolio(), regime.py,
all three ingestion modules, and three new modules.
Test coverage for previously untested modules — 68 new tests:
- `reporting.py` (15 tests): dashboard signals, portfolio construction, recommendations, recommendation digest, weekly report
- `plotting.py` (20 tests): all plot functions (steps 01–06), `_save_or_show`, `_regime_color`, constants, empty-input edge cases
- `runtime.py` (25 tests): defaults, `from_args()` with all flag combinations, `apply_logging()`, `__str__()` representation
- Ingestion completeness report (8 tests): missing columns, high-NaN detection, summary formatting
P27 fix — pickle → joblib migration across 7 files:
- `checkpoints.py`: `save_model()` / `load_model()` now use `joblib.dump` / `joblib.load`
- `pipelines/05_predict.py`, `pipelines/07_dashboard.py`, `run_pipeline.py`: all model serialization switched from `pickle` to `joblib`
- `cluster_comparison.py`: RF feature importance loading via `joblib.load`
- `tests/unit/test_cluster_comparison.py`: test fixture uses `joblib.dump`
- `requirements.txt`: added `joblib>=1.3`
P24 fix — CheckpointManager corrupt metadata logging:
- `is_fresh()`: catches `json.JSONDecodeError` / `KeyError` / `ValueError` and logs a WARNING with the specific file and error before returning `False`
- `list()`: catches all metadata parse errors and logs a WARNING with the file name
P23 fix — Ingestion completeness report:
- New `ingestion_completeness_report()` in `src/trading_crab_lib/ingestion/__init__.py`
- Returns a `CompletenessReport` dataclass with missing columns, extra columns, high-NaN columns
- Integrated into `pipelines/01_ingest.py` and `run_pipeline.py` step 1
- Builds expected column list from config (FRED + multpl + macrotrends)
Test count: 301 → 428 collected (11 skipped: HDBSCAN + cssselect optional). All previously untested modules now have test coverage.
Atomic rename of the Python package directory src/market_regime/ → src/trading_crab_lib/
plus ~438 import references across 89 files. pip package name: trading-crab-lib.
See RENAME_PLAN.md for the full rename strategy. market_code (a DataFrame column name)
was NOT renamed — it is a data concept, not a package reference.
Compared gsd-scratch-work/ and trading-crab-lib/ against the main
repo. Both submodules are earlier snapshots — the main repo is strictly ahead. No GSD-only
functionality needs porting. Key differences:
- GSD has 7 FRED series; main has 14
- GSD has 28 test files; main has 35+
- GSD lacks: LightGBM, macrotrends scraper, ingestion completeness report, forward probabilities, confusion matrix plot, diagnostics/tactics/email modules

Submodules remain as read-only references only.
ROADMAP item 2.15 — Phases A+B complete. New src/trading_crab_lib/divergence.py module:
- `compute_rolling_correlation()`: trailing Pearson correlation between signal pairs
- `compute_divergence()`: short-window vs long-window correlation departure (raw, abs, z-score)
- `compute_divergence_triggers()`: binary triggers when |z-score| > threshold, plus direction
- `compute_derivative_divergence()`: divergence in d1 (derivative) space for leading indicators
- `add_divergence_features()`: master wrapper, config-driven signal pairs and windows
Hooked into engineer_all() in two places: (1) level-space divergence after momentum features
(before log transforms), (2) derivative-space divergence after derivatives are computed.
Default pairs: SPY/TLT, SPY/GLD, GLD/Oil, CreditSpread/VIX. Config: features.divergence.
Per pair: 5 level columns + 3 derivative columns = 8 features.
29 tests in tests/unit/test_divergence.py.
Phases C (add to clustering/supervised feature lists) and D (evaluate impact) deferred.
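The core idea in sketch form — `divergence_zscore` is a hypothetical condensation of the compute_rolling_correlation → compute_divergence chain, with illustrative window sizes and random data:

```python
import numpy as np
import pandas as pd

def divergence_zscore(a: pd.Series, b: pd.Series, short: int = 4, long: int = 16) -> pd.Series:
    # How far the short-window correlation has departed from its
    # long-window norm, expressed as a rolling z-score.
    raw = a.rolling(short).corr(b) - a.rolling(long).corr(b)
    return (raw - raw.rolling(long).mean()) / raw.rolling(long).std()

rng = np.random.default_rng(0)
a = pd.Series(rng.normal(size=60))
b = pd.Series(rng.normal(size=60))
z = divergence_zscore(a, b)
```

A trigger is then just `z.abs() > threshold` with the sign of `z` giving direction.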
ROADMAP item 2.12 implementation. New src/trading_crab_lib/momentum.py module with:
- `compute_trailing_momentum()`: 2Q, 4Q, 8Q trailing returns for major series
- `compute_relative_strength()`: S&P-in-Gold, S&P-in-Oil, Gold-in-Oil ratios
- `compute_rolling_cross_correlation()`: rolling 8Q correlation between signal pairs
- `compute_inflation_acceleration()`: 2nd derivative of CPI

Hooked into `engineer_all()` in `transforms.py`. Features are available for analysis; they must be explicitly added to feature lists in `settings.yaml` to influence clustering.
Phase C — Added divergence features to settings.yaml feature lists:
- `initial_features`: added `sp500` (raw level for derivative-space), `div_spy_tlt_z_4q`, `div_spy_tlt_trigger`, `div_cred_vix_z_4q`, `div_cred_vix_trigger`
- `clustering_features`: added 10 divergence columns (level z-scores + d1 derivatives + triggers + derivative-space z-scores + triggers for spy_tlt and cred_vix pairs)
- Fixed `fred_vixcls` → `fred_vix` column name in `DEFAULT_DIVERGENCE_PAIRS`
Phase D — Evaluated impact via scripts/evaluate_divergence.py:
- Clustering improved: silhouette +0.032 (0.189→0.221), CH +6.8, DB −0.10. Improvement consistent across k=2–6 in sweep; strongest at k=5 (+0.046 silhouette)
- Supervised accuracy: −0.018 mean CV accuracy (within noise, 36.4% → 34.6%). However, `div_cred_vix_z_4q_d1` ranks 5th/80 in RF feature importance — the signal is there but may need feature selection or more data to improve CV generalization
- Transition detection: SPY-TLT z-score is 36% higher at regime transitions (0.92) vs baseline (0.67) — confirmed as a leading indicator. Other pairs inconclusive with current data
- Recommendation: keep in clustering (clear win). For supervised, defer until gold/oil data activates additional pairs, or apply feature selection to prune noisy divergence columns
Architectural change: ETF price ingestion moved from step 6 to step 1. Prices are now
fetched alongside FRED/multpl/macrotrends data and cached as the asset_prices checkpoint.
Step 6 reuses cached prices instead of re-fetching (unless --refresh-assets is passed).
Why: ETF price derivatives (d1, d2) of major asset classes are informative for regime classification. Moving ingestion to step 1 makes ETF data available for step 2 feature engineering alongside macro data.
New data flow:
- `_fetch_and_cache_asset_prices()`: extracted from step 6, now called in step 1
- `_merge_asset_prices_into_raw()`: merges a curated ETF subset (SPY, TLT, GLD, QQQ, VNQ) into `macro_raw` as `etf_spy`, `etf_tlt`, etc. columns
- Config: `features.asset_price_columns` controls which tickers merge into macro_raw
Feature list additions (config/settings.yaml):
- `log_columns`: added `gold_spot`, `wti_crude`, `etf_spy`, `etf_tlt`, `etf_gld`, `etf_qqq`, `etf_vnq`
- `initial_features`: added `log_gold_spot`, `log_wti_crude`, `log_etf_*` (available for supervised learning)
- `clustering_features`: added `log_gold_spot_d1`, `log_gold_spot_d2`, `log_wti_crude_d1`, `log_wti_crude_d2`
Key decision: ETF derivatives (e.g. log_etf_spy_d1) are intentionally NOT in
clustering_features. ETFs start 1993–2004 (~80 quarters), while clustering uses 305 quarters
back to 1950. Adding ETF derivatives to clustering features would force dropna() to discard
all pre-1993 rows, losing 55 years of regime history. Gold (1915+) and oil (1946+) from
macrotrends have enough history for clustering. ETF derivatives remain available for supervised
learning via features_supervised.parquet.
Divergence auto-activation: spy_gld and gld_oil divergence pairs (in
DEFAULT_DIVERGENCE_PAIRS) will now auto-activate once macrotrends data populates
gold_spot and wti_crude columns in macro_raw.
Phase C — Added momentum features to settings.yaml feature lists:
- `initial_features`: added `sp500_mom_4q`, `sp500_mom_8q`, `10yr_ustreas_mom_4q`, `credit_spread_mom_4q`, `corr_sp500_10yr_ustreas_8q`, `cpi_acceleration`
- `clustering_features`: added 11 momentum columns (raw rate-like values + d1 derivatives): `sp500_mom_4q`, `sp500_mom_4q_d1`, `sp500_mom_8q`, `10yr_ustreas_mom_4q`, `10yr_ustreas_mom_4q_d1`, `credit_spread_mom_4q`, `credit_spread_mom_4q_d1`, `corr_sp500_10yr_ustreas_8q`, `corr_sp500_10yr_ustreas_8q_d1`, `cpi_acceleration`, `cpi_acceleration_d1`
- Fixed `fred_vixcls` → `fred_vix` column name in `default_mom_cols` (line 210)
Phase D — Evaluation script scripts/evaluate_momentum.py:
- Same A/B methodology as divergence evaluation: compare clustering quality, supervised accuracy, and transition detection with/without momentum features
- 20 tests in `tests/unit/test_evaluate_momentum.py`
- Requires checkpoint data from pipeline steps 1-3 to run evaluation
All momentum features have deep history (1950+), so they are safe for
clustering_features without dropping pre-1993 rows.
Two new regime detection modules implementing ROADMAP items 2.9 and 2.13:
- `src/trading_crab_lib/hmm.py` — GaussianHMM regime detection via `hmmlearn`. API mirrors the GMM module: `fit_hmm()` sweeps k with best-of-N restarts, returns scores + models + scaler. `select_hmm_k()` picks best k via BIC. `hmm_labels()` returns Viterbi-decoded hard state assignments (canonicalized). `hmm_probabilities()` returns forward-backward posterior probabilities. `hmm_transition_matrix()` extracts the learned transition matrix. Key advantage over KMeans: models temporal autocorrelation — P(state_t | state_{t-1}) is estimated directly.
- `src/trading_crab_lib/markov.py` — Markov regime-switching via `statsmodels.MarkovRegression`. Fits a switching-mean model on univariate macro series (e.g., GDP growth) for 2-state recession/expansion classification. `compare_markov_kmeans()` cross-tabulates Markov labels against KMeans regimes to answer "which KMeans regimes are recessions?"
Both modules are library-only (not integrated into pipeline steps). Use from
notebooks or comparison scripts. Both are optional dependencies — graceful
ImportError with install instructions if hmmlearn or statsmodels missing.
Tests skip via pytest.mark.skipif when libraries unavailable.
New dependencies: hmmlearn>=0.3, statsmodels>=0.14 (added to requirements.txt
and pyproject.toml).
37 new tests (19 HMM + 18 Markov). Total: 533 collected, all passing.
New src/trading_crab_lib/monitoring.py module with pipeline validation helpers:
- C1.1 — `format_completeness_table(report)`: Enhanced formatting of the existing `CompletenessReport` with a per-column NaN bar chart showing the worst offenders. Replaces plain `report.summary()` in step 1 logging.
- C1.2 — `validate_date_range(df)`: Checks whether the DataFrame extends to the current quarter. Returns a `DateRangeReport` with `quarters_behind`, per-column staleness detection, and pass/fail status. Warns if data is >1 quarter behind or if individual series have stopped updating.
- C1.3 — `count_source_columns(df, cfg)`: Counts columns grouped by data source (FRED, multpl, macrotrends, ETF, other) using config to identify provenance. Returns a `SourceRowCounts` dataclass with formatted summary.
- C1.4 — `compute_feature_quality(df)`: Computes NaN counts per column, top-5 highest-variance features, and top-5 highest-correlation pairs. Returns a `FeatureQualityReport` with formatted summary. Wired into step 2 in both `run_pipeline.py` and `pipelines/02_features.py`.
- C1.5 — Gap-fill before/after plots: `_generate_gap_fill_plots()` helper in `run_pipeline.py` generates `plot_gap_fill_before_after()` for 3 sample columns (`log_sp500`, `log_us_cpi`, `log_10yr_ustreas`) when `--plots` is passed. Builds a pre-gap-fill snapshot by replaying cross-ratios → log → select without gap fill.
All monitoring wired into run_pipeline.py (steps 1-2) and standalone pipeline scripts.
23 tests in tests/unit/test_monitoring.py. Total: 556 collected, all passing.
Fixed email.py key mismatch: code now uses from_address/to_address (matching
config/email.example.yaml and GSD convention) instead of sender/recipients.
Env var support: Email config can now be set entirely via TC_* environment
variables without any YAML file. Env vars override YAML values when both are present.
Supported: TC_SMTP_HOST, TC_SMTP_PORT, TC_SMTP_USER, TC_SMTP_PASSWORD,
TC_EMAIL_FROM, TC_EMAIL_TO, TC_EMAIL_USE_TLS, TC_EMAIL_USE_SSL.
Strict validation: load_email_config() now validates required keys at load time
(fail-fast) instead of waiting until send_weekly_email() is called.
Weekly report guard: scripts/run_weekly_report.py now skips the email send
entirely when weekly_report.md doesn't exist, preventing the confusing error cascade.
Secrets protection: Added portfolio.local.yaml to .gitignore;
trading-crab-lib added to MANIFEST.in prune list.
Setup automation: scripts/setup.sh now copies email.example.yaml →
email.local.yaml as part of setup (GSD pattern).
21 tests in tests/test_email_weekly.py (8 new for env vars + validation).
New monitoring functions in monitoring.py + wiring into run_pipeline.py:
- C2.1 — Scree + PCA loadings plots: `plot_scree()` and `plot_pca_loadings()` wired into `step3_cluster()` when `--plots` is passed.
- C2.2 — Silhouette samples plot: `plot_silhouette_samples()` wired into `step3_cluster()` when `--plots` is passed.
- C2.3 — Method comparison table: `format_method_comparison()` in `monitoring.py` formats a clustering comparison DataFrame (method, k, silhouette, DB, CH) as a readable table. Compares KMeans (best-k) vs KMeans (balanced) via `compare_all_methods()`. Logged at INFO + `plot_method_comparison_table()` on `--plots`.
- C2.4 — Regime stability summary: `compute_regime_stability()` in `monitoring.py` extracts persistence probabilities from the transition matrix diagonal, identifies most/least stable regimes, and computes average consecutive run length per regime. Returns a `RegimeStabilityReport` dataclass. Wired into `step4_regime_label()`.
- C2.5 — Feature-regime overlay plots: `plot_feature_regime_overlay()` for 4 key indicators (`log_sp500_d1`, `log_us_cpi_d1`, `credit_spread`, `10yr_ustreas_d1`) wired into `step4_regime_label()` when `--plots` is passed.
10 new tests in tests/unit/test_monitoring.py (total: 33). Total: 566 collected, all passing.
New monitoring functions in monitoring.py + wiring into run_pipeline.py:
- C3.1 — Per-fold CV accuracy table: `CVFoldReport` dataclass and `compute_cv_fold_scores()` run TimeSeriesSplit CV on fitted models (via `sklearn.base.clone`) and return per-fold accuracies. Wired into `step5_predict()` for RF, DT, and LGBM (when available). Logged as a formatted table with mean ± std.
- C3.2 — CV fold accuracy + decision tree plots: `plot_cv_fold_accuracy()` for both RF and DT, plus `plot_decision_tree()` wired into `step5_predict()` when `--plots` is passed.
- C3.3 — Calibration curve + model comparison bar: `plot_calibration_curve()` using RF's `predict_proba()` output, and `plot_model_comparison_bar()` comparing RF vs DT (vs LGBM) mean CV accuracy. Wired into `step5_predict()` when `--plots`.
- C3.4 — Forward probability evolution plot: `plot_forward_prob_evolution()` wired into `step7_dashboard()` when `--plots`. Uses `compute_forward_probabilities()` from `regime.py` to compute empirical forward transition matrices at horizons [1Q, 4Q, 8Q].
- C3.5 — Dashboard QA gate: `check_regime_probabilities()` in `monitoring.py` warns if any regime has <5% predicted probability (suspiciously low — may indicate model overconfidence or degenerate clustering). Wired into `step7_dashboard()` before `print_dashboard()`.
9 new tests in tests/unit/test_monitoring.py (total: 42, 2 skipped without sklearn).
New monitoring functions in monitoring.py + wiring into run_pipeline.py:
- C4.1 — RRG scatter plot: `plot_rrg_scatter()` wired into `step8_diagnostics()` when `--plots` is passed and RRG data is available.
- C4.2 — Tactics summary: `format_tactics_summary()` in `monitoring.py` formats a count of buy_hold/swing/stand_aside per asset with percentage bars. Wired into `step9_tactics()`.
- C4.3 — Step output validation: `validate_step_output(step_num, outputs)` checks DataFrame shape, NaN fraction per column (warns if >50%), and dtype presence. Returns a `StepValidation` dataclass with pass/fail per check. Available as a library function for pipeline steps to call on their outputs.
- C4.4 — Step timing: The main loop in `main()` now tracks elapsed time per step using `time.monotonic()`. Each step prints elapsed seconds on completion.
- C4.5 — Pipeline health summary: `PipelineHealthSummary` dataclass tracks step timings and completed vs failed steps. Printed at the end of the pipeline run as a formatted table showing per-step timing and pass/fail status.
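The C4.4/C4.5 timing-plus-summary pattern can be sketched as below. This is an assumed shape, not the real `run_pipeline.py` code: the `results` dict layout, `record()`, `table()`, and the `run_steps()` driver are all hypothetical; only the names `PipelineHealthSummary` and the use of `time.monotonic()` come from the log above.

```python
import time
from dataclasses import dataclass, field

@dataclass
class PipelineHealthSummary:
    # Hypothetical layout: step number -> (elapsed seconds, succeeded?)
    results: dict[int, tuple[float, bool]] = field(default_factory=dict)

    def record(self, step: int, elapsed: float, ok: bool) -> None:
        self.results[step] = (elapsed, ok)

    def table(self) -> str:
        """Formatted per-step timing / pass-fail table, printed at end of run."""
        lines = ["step  seconds  status"]
        for step, (elapsed, ok) in sorted(self.results.items()):
            lines.append(f"{step:>4}  {elapsed:>7.2f}  {'PASS' if ok else 'FAIL'}")
        return "\n".join(lines)

def run_steps(steps: dict, summary: PipelineHealthSummary) -> None:
    for num, fn in sorted(steps.items()):
        start = time.monotonic()  # monotonic clock: immune to wall-clock adjustments
        ok = True
        try:
            fn()
        except Exception:
            ok = False
        summary.record(num, time.monotonic() - start, ok)
```

`time.monotonic()` is the right clock here because NTP corrections or DST changes during a long run would corrupt `time.time()` deltas.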
14 new tests in tests/unit/test_monitoring.py (total: 56, all passing).
C6.1 — Env var path overrides: __init__.py now checks TC_ROOT_DIR, TC_CONFIG_DIR,
TC_DATA_DIR, TC_OUTPUT_DIR environment variables at import time. If set, the env var
path wins; otherwise the default repo-relative path is used. Useful for Docker, CI, or
custom data directory layouts.
C6.2 — Convenience re-exports: trading_crab_lib.load(), trading_crab_lib.load_portfolio(),
trading_crab_lib.RunConfig, and trading_crab_lib.CheckpointManager are now accessible
directly from the package root. RunConfig and CheckpointManager use lazy __getattr__
to avoid circular imports at module load time.
C6.3 — pyproject.toml metadata: Added `License :: OSI Approved :: MIT License` and
`Operating System :: OS Independent` classifiers; added a Changelog URL pointing to STATE.md.
C6.4 — CLI entry point: Deferred. python run_pipeline.py is sufficient for now.
C6.5 — Tests: 15 tests in tests/unit/test_init_module.py (1 skipped without joblib).
Covers: all 4 env var overrides, cascade behavior (TC_ROOT_DIR flows to DATA_DIR/OUTPUT_DIR
when individual vars are unset), precedence (TC_CONFIG_DIR overrides TC_ROOT_DIR-derived path),
_resolve_dir() helper, and all 4 convenience imports + invalid attribute error.
Preservation checkpoints are wide parquet snapshots (macro_raw_secondary,
features_secondary, features_supervised_secondary) that survive clear_all().
Purpose: downstream steps that drop sparse columns via dropna(axis=1) erase the
full column audit trail. Preservation checkpoints retain every column so you can
always inspect what was available before narrowing.
C7.1: PRESERVATION_CHECKPOINT_NAMES frozenset and preservation_checkpoint_should_write()
decision function in checkpoints.py. Write-once by default; only rewrites when
force=True (from --refresh-preservation flag).
C7.2: RunConfig.refresh_preservation_checkpoints field + --refresh-preservation
argparse flag in run_pipeline.py.
C7.3: Step 1 saves macro_raw_secondary after macro_raw checkpoint.
C7.4: Step 2 saves features_secondary and features_supervised_secondary after
the primary features and features_supervised checkpoints.
C7.5: clear_all() updated to skip preservation files by default. New kwarg
include_preservation=True removes them too. 10 new tests across test_checkpoints.py
(7 preservation tests) and test_runtime.py (3 for new flag).
Added 10 new cells (5 markdown + 5 code) to notebooks/02_features.ipynb:
- D2.1: Gap-fill before/after overlays for `log_sp500`, `log_us_cpi`, `log_10yr_ustreas`. Replays the cross-ratios → log → select pipeline to build a pre-gap-fill snapshot, then calls `plot_gap_fill_before_after()` for visual comparison.
- D2.2: Feature variance ranking bar chart via `plot_feature_variance_ranking(top_n=30)`. Identifies which features dominate PCA and which contribute little.
- D2.3: Centered vs causal comparison via `plot_centered_vs_causal_comparison()` for `log_sp500_d1`, `log_us_cpi_d1`, `credit_spread_d1`. Shows the look-ahead effect at regime transitions, where centered smoothing blurs boundaries.
- D2.4: Derivative magnitude distributions — 4×3 histogram grid (d1/d2/d3 for `log_sp500`, `log_us_cpi`, `credit_spread`, `log_cape_shiller`). Shows std and kurtosis per panel to identify features with heavy-tailed dynamics.
- D2.5: Divergence & momentum feature correlation heatmap (seaborn). Flags pairs with |r| > 0.8 as redundancy candidates. Covers `div_*`, `*_mom_*`, `corr_*`, `cpi_acceleration` columns.
Added 16 new cells (8 markdown + 8 code) to notebooks/03_clustering.ipynb. The notebook
already had 44 cells with extensive investigation (28 cells for GMM, DBSCAN, Spectral, gap
statistic, SVD). New cells add standardized plotting.py function calls and fill gaps:

D3a — PCA Diagnostics:
- D3a.1: Scree plot via `plot_scree()` with 90% cumulative variance threshold
- D3a.2: PCA loadings heatmap via `plot_pca_loadings(top_n=15)` — top features × 5 components
- D3a.3/D3a.4: Already existed (cells 17-18 for component sweep, cells 20-21 for SVD comparison)
- D3a.5: PC1×PC2 scatter with marginal KDE via seaborn `jointplot` — reveals per-regime separation

D3b — Alternative Clustering Methods:
- D3b.1: GMM BIC surface via `plot_gmm_bic_surface()` (official function vs inline plot in cell 27)
- D3b.2/D3b.3/D3b.5: Already existed (cells 30-31 for DBSCAN, cell 34 for Spectral, cell 24 for gap statistic)
- D3b.4: Method comparison table via `plot_method_comparison_table()` — formatted table-as-figure

D3c — Cluster Quality Deep-Dive:
- D3c.1: Per-sample silhouette plot via `plot_silhouette_samples()` — negative bars = misassigned quarters
- D3c.2: 3D PCA scatter via `plot_regime_colored_pca_3d()` — PC1×PC2×PC3 with regime colors
- D3c.3: Regime duration histogram via `plot_regime_duration_histogram()` + run-length summary stats
- D3c.4: Already existed (cell 38 for pairwise ARI heatmap)
Added 10 new cells (5 markdown + 5 code) to notebooks/04_regimes.ipynb:
- D4.1: Feature-regime overlay for `log_sp500_d1`, `log_us_cpi_d1`, `credit_spread`, `10yr_ustreas_d1` via `plot_feature_regime_overlay()` — time-series with regime-colored bands.
- D4.2: Regime stability metrics via `compute_regime_stability()` from `monitoring.py`. Dual bar chart: persistence probability (P of staying) and average consecutive duration.
- D4.3: Forward transition probability heatmaps for 1Q/4Q/8Q horizons via `compute_forward_probabilities()` and `plot_forward_prob_evolution()`. Prints the highest off-diagonal transition per horizon.
- D4.4: Per-regime feature correlation heatmap via `plot_correlation_change_heatmap(top_n=12)`. Shows structural changes in feature relationships across regimes.
- D4.5: Empirical vs HMM transition matrix comparison (optional, requires `hmmlearn`). Fits a GaussianHMM with the same k as KMeans, shows side-by-side heatmaps + absolute difference.
Added 20 new cells (10 markdown + 10 code) to notebooks/05_prediction.ipynb:

D5a — CV Diagnostics:
- D5a.1: CV fold accuracy bar chart via `plot_cv_fold_accuracy()` — clones RF per fold
- D5a.2: Per-fold confusion matrix grid — 5 side-by-side heatmaps (seaborn)
- D5a.3: Learning curve via `plot_learning_curve()` — train vs test accuracy vs N
- D5a.4: Per-fold class distribution table — pivoted train/test counts, flags folds with zero test samples for any regime
- D5a.5: Temporal accuracy by decade — bar chart showing accuracy per decade (1950s–2020s)

D5b — Model Comparison & Interpretability:
- D5b.1: Decision tree rendering via `plot_decision_tree(max_depth=4)` — trained via the flat API
- D5b.2: Interpretability tree — shallow DT on top-10 RF features, prints `export_text()` rules
- D5b.3: Calibration curve via `plot_calibration_curve()` — reliability diagram per regime
- D5b.4: Model comparison bar — trains RF + DT + LGBM (optional), evaluates with CV, plots grouped accuracy/F1 comparison via `plot_model_comparison_bar()`
- D5b.5: Feature importance comparison via `plot_feature_importance_comparison()` — side-by-side top-20 importances from all available model types
Added 10 new cells (5 markdown + 5 code) to notebooks/06_assets.ipynb:
- D6.1: Per-regime violin plots for 6 key ETFs (SPY, TLT, GLD, QQQ, VNQ, AGG) showing full return distributions, not just medians. Uses seaborn `violinplot` with the regime palette.
- D6.2: Regime-conditional Sharpe ratio table — annualized Sharpe (mean/std × sqrt(4)) per asset per regime. Styled DataFrame with RdYlGn color gradient.
- D6.3: Best/worst asset per regime summary — shows the highest and lowest median return plus the spread between them. Quick reference for portfolio tilts.
- D6.4: Per-regime asset correlation matrices — top-10 ETFs by coverage, side-by-side heatmaps. Reveals crisis-regime correlation spikes vs normal diversification.
- D6.5: ETF data coverage timeline — binary heatmap (green = data available) with decade markers and first-available-date summary per ETF.
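The D6.2 annualization follows directly from quarterly data: Sharpe = (mean / std) × sqrt(4), since there are 4 quarters per year. A minimal sketch (the `regime_sharpe` function name is hypothetical; the notebook computes this over a pandas groupby, and the risk-free rate is taken as 0 here):

```python
import math

def regime_sharpe(returns: list[float], periods_per_year: int = 4) -> float:
    """Annualized Sharpe from quarterly returns within one regime:
    (mean / std) * sqrt(periods_per_year), risk-free rate assumed 0."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # sample variance
    return (mean / math.sqrt(var)) * math.sqrt(periods_per_year)
```

For example, quarterly returns of 2%, 3%, 4% give mean 3%, sample std 1%, so an annualized Sharpe of 3 × 2 = 6.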
Created notebooks/09_diagnostics.ipynb (13 cells) — new notebook for pipeline step 8
diagnostics and Relative Rotation Graph analysis.
- D7.1: Setup + data loading (3 cells). Loads RRG data from `outputs/reports/diagnostics/`, asset prices from `data/raw/`, with a `run_step_if_needed()` helper for prerequisites.
- D7.2: RRG 4-quadrant scatter via `plot_rrg_scatter()`. Handles the column name mismatch between `rrg_for_benchmark()` output (rs_ratio/rs_momentum) and the plot function input (rs/rm) with a rename. Falls back to on-the-fly computation from prices if saved data is unavailable.
- D7.3: Rolling z-score time-series for config-driven ratios (Oil:Gold, Oil:Bonds, Bonds:Gold, Lumber:Gold). ±2σ bands with shaded extreme regions. Uses `rolling_zscore()` from `diagnostics.py`.
- D7.4: Quadrant rotation history — stacked horizontal bar chart showing the fraction of quarters each asset spends in LEADING/IMPROVING/WEAKENING/LAGGING quadrants. Sorted by LEADING frequency. Computes RRG quadrants per quarter using `normalize_100()`.
- D7.5: Percentile rank dashboard — per-ratio histogram with the current value marked, plus a summary table with HIGH (>80th) / LOW (<20th) / NORMAL signal classification. Uses `percentile_rank()` from `diagnostics.py`.
Created notebooks/10_model_comparison.ipynb (23 cells: 11 markdown + 12 code) — new
notebook comparing clustering methods and their soft probability outputs.

Part A — Hard Clustering Comparison (D8a):
- D8a.1: Setup + data loading (4 cells). Loads features, computes PCA, loads KMeans labels, fits GMM/HMM/Spectral on the same PCA space. HMM and Spectral gracefully skip if dependencies are missing.
- D8a.2: Side-by-side PCA scatter — dynamic N-panel layout (one per fitted method), PC1 vs PC2 colored by cluster assignment. Same palette across panels.
- D8a.3: ARI pairwise matrix heatmap via `pairwise_rand_index()` from `cluster_comparison.py`. Seaborn heatmap with YlOrRd colormap.
- D8a.4: Temporal label agreement — rolling 8Q window of unique-label diversity across methods. Pairwise ARI summary. Notes that raw label matching ignores ID permutation.
- D8a.5: Regime timeline comparison — N stacked horizontal timelines with per-method legend and shared x-axis.

Part B — Soft Probabilities (D8b):
- D8b.1: GMM soft probabilities stacked area via `plot_soft_probabilities()`. Reports mean max probability as a sharpness metric.
- D8b.2: HMM soft probabilities stacked area via `plot_soft_probabilities()`. Graceful skip if `hmmlearn` is not installed.
- D8b.3: GMM vs HMM sharpness comparison — dual panel: Shannon entropy time-series + max-probability histogram. Summary table with mean/median max prob, mean entropy, and % confident (>0.8).
- D8b.4: Markov 2-state recession overlay — fits `fit_markov_switching()` on the best available macro derivative (GDP/CPI d1), identifies the recession state by lower mean, overlays recession probability on the KMeans regime timeline. Cross-tabulation table via `compare_markov_kmeans()`.
Created notebooks/11_feature_selection.ipynb (12 cells: 6 markdown + 6 code) — new
notebook for exploring which features matter most for regime classification.
- D9.1: Setup + load RF model importances from `outputs/models/current_regime.pkl` via `extract_rf_feature_importances()`. Also loads the features checkpoint and KMeans labels.
- D9.2: Feature importance cumulative curve via `plot_feature_selection_curve()`. Reports how many features are needed for 90% and 95% cumulative importance.
- D9.3: Recommended feature subset via `recommend_clustering_features(top_k=35)`. Shows a full comparison table with kept/dropped status per clustering feature.
- D9.4: What-if re-clustering with top-35 features vs the full set. Runs the complete PCA + KMeans pipeline on both sets, compares silhouette scores with a bar chart.
- D9.5: Dead feature detector — flags features with < 0.5% importance. Horizontal bar chart with a red dead-threshold line. Also lists clustering features not in the RF model (derivative-only features not used in the supervised step).
Created notebooks/12_divergence_momentum.ipynb (12 cells: 6 markdown + 6 code) — new
notebook for exploring cross-asset divergence and momentum features.
- D10.1: Setup + load features with auto-detection of divergence (`div_*`) and momentum (`*_mom_*`, `*_rs_*`, `acceleration`, `corr_*`) columns. Reports counts of z-score, trigger, and momentum columns found.
- D10.2: Divergence z-score time-series via `plot_divergence_timeseries()` with regime-transition vertical markers. Auto-detects `_z_` columns.
- D10.3: Momentum dashboard via `plot_momentum_dashboard()` — grid of scatter plots colored by regime for all momentum/relative-strength columns.
- D10.4: Divergence trigger leading-indicator analysis. For each trigger column, computes the % of regime transitions preceded by a trigger firing in the prior 1Q/2Q/4Q windows. Reports lift vs the baseline trigger rate. Bar chart for the 2Q lookback.
- D10.5: Feature correlation heatmap (seaborn) of all divergence + momentum columns. Flags pairs with |r| > 0.8 as redundancy candidates.
Completed all 3 items from MONITORING_EXPANSION_PLAN.md Phase E:
- E.1: `send_weekly_email()` gains an optional `plot_paths: list[Path] | None` kwarg. When provided, it builds a multipart/related HTML email: plain-text alternative + HTML body with `<img src="cid:plot_N">` inline references + `MIMEImage` attachments with Content-ID headers. The HTML body wraps the report text in `<pre>` (XSS-safe via `html.escape()`), followed by a "Key Plots" section with one image per plot. Without `plot_paths`, behavior is unchanged (plain text, fully backward compatible). New helpers: `resolve_plot_paths()` resolves filenames to existing `Path` objects (logs a WARNING for missing files); `_build_html_body_with_plots()` generates the HTML.
- E.2: `config/email.example.yaml` gains an `attach_plots:` key — a list of PNG filenames from `outputs/plots/` to embed inline. Default: `03_regime_pca_scatter.png`, `05_cv_fold_accuracy.png`, `05_confusion_matrix.png`, `07_forward_prob_evolution.png`, `04_feature_regime_overlay.png`. Set to `[]` or remove for a plain-text-only email.
- E.3: `scripts/run_weekly_report.py` reads `cfg.get("attach_plots", [])`, calls `resolve_plot_paths()` against `outputs/plots/`, prints the attachment count, and passes resolved paths to `send_weekly_email(plot_paths=...)`.
11 new tests in tests/test_email_weekly.py (total: 30): resolve_plot_paths (3 tests),
_build_html_body_with_plots (2 tests), send_weekly_email with plots (4 tests).
1 existing test in tests/test_scripts_weekly_report.py updated for new kwarg.
This completes Phase 4 (Pipeline Monitoring & Notebook Expansion) — all phases A through E done.
Three independent sources of non-determinism eliminated:
Root cause 1 — market_code in gap-fill valid-row logic (transforms.py):
_fill_column() and apply_derivatives() previously used df[[col, "market_code"]].dropna()
to find valid rows. market_code NaN patterns differ by label source (--market-code grok
vs clustered vs predicted), so gap-fill boundaries changed between runs, altering all
downstream derivative values. Fix: use only df[[col]].dropna() — the feature column alone
determines valid rows. market_code is a label, not a feature.
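The root-cause-1 fix is a one-line change, but its effect is easiest to see side by side. A minimal reconstruction with a toy DataFrame (the helper names `valid_rows_buggy`/`valid_rows_fixed` are illustrative; the real logic lives inside `_fill_column()` and `apply_derivatives()` in transforms.py):

```python
import numpy as np
import pandas as pd

def valid_rows_buggy(df: pd.DataFrame, col: str) -> pd.Index:
    # OLD: market_code NaNs leak into the valid-row mask, so the gap-fill
    # boundary shifts whenever the label source changes.
    return df[[col, "market_code"]].dropna().index

def valid_rows_fixed(df: pd.DataFrame, col: str) -> pd.Index:
    # NEW: the feature column alone determines valid rows.
    return df[[col]].dropna().index

df = pd.DataFrame({
    "log_sp500": [1.0, np.nan, 3.0, 4.0],
    "market_code": [np.nan, 0, 1, np.nan],  # label NaN pattern varies by source
})
```

With this toy frame the fixed mask keeps rows 0, 2, 3 (every row where `log_sp500` is present), while the buggy mask keeps only row 2; any row where the label happened to be missing was silently treated as a data gap.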
Root cause 2 — No global numpy/random seed (pipeline.py:main()):
Individual sklearn models had random_state=42, but any stochastic operation not explicitly
seeded was non-deterministic. Fix: np.random.seed(seed) and random.seed(seed) are called
once at the start of main(), seeded from cfg["pipeline"]["random_state"] (default 42).
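The root-cause-2 fix amounts to a few lines at the top of `main()`. Sketched here as a standalone helper (the `seed_everything` name is hypothetical; the config key `pipeline.random_state` and the default of 42 are from the entry above):

```python
import random
import numpy as np

def seed_everything(cfg: dict) -> int:
    """Seed all global RNGs once, so stochastic operations without an explicit
    random_state are reproducible. Per-model random_state=42 still applies."""
    seed = cfg.get("pipeline", {}).get("random_state", 42)
    np.random.seed(seed)
    random.seed(seed)
    return seed
```

Note this seeds the legacy global NumPy RNG, which is what un-seeded sklearn operations draw from; code using the newer `np.random.default_rng()` generators would need its own seeding.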
Root cause 3 — Silent dropna(axis=1) in step 5 (pipeline.py:step5_predict()):
Different NaN patterns (caused by root cause 1) produced different surviving column sets,
giving the RF different features across runs. Fix: log a WARNING listing every dropped column
so the user can see the variability. Long-term fix: pin the column list explicitly.
New `pipeline.random_state` config key in settings.yaml (under the `pipeline:` section).
4 new determinism tests in test_transforms.py.
from __future__ import annotations added to transforms.py (was the last missing module).
statsmodels warnings from test_markov.py suppressed at two levels:
- `[tool.pytest.ini_options] filterwarnings` in `pyproject.toml` — message-pattern filters that work even when statsmodels is not installed (avoids `PytestConfigWarning`).
- `pytestmark` in `test_markov.py` upgraded from a single `skipif` to a list including `filterwarnings` marks.
requirements-dev.txt updated with all optional deps required for a zero-skip run:
hmmlearn, statsmodels, hdbscan, lightgbm, lxml, cssselect, kneed. Each
annotated with which test file it unlocks.
README.md gains a "Running Tests" subsection with a table mapping packages to unlocked tests and a note on why statsmodels warnings are suppressed.
write_weekly_report_md() API extended with two optional kwargs:
- `diagnostics_df` — DataFrame of ratio z-scores; top-5 by |z| shown under `## Diagnostics` with a direction (HIGH/LOW) tag.
- `rrg_df` — DataFrame[asset, quadrant] from `compute_rrg()`; LEADING/IMPROVING/WEAKENING/LAGGING counts + leading asset names shown.

Both are handled by the new `_append_diagnostics_section()` helper. Fully backward-compatible — existing callers without these args see no change.
_markdown_to_html() added to email.py — stdlib-only markdown → HTML conversion for
the weekly report subset (##, #, **bold**, - list). XSS-safe via html.escape().
No external markdown dep (no markdown, mistune, or pygments required).
All emails now send multipart/alternative (plain + HTML). Previously the no-plots path
sent plain text only; now both paths include an HTML alternative so email clients always
render structured headings and lists rather than raw markdown syntax.
_build_html_body_with_plots() rewritten to use _markdown_to_html() instead of
<pre>-wrapping. HTML body is now readable in any email client without monospace overrides.
Workflow deduplication: Removed 3 of 6 GitHub Actions workflows:
- `python-app.yml` — single-version (3.10 only) CI, a strict subset of `python-package.yml`
- `publish.yml` — release-triggered publish via PYPI_API_TOKEN; superseded by `publish-app.yml`
- `python-publish.yml` — unfinished GitHub boilerplate using OIDC trusted publishing; conflicted with `publish-app.yml` on the same `release: published` trigger, causing duplicate uploads.

Retained: python-package.yml (multi-version CI, 3.10–3.13), publish-lib.yml (lib-v* tags),
publish-app.yml (v* tags). Each package now has exactly one publish trigger.
Mypy added: [tool.mypy] section in root pyproject.toml with ignore_missing_imports = true
and warn_unused_configs = true. Informational mypy step added to python-package.yml (exit-zero
for now; blocking mode deferred until type coverage improves). Roadmap toward warn_return_any,
warn_unreachable, disallow_untyped_defs documented in the config comment.
Pre-commit hooks: .pre-commit-config.yaml created with pre-commit-hooks (trailing whitespace,
EOF, YAML check, large-file guard), flake8 (syntax errors only, --select=E9,F63,F7,F82), and
mypy (on src/ only, --ignore-missing-imports).
Decision on mypy scope: Not strict yet because the codebase has many public functions with complete type hints but also internal helpers without annotations. Making CI fail on mypy now would block every PR. Incrementally enabling stricter settings is the correct approach.
STEPS dict holds direct function references (not names): STEPS in pipeline.py is built at
module import time as {1: ("desc", step1_ingest), ...}. Patching the module attribute
trading_crab.pipeline.step1_ingest replaces the name in the module namespace but does not
affect what STEPS[1] already points to. Tests that mock pipeline step dispatch must patch
the STEPS dict entries directly (replace STEPS[k] = (desc, mock)) and restore afterward.
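The pitfall and its remedy can be demonstrated with a minimal stand-in module (the `dispatch()` driver below is hypothetical; the point is that `STEPS` captures the function *object* at definition time, so later rebinding of the module-level name is invisible to it):

```python
from unittest import mock

# Minimal stand-in for trading_crab.pipeline:
def step1_ingest():
    return "real"

STEPS = {1: ("ingest", step1_ingest)}   # captures the function object itself

def dispatch(step_num):
    desc, fn = STEPS[step_num]
    return fn()

# Wrong: rebinding the module-level name does not change STEPS[1].
step1_ingest = lambda: "patched-name"
assert dispatch(1) == "real"

# Right: patch the dict entry; mock.patch.dict restores it on exit.
with mock.patch.dict(STEPS, {1: ("ingest", lambda: "patched-entry")}):
    assert dispatch(1) == "patched-entry"
assert dispatch(1) == "real"
```

`mock.patch.dict` handles the save-and-restore automatically, which is safer than manual `STEPS[k] = ...` reassignment in tests.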
cli.run_pipeline() imports main locally: main is imported inside run_pipeline()'s
function body (from trading_crab.pipeline import main). At module load time,
trading_crab.cli has no main attribute. Tests must patch at the source:
patch("trading_crab.pipeline.main"), not patch("trading_crab.cli.main").
Integration test design: tests/integration/test_mini_pipeline.py uses synthetic
_make_synthetic_macro(n_quarters=80) DataFrames — no file I/O, no network, no checkpoints.
The synthetic data includes all columns needed by add_cross_ratios() (sp500, dividend,
fred_gdp, fred_gnp, fred_baa, fred_aaa, etc.). Tests verify: engineer_all() produces valid
output, output is identical on repeated calls (determinism regression), output is independent of
market_code column values (root-cause 1 regression), centered ≠ causal (look-ahead guard
regression), PCA output is 5 components with no NaNs, clustering produces valid labels.
New test files: tests/test_pipeline_smoke.py (12 tests), tests/test_cli_smoke.py
(7 tests), tests/integration/__init__.py, tests/integration/test_mini_pipeline.py (14 tests).
Completed the final A3 type-hint gap across 6 files. All 193 public functions in
trading_crab_lib now have complete annotations (return types + parameter hints).
cls in classmethods is intentionally unannotated per Python convention.
Changes:
- `__init__.py`: `-> dict` return types on `load()` and `load_portfolio()` wrappers
- `monitoring/prediction.py`: `model: object` in `compute_cv_fold_scores`
- `plotting/clustering.py`: `pca_obj: object` in `plot_scree` and `plot_pca_loadings`
- `plotting/prediction.py`: `model: object` / `tree: object` in 4 plot functions
- `prediction/__init__.py`: `-> object` return type on `train_lightgbm` (lgb optional dep)
- `runtime.py`: `import argparse`; `args: argparse.Namespace` in `from_args`
Added validate_config(cfg) to src/trading_crab_lib/config.py:
- Checks all required top-level sections: `data`, `fred`, `multpl`, `features`, `clustering`, `prediction`, `assets`, `dashboard`, `pipeline`, `tactics`
- Validates types of 11 critical scalar keys (int/float/str) with the full dotpath in the error
- Collects all errors before raising — one `ValueError` lists every issue at once
- Called automatically at the end of `load()` (fail-fast before any pipeline step runs)
- Helper `_get_nested(cfg, dotpath)` walks nested dicts via dot-separated paths
- 8 new tests in `tests/unit/test_config.py` (total: 12 including existing portfolio tests)
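The collect-then-raise pattern can be sketched as below. This abbreviates the real validator: the section tuple is shortened and only one typed-key check is shown, so treat the specifics as illustrative rather than the actual config.py code.

```python
REQUIRED_SECTIONS = ("data", "fred", "features", "clustering", "prediction", "pipeline")

def _get_nested(cfg: dict, dotpath: str):
    """Walk nested dicts via a dot-separated path, e.g. 'pipeline.random_state'."""
    node = cfg
    for part in dotpath.split("."):
        node = node[part]
    return node

def validate_config(cfg: dict) -> None:
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in cfg:
            errors.append(f"missing section: {section}")
    # Typed scalar checks, reported with the full dotpath (abbreviated to one here).
    for dotpath, typ in [("pipeline.random_state", int)]:
        try:
            value = _get_nested(cfg, dotpath)
        except (KeyError, TypeError):
            continue  # the missing section/key is already reported above
        if not isinstance(value, typ):
            errors.append(f"{dotpath}: expected {typ.__name__}, got {type(value).__name__}")
    if errors:  # raise once, listing every issue
        raise ValueError("config validation failed:\n" + "\n".join(errors))
```

Collecting every error before raising is the key usability choice: a user with three config mistakes fixes them in one edit-run cycle instead of three.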
E1 — lib MANIFEST.in fixed: The recursive-include trading_crab_lib *.py py.typed line
was a no-op (no trading_crab_lib/ subdirectory inside src/trading_crab_lib/). Python
source files are found via setuptools package discovery (where = [".."]); py.typed is
covered by [tool.setuptools.package-data]. MANIFEST.in now contains only the metadata
include and exclusions with an explanatory comment.
E2 — CLAUDE.md layout tree updated:
- Notebooks list extended from 08 to 12 (added 09-12 with descriptions)
- Library tree updated: `plotting.py` → `plotting/` package (9 submodules); `monitoring.py` → `monitoring/` package (5 submodules); added `divergence.py`, `momentum.py`, `indicators.py`, `yield_curve_features.py`, `macrotrends.py`, `ingestion/__init__.py`, `prediction/gradient_boosting.py`
- Tests section updated from 571 to ~769 tests; added all new test files added since v0.1.2 (test_hmm, test_markov, test_lightgbm, test_divergence, test_momentum, test_indicators, test_macrotrends, test_pipeline_smoke, test_cli_smoke, integration/)
- This ADR log updated with D45–D47
E3 — README.md badge updated: tests-635%20passing → tests-769%20passing.
STATE.md new total updated to ~769.
trading_crab_lib.config.load() now accepts three input forms:
- `None` (default) — reads `config/settings.yaml` from the repo root (backward-compatible).
- `Path | str` — reads from the given YAML file path.
- `dict` — accepts a pre-built config dict directly, bypassing all file I/O. Validation and FRED key injection still run.
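The three-form dispatch can be sketched as below. `load_config` is a hypothetical stand-in for `trading_crab_lib.config.load()`; the validation and FRED-key-injection hooks are omitted, and the YAML branch assumes PyYAML is available.

```python
from pathlib import Path

def load_config(source=None):
    """Dispatch on the three accepted input forms: None, path, or dict."""
    if source is None:
        source = Path("config") / "settings.yaml"   # repo-root default
    if isinstance(source, dict):
        return dict(source)                         # pre-built dict: no file I/O
    if isinstance(source, (str, Path)):
        import yaml                                 # assumes PyYAML is installed
        with open(source) as fh:
            return yaml.safe_load(fh)
    raise TypeError(f"unsupported config source: {type(source).__name__}")
```

Checking `dict` before `str | Path` matters: the dict branch must never touch the filesystem, which is what makes the pip-install-without-clone and env-var-injected Docker use cases work.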
This enables clean pip install trading-crab-lib usage without a git clone — callers
can construct the config programmatically and pass it to load(). Also useful for
Docker/CI environments where config is injected via environment variables.
5 new tests in TestLoadDictConfig (total test_config.py: 17).
Two-stage Dockerfile:
- Stage `base` — Python 3.11-slim + system build tools + core library deps only. Useful as a lightweight base for custom downstream images.
- Stage `pipeline` (default) — extends `base` with all optional extras (`[ingestion,plotting,boosting]`), the `trading-crab` app package, and the `tradingcrab` CLI entry point. Optionally installs `k-means-constrained`.
Runtime directories (/app/config, /app/data, /app/outputs) are pre-created
inside the image but designed to be overridden by bind mounts. All secrets pass
through environment variables (FRED_API_KEY, TC_SMTP_*, etc.) — none are baked
into the image. TC_CONFIG_DIR, TC_DATA_DIR, TC_OUTPUT_DIR are pre-set to
/app/config, /app/data, /app/outputs respectively, matching the expected volume
mount points.
.dockerignore excludes: .env, secrets, data/, outputs/, .venv,
gsd-scratch-work/, trading-crab-lib/, notebooks/, legacy/, and build artefacts.
Three services defined via YAML anchors (x-pipeline-base):
- `weekly-report` — one-shot service (`restart: no`) that runs `tradingcrab --refresh --recompute --steps 1,2,3,4,5,6,7 --weekly-report --send-email`. Designed for cron (`0 7 * * 5 docker compose run --rm weekly-report`) or GitHub Actions.
- `pipeline` — interactive runner with a `CMD ["--help"]` override; use `docker compose run --rm pipeline --steps 3,4,5` for ad-hoc step execution.
- `notebook` — overrides `ENTRYPOINT` to `jupyter lab` and exposes port 8888. Notebooks are mounted from `./notebooks` on the host so edits persist.
All three services share the x-pipeline-base anchor: same image build, same
env_file: .env, same volume mounts (./config:ro, ./data, ./outputs), same
TC_* path overrides.
README.md updated with a Docker quick-start section.