Documents the "why" behind key design decisions so future contributors don't accidentally break invariants that look arbitrary.
Operational runs: Golden-path commands, checkpoint hygiene, and market_code discipline live in RUNBOOK.md.
Decision: Step 2 produces two separate parquet files from the same raw data.
Why:
- Centered smoothing (`causal=False`) uses both past and future neighbors in each rolling window. This is mathematically superior for interpolating genuinely missing historical data and for characterizing what a regime "looks like" across its full span. Used for: clustering (step 3), regime profiling (step 4).
- Causal smoothing (`causal=True`) uses only past data in every rolling window. This exactly replicates what you could compute at the end of a quarter with only the information available at that moment. Used for: supervised learning (step 5), live scoring (steps 5-7).
- The critical invariant: if a feature is computed with a centered window, it contains information from the future. Training a supervised model on centered features and then trying to score "today's" data is look-ahead bias — the model learned patterns that cannot be reproduced in real time.
Alternative considered: single file with a flag column — rejected because it leads to accidental mixing of centered and causal features when steps share files.
Column names are identical in both files. This is intentional: steps 5-7
import the causal file as a drop-in replacement. The checkpoint manager uses
`"features"` vs `"features_supervised"` keys to distinguish them.
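The centered-vs-causal distinction can be seen directly in pandas rolling windows (a minimal sketch; the pipeline's actual smoothing function and window length may differ):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Centered window: each point sees past AND future neighbors (look-ahead).
centered = s.rolling(window=3, center=True, min_periods=1).mean()

# Causal window: each point sees only itself and the past.
causal = s.rolling(window=3, center=False, min_periods=1).mean()

# At index 1, centered = mean(1, 2, 3) = 2.0 — it uses index 2, a future
# value; causal = mean(1, 2) = 1.5 uses only what was known at the time.
```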
Decision: `n_pca_components: 5` in `settings.yaml`. Not "keep 90% variance".
Why:
- The legacy script established 5 components as the working baseline after experimenting with scree plots on the actual 69-column feature matrix.
- More importantly: changing the number of PCA components changes the clustering geometry, which changes the cluster assignments, which invalidates any manually pinned regime names in `config/regime_labels.yaml`.
- Variance-threshold PCA is non-deterministic across data updates (as more data arrives, the cumulative variance curve shifts). Fixed components are reproducible.
When to revisit: if the feature set changes substantially (e.g., adding 20 new FRED series), re-run the scree plot and benchmark silhouette scores for 3, 5, 7, 10 components before changing. Document the new choice here.
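A sketch of the difference with plain scikit-learn PCA (the 300×69 shape mirrors the feature matrix described above, but the data here is just random noise):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 69))  # ~300 quarters x 69 features, noise for illustration

# Fixed component count: the output shape is stable across data updates.
Z = PCA(n_components=5, random_state=0).fit_transform(X)

# Variance-threshold PCA: passing a float keeps enough components to reach
# that variance fraction, so the count depends on the data itself and can
# change every time new quarters arrive.
Z_var = PCA(n_components=0.90).fit_transform(X)
```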
Decision: We use `balanced_cluster` (from `KMeansConstrained`) for all downstream
steps, not `cluster` (from standard KMeans with best-k from silhouette).
Why:
- Per-regime statistics require sufficient samples to be meaningful. Standard KMeans often produces clusters of wildly different sizes (e.g., 70% in one cluster).
- With only ~300 quarters, a cluster of 10 quarters has unreliable mean/std estimates.
- `KMeansConstrained(size_min=bucket-2, size_max=bucket+2)` ensures each regime has ~60 quarters at k=5, giving reliable statistics for all downstream computations.
Tradeoff: balanced clustering slightly distorts cluster geometry — some quarters near a boundary get assigned to a less-natural regime to meet the size constraint. This is acceptable: the goal is interpretable regimes with robust statistics, not geometrically pure clusters.
Alternative considered: hierarchical clustering — rejected because it doesn't produce equal-size clusters and has no clear stopping rule for k.
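The size bounds follow from simple bucket arithmetic (counts are illustrative; the `KMeansConstrained` call is shown only as a comment, with the signature assumed from the package named above):

```python
# Size-bound arithmetic behind the constrained clustering.
n_quarters = 300
k = 5
bucket = n_quarters // k          # ~60 quarters per regime at k=5

size_min, size_max = bucket - 2, bucket + 2

# Feasibility check: k clusters of size_min..size_max must be able to
# cover all samples exactly.
# Assumed call, per the k-means-constrained package:
# KMeansConstrained(n_clusters=k, size_min=size_min, size_max=size_max)
```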
Decision: Gap fill happens AFTER log transform.
Why:
- The raw series (e.g., S&P 500, GDP) are exponential-looking. Interpolating between two values of 1000 and 2000 in linear space overshoots the midpoint. In log space, the midpoint of [log(1000), log(2000)] = log(1414) is correct.
- Bernstein polynomials require 4 boundary conditions per side (value, d1, d2, d3). All three derivatives must also be computed in log space for consistency.
- Invariant: the order is always: cross-ratios → log → select → gap-fill → derivatives → select. Do not move gap fill before log transform.
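The overshoot is easy to verify numerically — the log-space midpoint is the geometric mean:

```python
import numpy as np

lo, hi = 1000.0, 2000.0

# Linear-space midpoint overshoots for exponential-looking series:
linear_mid = (lo + hi) / 2                              # 1500.0

# Log-space midpoint is the geometric mean sqrt(lo * hi):
log_mid = float(np.exp((np.log(lo) + np.log(hi)) / 2))  # ~1414.2
```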
Why Bernstein polynomials (not linear/cubic spline)?
- `BPoly.from_derivatives` produces a polynomial that exactly matches value + first 3 derivatives at both endpoints. This makes the interpolation smooth and compatible with the derivative features we compute afterward.
- Cubic splines minimize curvature globally; Bernstein interpolates boundary conditions locally. For gap filling (usually 1-4 quarters), local is better.
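A minimal sketch of the interior fill with SciPy's `BPoly.from_derivatives` (the boundary values here are made up for illustration; the pipeline computes them in log space):

```python
import numpy as np
from scipy.interpolate import BPoly

# A 4-quarter interior gap: value + first 3 derivatives known at each edge.
x = [0.0, 4.0]
yi = [
    [1.0, 0.5, 0.1, 0.0],   # f, f', f'', f''' at the left edge
    [3.0, 0.4, -0.1, 0.0],  # same at the right edge
]
poly = BPoly.from_derivatives(x, yi)

# The fill exactly reproduces the boundary conditions:
left_value = float(poly(0.0))               # matches f at the left edge
left_slope = float(poly.derivative()(0.0))  # matches f' at the left edge
right_value = float(poly(4.0))              # matches f at the right edge
```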
Decision: Use Taylor expansion (not Bernstein) for leading and trailing edge gaps.
Why:
- Bernstein requires boundary conditions on both sides. For edge gaps (missing data at the start or end of the time series), one side has no neighbors.
- Taylor extrapolation uses value + d1 + d2 + d3 at the known edge to project outward: f(x+h) ≈ f(x) + h·f'(x) + (h²/2)·f''(x) + (h³/6)·f'''(x)
- This is mathematically consistent with the interior Bernstein approach.
Decision: DataFrames always use parquet; sklearn models use pickle.
Why (parquet):
- Smaller files (columnar compression)
- Typed (dtypes are preserved)
- Human-inspectable (duckdb, pandas, parquet-viewer all read it)
- No Python version lock-in (unlike pickle for DataFrames)
Why (pickle for models):
- sklearn's model serialization format is pickle. There is no parquet-serializable alternative for a fitted `RandomForestClassifier`.
- Risk: pickle files are Python-version-sensitive. If you retrain on Python 3.11 and score on Python 3.10, the pickle may fail. Mitigation: use `joblib.dump`, which is slightly more stable than raw pickle for sklearn objects. (TODO: migrate from `pickle.dump` to `joblib.dump` in `pipelines/05_predict.py`)
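A sketch of the suggested `joblib` round trip (toy data and a temporary path; the real model and file location live in the pipeline):

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Toy fit; the real model is trained in step 5.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# joblib round trip in place of raw pickle.dump / pickle.load:
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)
```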
Decision: `fred_gdp` and `fred_gnp` are shifted +1 quarter.
Why:
- The BEA releases the "advance estimate" of GDP approximately 30 days after quarter end. The "third estimate" (most revised) comes ~90 days later.
- At the end of Q1 you cannot know Q1 GDP. The first GDP reading you have at the end of Q1 is the Q4 (previous year) third estimate.
- Not shifting these series introduces look-ahead bias: the model learns to use "future" GDP data as if it were available today.
- This is set in `config/settings.yaml` as `shift: true` per series.
- Invariant: all FRED series that are subject to significant revision and have a publication lag longer than one quarter should be shifted.
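The shift itself is a one-liner in pandas (series values and index are illustrative; real series come from FRED):

```python
import pandas as pd

# Illustrative quarterly GDP series.
idx = pd.period_range("2020Q1", periods=4, freq="Q")
gdp = pd.Series([100.0, 101.0, 102.0, 103.0], index=idx)

# shift(1): at the end of quarter t, the feature holds the t-1 reading,
# matching what was actually knowable at that moment.
shifted = gdp.shift(1)
```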
Decision: All runtime behavior is controlled by a single `RunConfig` object
passed through the pipeline, not by global variables or config file values.
Why:
- Avoids action-at-a-distance bugs where a deeply nested module checks a global flag that was set somewhere else
- Makes the pipeline deterministic and testable: pass `RunConfig(generate_plots=False)` in tests to skip all matplotlib code without monkeypatching globals.
- The dataclass `from_args()` factory method converts an argparse `Namespace` to `RunConfig` in `run_pipeline.py` — the only place argparse is used.
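A minimal sketch of the pattern (the field set and flag name here are illustrative, not the real `RunConfig`):

```python
import argparse
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    generate_plots: bool = True

    @classmethod
    def from_args(cls, ns: argparse.Namespace) -> "RunConfig":
        # The only place argparse output is touched.
        return cls(generate_plots=ns.generate_plots)

parser = argparse.ArgumentParser()
parser.add_argument("--no-plots", dest="generate_plots", action="store_false")

cfg = RunConfig.from_args(parser.parse_args([]))
# In tests: RunConfig(generate_plots=False) skips matplotlib code directly,
# no monkeypatching of globals required.
```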
Decision: Always produce both `cluster` and `balanced_cluster`, even though
only `balanced_cluster` is used downstream.
Why:
- `cluster` (unconstrained, best-k from silhouette) serves as a geometric reference: if `balanced_cluster` looks very different from `cluster`, it means the size constraint is distorting the natural clusters significantly.
- Having both lets you visually compare in notebooks without re-running clustering.
- The k-sweep silhouette scores that determine `best_k` for `cluster` are also saved (`data/regimes/kmeans_scores.parquet`) for elbow-curve visualization.
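A crosstab makes the comparison concrete (labels below are made up):

```python
import pandas as pd

# Hypothetical assignments for six quarters.
df = pd.DataFrame({
    "cluster":          [0, 0, 0, 1, 1, 2],
    "balanced_cluster": [0, 0, 1, 1, 1, 2],
})

# A mostly-diagonal crosstab means the size constraint barely moved points;
# heavy off-diagonal mass means it is distorting the natural geometry.
ct = pd.crosstab(df["cluster"], df["balanced_cluster"])
```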
Decision: `initial_features` and `clustering_features` lists live in
`config/settings.yaml`, not hardcoded in Python.
Why:
- These lists were analytically determined by examining which series have coverage back to ~1950 and which derivatives are informative for clustering.
- Putting them in YAML lets you experiment without touching Python source code.
- Invariant: changing `clustering_features` changes the clustering geometry and invalidates any manually pinned `regime_labels.yaml`. Update the YAML file, delete the old `cluster_labels.parquet` checkpoint, and re-run steps 3-7 before committing.
Decision: Notebooks call functions from `src/trading_crab_lib/plotting.py`; they
do not define plotting logic inline.
Why:
- Reusability: the same plot is needed in the notebook AND in the CLI `--plots` mode.
- Testability: plotting functions can be tested by mocking `matplotlib.pyplot`.
- Consistency: all plots use the same `CUSTOM_COLORS` palette and output naming scheme.
- DRY: prevents three slightly-different versions of the same chart drifting apart.
Consequence: if you need a new plot, add it to `plotting.py` first, then call
it from the notebook. Never define a `fig, ax = plt.subplots()` block directly in
a notebook cell that isn't backed by a function.