This guide describes how ROBUST works internally, how its components fit together, and how to tune each stage to suit your dataset and runtime budget. It is intended for users who want to go beyond defaults and for developers who need to understand or extend the code.
- Pipeline overview
- Stage 1: Input validation and preprocessing
- Stage 2: Bootstrap stability selection
- Stage 3: Nested cross-validation
- Stage 4: Final model and cutoff
- Reproducibility by design
- Parallelism and runtime
- Memory usage
- Tuning for speed
- Tuning for rigor
- Algorithm internals
- Handling class imbalance
- Benchmark split methodology: BenchMake archetypal splits
- Understanding the scoring metrics
- Extending ROBUST
ROBUST runs five distinct phases:
Input X, y
|
v
[1] Validation + Preprocessing
|
v
[2] Nested CV with per-fold stability selection
| For each outer fold:
| - Preprocess training fold (impute + scale)
| - Stability selection: bootstrap n_bootstrap times,
| record which features are selected above threshold
| - Inner CV: RandomizedSearchCV on selected features only
| - Score on outer-fold test set (no preprocessing of test fold
| from training information)
|
v
[3] Final stability selection on all training data
|
v
[4] Final model fit on selected features (full training data)
|
v
[5] Cutoff determination (binary only): bootstrap controls
The critical design property is that the test fold is never seen during preprocessing, feature selection, or hyperparameter tuning within any fold. Each of those operations happens entirely on the training partition of that fold. This prevents optimistic bias and ensures that the nested CV score is an honest estimate of generalisation performance.
ROBUST validates inputs before running anything. Errors are raised immediately and clearly:
Xmust be a 2D numpy array or pandas DataFrame with at least 4 samples and at least 1 feature.- Infinite values in
Xraise an error (NaNs are allowed; infinities are not). - All-missing feature columns raise an error.
- Duplicate feature names raise an error.
- For classification: each class must appear in at least
min(outer_cv, inner_cv)samples. - Group labels must have the same length as
yand contain at least 2 distinct groups.
Preprocessing is a two-step sklearn Pipeline:
- Median imputation (
SimpleImputer(strategy="median")): replaces NaNs with the column median, computed from training data only. - Standard scaling (
StandardScaler()): applied only whenpreprocess="standard"orpreprocess="auto"withalg="eln". Tree methods (rf, xgb) do not require scaling.
The preprocessor is fitted independently inside each outer CV fold on the training partition. The final preprocessor is refitted on all training data after CV is complete. This means that imputation values and scaling statistics are different across folds and across the final model fit, all determined only from the data available at that stage.
When preserve_nans=False, ROBUST first runs _smart_drop_nans:
- For each combination of column-missingness threshold (0.1 to 0.9) and row-missingness threshold (0.1 to 0.9), it scores the retained region:
density * sqrt(retained_row_fraction) * sqrt(retained_col_fraction). This balances data completeness against sample retention. - The thresholds achieving the best score are selected.
- Columns and rows exceeding their respective thresholds are dropped.
- Remaining NaNs are handled by the per-fold median imputer.
Use preserve_nans=False when a substantial fraction of features or samples is very sparse. On the Graphene Oxide benchmark (1617 x 309 after dropping all-NaN and constant columns), this approach retains a consistent feature set for all folds and prevents dimension mismatches during nested CV.
For each of n_bootstrap iterations:
- Draw a stratified subsample of
sample_fraction * n_samplesrows (stratified for classification, random for regression). - Fit a fixed-configuration version of the chosen algorithm on the subsample (no hyperparameter search at this stage).
- Extract feature importances:
|coef_|for elastic net (max over classes for multiclass),feature_importances_for tree methods. - Mark as selected any feature with importance above the median of all importances for that bootstrap run.
- Accumulate selection counts.
After all bootstraps, divide counts by n_bootstrap to get selection frequencies in [0, 1]. Features with frequency >= stability_threshold are included in the final feature set.
During stability selection, no hyperparameter search is performed. Each algorithm uses a fixed configuration that balances speed with informativeness:
| Algorithm | Fixed config |
|---|---|
eln |
classification: C=1.0, l1_ratio=0.7; regression: alpha=0.01, l1_ratio=0.7 |
rf |
n_estimators=80, max_depth=10 |
xgb |
n_estimators=80, max_depth=6, learning_rate=0.05 |
rdg, las, log, svm, mlp, lin |
constructor defaults (no override applied) |
These values were chosen to give useful feature rankings without overfitting. The same configuration is used in every bootstrap run, so the only source of variation is the subsample.
Stability selection runs inside each outer fold, on the preprocessed training data for that fold. This means:
- Feature selection frequencies are computed from the fold's training data only.
- The test partition never influences which features are selected.
- Different folds may select slightly different feature sets; this is expected and its extent can be inspected via
nested_cv_result.selected_features_per_foldandnested_cv_result.feature_stability.
n_bootstrap |
stability_threshold |
Effect |
|---|---|---|
| 15 (fast/benchmark default) | 0.5 | Coarser frequencies (multiples of 1/15); threshold at 50% means selected in 8+ runs |
| 50 | 0.6 | Reasonable approximation for exploratory work |
| 100 (default) | 0.7 | Standard; selected in 70+ out of 100 bootstrap samples |
| 200 | 0.7 | Smoother frequencies; useful when n_features is large |
Fewer than 50 bootstraps can make the frequencies noisy enough that small random fluctuations determine inclusion. For publication-quality results, use n_bootstrap >= 100.
ROBUST implements a true nested design:
Outer fold split (outer_cv folds, repeated repeated_outer_cv times)
|
+-- Training data (outer fold) -->
| 1. Preprocess (fit on training data only)
| 2. Stability selection (bootstrap on training data)
| 3. Inner fold split (inner_cv folds) on selected features
| RandomizedSearchCV: n_iter hyperparameter configs
| scored by inner-fold cross-validation
| 4. Refit best config on full training partition
|
+-- Test data (outer fold) -->
5. Preprocess using training-fitted preprocessor
6. Select features using training-determined set
7. Score (AUC / weighted OVR AUC / neg-RMSE)
8. Store out-of-fold predictions
No information from the test fold touches steps 1-4. This is stricter than a simple train/test split: the hyperparameter search itself is cross-validated on the training data only.
| Dataset size | Recommended outer_cv | Recommended inner_cv |
|---|---|---|
| < 100 samples | 10 (leave-many-out) | 5 |
| 100 to 500 | 5 to 10 | 5 |
| 500 to 2000 | 5 | 3 to 5 |
| > 2000 | 5 | 3 |
More outer folds gives a more reliable performance estimate at the cost of more stability-selection and inner-CV runs. The total number of model fits scales as outer_cv * inner_cv * n_iter * repeated_outer_cv.
Setting repeated_outer_cv > 1 repeats the entire outer CV with a different random seed per repeat. Outer fold scores and out-of-fold predictions are averaged across repeats. This reduces the variance of the performance estimate for small datasets:
maker = RobustModelMaker(alg="eln", task_type="binary",
outer_cv=5, repeated_outer_cv=3, random_state=42)Each repeat uses seed random_state + repeat * 1000, ensuring different fold splits while remaining fully reproducible.
fs = maker.result_.nested_cv_result.feature_stability
# Columns: feature, mean_frequency, std_frequency, selected_in_n_folds
print(fs.head(10))A feature with selected_in_n_folds == outer_cv * repeated_outer_cv was selected in every fold of every repeat, indicating very high stability. High std_frequency suggests the feature's relevance is data-partition-dependent.
After nested CV is complete (and the performance estimate is fully established), ROBUST fits the final model on all training data:
- Fit the final preprocessor on all training data.
- Run stability selection on all training data.
- Select features at
stability_threshold. - Fit a final
RandomizedSearchCVover all training data with inner CV to find the best hyperparameters. - Refit the best estimator on all training data.
The final model is stored in result_.robust_model. It is the model used for predict() and predict_proba().
For task_type="binary", ROBUST determines a probability cutoff by bootstrapping the control-class (negative class) out-of-fold predictions:
- Collect the
outer_predictionsfor all samples with true label 0 (controls). - For each of
cutoff_n_bootstrapbootstrap resamples of the control scores, find thespec-th quantile (defaultspec=0.98, meaning the 98th percentile of control scores). - The final cutoff is the median of the bootstrap cutoff distribution. A 95% CI is also reported.
This cutoff ensures that approximately spec * 100% of controls score below it (i.e. target specificity), while maximising sensitivity. The cutoff is stored in result_.cutoff_result.cutoff_median.
For a different cutoff strategy, use:
from RobustModelMaker import determine_cutoff
# Use a different target specificity
cutoff = determine_cutoff(y_true, oof_scores, target_specificity=0.95)
# Apply manually
predictions = maker.predict(X_new, cutoff=cutoff.cutoff_median)ROBUST is designed to be fully deterministic given the same random_state:
- All random operations use explicit seeds derived from
random_statevia offsets (e.g.random_state + fold_idx,random_state + 10000 + bootstrap_idx). set_global_seed()setsnumpy.random.seedandPYTHONHASHSEED.- Each bootstrap run uses
random_state + 10000 + bso bootstrap sequences are independent of fold sequences. RandomizedSearchCVreceivesrandom_state=fold_seed, not a global RNG state.
The reproducibility test suite (tests/reproducibility_test_suite.py) verifies this with 30 tests covering:
- Identical selected features across two runs with the same seed.
- Identical fold scores (to machine precision).
- Identical stability frequencies.
- Different results with a different seed.
- Deterministic predictions on new data.
Caveats:
- Parallelism (
n_jobs != 1) can introduce ordering non-determinism with some sklearn versions on some platforms. For exact reproducibility, usen_jobs=1. - Floating-point accumulation order may differ across CPU architectures for tree methods.
ROBUST passes n_jobs to RandomizedSearchCV and to stability_selection (which passes it to the model). The total number of model fits is:
fits = outer_cv * repeated_outer_cv * (n_bootstrap + n_iter * inner_cv + 1)
+ n_bootstrap (final stability selection)
+ n_iter * inner_cv (final hyperparameter search)
+ 1 (final model)
For the benchmark defaults (outer_cv=5, inner_cv=2, n_bootstrap=15, n_iter=8, repeated_outer_cv=1):
5 * (15 + 8*2 + 1) + 15 + 8*2 + 1 = 5 * 32 + 32 = 192 fits
For production defaults (outer_cv=10, inner_cv=10, n_bootstrap=100, n_iter=100):
10 * (100 + 100*10 + 1) + 100 + 100*10 + 1 = 10 * 1101 + 1101 = 12,111 fits
Each "fit" involves training on a subsample of the data. Runtime is roughly linear in n_samples * n_features for linear models and super-linear for tree methods.
Peak memory occurs during RandomizedSearchCV with n_jobs > 1, when each worker holds a copy of the training fold. Approximate memory per worker:
bytes_per_worker ≈ n_samples * n_selected_features * 8 bytes (float64)
For large feature spaces, use n_jobs=1 to avoid duplication or set n_jobs to a smaller value (e.g. 2 or 4).
The preserve_nans=False mode can significantly reduce memory by dropping sparse columns and rows before the pipeline runs.
A fast configuration for exploration or CI:
FAST_KWARGS = dict(
outer_cv=3,
inner_cv=2,
n_bootstrap=15,
n_iter=8,
stability_threshold=0.5,
cutoff_n_bootstrap=100,
n_jobs=1,
verbose=False,
random_state=42,
)
maker = RobustModelMaker(alg="eln", task_type="binary", **FAST_KWARGS)The performance test suite (tests/performance_test_suite.py) uses a similar configuration and verifies that binary/multiclass/regression runs complete within a per-sample budget of 0.08 seconds (plus a fixed overhead), enforced on 120-sample synthetic datasets.
Trade-offs when reducing parameters:
| Reduction | What is lost |
|---|---|
Fewer n_bootstrap |
Coarser selection frequencies; more variance in which features are selected |
Fewer n_iter |
Coarser hyperparameter search; potentially worse final model |
Fewer inner_cv |
Noisier inner-fold score estimates; less reliable hyperparameter ranking |
Fewer outer_cv |
Wider confidence intervals on the performance estimate |
Lower n_jobs |
Longer wall-clock time, but lower memory and more reproducible |
A rigorous configuration for final results:
RIGOROUS_KWARGS = dict(
outer_cv=10,
inner_cv=10,
n_bootstrap=200,
n_iter=200,
stability_threshold=0.7,
repeated_outer_cv=3,
cutoff_n_bootstrap=2000,
n_jobs=-1,
random_state=42,
)For very small datasets (< 100 samples), consider:
# Leave-one-out outer CV with repeated splits to reduce estimate variance
maker = RobustModelMaker(
alg="eln", task_type="binary",
outer_cv=10, # use all available folds
repeated_outer_cv=5, # repeat 5 times to stabilise the estimate
n_bootstrap=200,
random_state=42,
)Stability selection: uses LogisticRegression(penalty="elasticnet", l1_ratio=0.7, C=1.0) for classification; ElasticNet(alpha=0.01, l1_ratio=0.7) for regression. The L1 component produces sparse coefficients, so the above-median importance threshold selects roughly the top half of features per bootstrap run.
Nested CV search space:
- Classification:
C ~ LogUniform(1e-4, 1e2),l1_ratio ~ Uniform(0, 1) - Regression:
alpha ~ LogUniform(1e-4, 1e2),l1_ratio ~ Uniform(0, 1)
Preprocessing: always scale (standard normalisation). Required because regularisation strength is not scale-invariant.
Feature importance: |coef_| for binary; max(|coef_|, axis=0) across classes for multiclass.
Stability selection: LogisticRegression(penalty="l2", C=1.0) for classification; Ridge(alpha=1.0) for regression. Uses |coef_| for importance ranking.
Nested CV search space: C ~ LogUniform(1e-4, 1e2) (classification) or alpha ~ LogUniform(1e-4, 1e2) (regression).
Preprocessing: L2 penalty is not scale-invariant; scaling is strongly recommended. With the default preprocess="auto", no scaling is applied for rdg; use preprocess="standard" to enable it.
Stability selection: LogisticRegression(penalty="l1", solver="saga", max_iter=5000, C=1.0, class_weight="balanced") for classification; Lasso(max_iter=10000, alpha=1.0) for regression. Strong L1 sparsity makes this a natural feature selector; most coefficients are zeroed, so the above-median threshold selects a small fraction of features per bootstrap run.
Nested CV search space: C ~ LogUniform(1e-3, 1e2) (classification) or alpha ~ LogUniform(1e-4, 10) (regression).
Preprocessing: L1 penalty is not scale-invariant; scaling is strongly recommended. With the default preprocess="auto", no scaling is applied for las; use preprocess="standard" to enable it.
Stability selection: LogisticRegression(penalty="l2", solver="lbfgs", max_iter=5000, C=1.0, class_weight="balanced"). Raises an error if used with task_type="regression".
Nested CV search space: C ~ LogUniform(1e-3, 1e2).
Preprocessing: L2 penalty is not scale-invariant; scaling is strongly recommended. With the default preprocess="auto", no scaling is applied for log; use preprocess="standard" to enable it.
Stability selection: SVC(kernel="linear", probability=True, C=1.0, class_weight="balanced") for classification; LinearSVR(C=1.0, max_iter=5000) for regression. Uses |coef_| for importance.
Nested CV search space: C ~ LogUniform(1e-3, 1e2).
Preprocessing: SVM margin is distance-based and requires comparable feature magnitudes; scaling is strongly recommended. With the default preprocess="auto", no scaling is applied for svm; use preprocess="standard" to enable it.
Stability selection: n_estimators=80, max_depth=10 with class_weight="balanced_subsample" for classification. Uses feature_importances_ (mean decrease in impurity).
Nested CV search space: n_estimators ~ Randint(100, 500), max_depth ~ Randint(2, 20), min_samples_split, min_samples_leaf, max_features.
Preprocessing: median imputation only (no scaling needed for trees).
Stability selection: n_estimators=80, max_depth=6, learning_rate=0.05. Uses feature_importances_ (weight-based gain).
Nested CV search space: n_estimators, max_depth, learning_rate ~ LogUniform(0.01, 0.3), subsample, colsample_bytree, reg_alpha, reg_lambda.
Preprocessing: median imputation only.
Stability selection: MLPClassifier or MLPRegressor with default architecture and max_iter=300. Feature importance is extracted from the first-layer weight matrix: mean(|coefs_[0]|, axis=1), which gives one importance value per input feature averaged over hidden units.
Nested CV search space: hidden_layer_sizes (16 or 32 units), alpha ~ LogUniform(1e-5, 1e-2), learning_rate_init ~ LogUniform(1e-4, 1e-2).
Preprocessing: neural networks are sensitive to feature magnitude; scaling is strongly recommended. With the default preprocess="auto", no scaling is applied for mlp; use preprocess="standard" to enable it.
Stability selection: LinearRegression(). Raises an error if used with classification task types. Uses |coef_| for importance.
Nested CV search space: fit_intercept fixed to True; a single deterministic fit per inner fold (no meaningful tuning).
Preprocessing: with the default preprocess="auto", no scaling is applied for lin. OLS coefficients are scale-dependent; if your features are on very different scales, use preprocess="standard".
ROBUST addresses class imbalance at two levels:
-
Stratified splitting:
StratifiedKFoldis used for all classification tasks, ensuring class proportions are preserved in each fold. -
Algorithm-level balancing:
- Elastic net (
eln), Ridge (rdg), Lasso (las), Logistic (log), SVM (svm), MLP (mlp):class_weightis not set by default. For severe imbalance these models rely on stratification and the AUC metric. - Random forest (
rf):class_weight="balanced_subsample"is always set, which weights samples inversely proportional to class frequency within each bootstrap sample of the tree. - XGBoost (
xgb): no automatic weighting, but the hyperparameter search coversreg_alphaandreg_lambdawhich help regularise minority-class patterns.
- Elastic net (
-
Stability subsampling:
_stratified_or_random_subsample()draws stratified subsamples for classification, preserving class proportions in each bootstrap run.
For extreme imbalance (< 5% minority class, as in the SECOM benchmark with ~7% failure rate), the benchmark results show that the combination of stratified splitting and class_weight="balanced_subsample" is sufficient for AUC-based evaluation. If recall or F1 on the minority class is the target metric, consider post-hoc threshold selection using determine_cutoff() with a lower target_specificity.
The benchmark suite uses BenchMake to partition each dataset rather than drawing random train/test splits. Understanding why this matters is essential for interpreting benchmark scores correctly.
BenchMake selects an archetypal partition: the train and test sets are chosen so that each is maximally representative of the full dataset's distribution in feature space. Rather than randomly assigning samples, it identifies samples that collectively cover the extremes and interior of the feature space and places them in each partition. The result is that train and test sets are deliberately kept apart in feature space, with minimal distributional overlap.
Random splits, especially stratified ones, tend to produce train and test sets that are statistically similar: the model sees representatives of nearly every region of the input space during training. BenchMake's archetypal splits break this by ensuring the test set contains samples that are structurally distinct from the training set. This simulates a harder generalisation scenario, closer to applying a model to a genuinely new population.
The consequence is that BenchMake benchmark scores are systematically lower than scores you would see with random splits on the same dataset. This is intentional: the benchmark is a conservative lower-bound assessment, not a typical-case estimate.
| Scenario | What to expect |
|---|---|
| You run ROBUST on your data with default random splits | Scores will typically be higher than the benchmark suite reports for the same algorithm |
| You compare your scores to the benchmark table | Lower benchmark scores do not mean ROBUST performs poorly on your data |
You use pytest benchmarks/benchmark_suite.py |
The pass/fail thresholds are calibrated for the adversarial split, not for random splits |
Both ROBUST and the full-feature baseline in the benchmark suite use the same BenchMake split for each dataset. The relative comparison (ROBUST vs. baseline) is therefore valid and fair: any difference in score reflects feature selection, not the split. Only the absolute scores are affected by the adversarial methodology.
| Task | Metric | Score range | Best value | Notes |
|---|---|---|---|---|
| Binary | ROC-AUC | 0 to 1 | 1.0 | 0.5 = random; reported as positive float |
| Multiclass | Weighted OVR ROC-AUC | 0 to 1 | 1.0 | Weighted by class frequency |
| Regression | Negative RMSE | -inf to 0 | 0.0 | Negated so sklearn maximisation applies |
Regression scores are always negative (they are negated RMSE, not RMSE itself). A score of -1.23 means the RMSE is 1.23 in the target's units. A less negative score (e.g. -0.8) is better than a more negative one (e.g. -1.5). When comparing ROBUST to a baseline, a positive delta means ROBUST had a smaller RMSE.
The mean_score and std_score attributes use this sign convention throughout. The floor_score parameter in the benchmark suite is set accordingly (e.g. floor_score=-8.0 for the Graphene Oxide dataset means ROBUST must achieve a mean neg-RMSE better than -8.0, i.e. RMSE < 8.0 eV).
- Add the algorithm code to the
AlgorithmLiteral type alias. - Add a branch to
get_algorithm_config()returning(estimator, param_distributions). - Add a fixed-configuration branch to
stability_selection()(theif alg == ...block that sets model parameters before bootstrapping). - Add a test to
tests/unit_test_suite.pyusing theCLASSIFICATION_ALGS/REGRESSION_ALGSlists.
from RobustModelMaker import stability_selection, nested_cross_validation
# Preprocess first (or let ROBUST handle it inside nested_cross_validation)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pre = Pipeline([("imp", SimpleImputer()), ("sc", StandardScaler())])
X_proc = pre.fit_transform(X_train)
# Run stability selection
stab = stability_selection(X_proc, y_train, feature_names=feature_names,
alg="eln", n_bootstrap=100, threshold=0.7)
selected = stab.selected_features
print(f"Selected: {selected}")The scoring metric is determined by _default_scoring(task_type). To use a different metric, pass the sklearn scorer string as a scoring argument to the underlying nested_cross_validation call or override _default_scoring directly. For permutation_importance, pass scoring= explicitly:
pi = maker.permutation_importance(X_val, y_val,
scoring="balanced_accuracy", n_repeats=20)