This repository contains the code for the research project "Adaptive Early Stopping Cross-Validation in AutoML using Dynamic Threshold". This work is based on the codebase from the paper "Don't Waste Your Time: Early Stopping for Cross-Validation" and its accompanying GitHub repository.
- Installation
- Main Entry Point: e1.py
- Available Commands
- Running Experiments
- Collecting Experiment Results
- Plotting Results with plots.sh
- Experiment Data
- Further Information
- License
To install the package, first clone/unpack the repository, cd into it and then run the following command in a virtual environment (Python 3.10 recommended):
pip install -r requirements.txt # Ensure the same versions of the packages
pip install -e . # Allow the package to be edited and used
You can test that the installation was successful by running:
python e1.py --help
This will list the available commands.
The primary script for interacting with this codebase is e1.py. It provides several commands to run experiments, collect results, and generate plots.
The core implementation of the early stopping methods, central to this project, can be found in src/exps/methods.py.
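As orientation for readers new to the idea, here is a minimal, generic sketch of early stopping during cross-validation. It is not the code in src/exps/methods.py: the function name, the fixed threshold, and the "higher score is better" convention are all assumptions made for illustration, and the project's strategies (such as "dynamic_adaptive_forgiving") adapt the threshold in ways not shown here.

```python
from statistics import mean
from typing import Iterable


def cross_validate_with_early_stop(
    fold_scores: Iterable[float], threshold: float
) -> tuple[list[float], bool]:
    """Consume fold scores one by one and stop once the running mean
    falls below `threshold`.

    Hypothetical helper: assumes higher scores are better and that
    `threshold` is given (e.g. derived from the best configuration so far).
    Returns the scores actually evaluated and whether evaluation stopped early.
    """
    seen: list[float] = []
    for score in fold_scores:
        seen.append(score)
        if mean(seen) < threshold:
            return seen, True  # remaining folds are never evaluated
    return seen, False


scores, stopped = cross_validate_with_early_stop(
    [0.62, 0.58, 0.91, 0.90], threshold=0.70
)
print(scores, stopped)
```

If `fold_scores` is a generator that trains and scores one fold per iteration, stopping early skips the cost of the remaining folds, which is the saving such methods aim for.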
Below are the main commands. Each one documents its options via python e1.py <command> --help.
The run command executes experiments locally.
- Usage: python e1.py run --expname <experiment_name>
- This command runs the specified set of experiments as defined in e1.py.
The submit command submits experiments to a SLURM cluster.
- Usage: python e1.py submit --expname <experiment_name>
- Useful for running large-scale experiments on a computing cluster.
If your SLURM cluster has a limit on the maximum number of jobs that can be submitted in a single array (e.g., 1000 jobs), you can use the --job-array-limit argument. The script will submit the workload in chunks and provide instructions for submitting subsequent chunks.
- Example: python e1.py submit --expname <experiment_name> --job-array-limit 1000
If you have more than 1000 experiments, this command submits the first 1000. It then prints a command for submitting the next chunk, which includes the --chunk-start-idx argument. Repeat this process until all jobs are submitted.
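The chunking arithmetic described above can be sketched as follows. This is an illustration only, not the logic in e1.py: the helper `chunk_starts` is hypothetical, and treating --chunk-start-idx as a zero-based offset into the list of experiments is an assumption.

```python
def chunk_starts(n_jobs: int, limit: int) -> list[tuple[int, int]]:
    """Return (start_index, chunk_size) for each submission needed to
    cover `n_jobs` jobs when at most `limit` fit in one job array.

    Hypothetical helper; assumes zero-based start indices.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [
        (start, min(limit, n_jobs - start))
        for start in range(0, n_jobs, limit)
    ]


# 2500 experiments with a limit of 1000 need three submissions.
for start, size in chunk_starts(2500, 1000):
    print(f"--chunk-start-idx {start}: {size} jobs")
```

In practice, use the exact follow-up command that the submit command prints rather than computing the indices by hand.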
The status command checks the completion status of experiments.
- Usage: python e1.py status --expname <experiment_name>
- Helps in monitoring the progress of submitted or running experiments.
The collect command gathers results from completed experiments and saves them into a Parquet file. This step is required before plotting.
- Usage: python e1.py collect --expname <experiment_name> --out <output_file.parquet>
- The output Parquet file aggregates data from all individual experiment runs within the specified experiment set.
The plot and plot-stacked commands generate visualizations from the collected experiment results.
- Usage (plot): python e1.py plot --kind <plot_kind> --input <collected_data.parquet> [other_options]
- Usage (plot-stacked): python e1.py plot-stacked --input <collected_data1.parquet> <collected_data2.parquet> [other_options]
- These commands are used by the bash/plots.sh script to generate the standard plots for the experiments.
The core experiments for this project are:
- category9-nsplits-10-dynamic: 10-fold cross-validation with the MLP pipeline, including the "dynamic_adaptive_forgiving" early stopping strategy.
- category10-nsplits-10-dynamic: 10-fold cross-validation with the RF pipeline, including the "dynamic_adaptive_forgiving" early stopping strategy.
To run these experiments locally:
# For the MLP experiments (10-fold)
python e1.py run --expname category9-nsplits-10-dynamic
# For the RF experiments (10-fold)
python e1.py run --expname category10-nsplits-10-dynamic
If you have a SLURM cluster, you can use the submit command instead of run.
After the experiments have finished, you need to collect their results. The bash/plots.sh script expects the collected data to be in specific files:
- For category9-nsplits-10-dynamic (MLP results):
  python e1.py collect --expname category9-nsplits-10-dynamic --out data/mlp-nsplits-10-dynamic.parquet
  Ensure the data/ directory exists or is created.
- For category10-nsplits-10-dynamic (RF results):
  python e1.py collect --expname category10-nsplits-10-dynamic --out data/rf-nsplits-10-dynamic.parquet
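The two collect steps above can also be driven from a small script. The sketch below only uses the experiment names, output paths, and flags listed in this README; the helper names `collect_commands` and `prepare_output_dirs` are hypothetical, and the commands are printed rather than executed.

```python
import sys
from pathlib import Path

# Experiment name -> output file expected by bash/plots.sh (from this README).
OUTPUTS = {
    "category9-nsplits-10-dynamic": "data/mlp-nsplits-10-dynamic.parquet",
    "category10-nsplits-10-dynamic": "data/rf-nsplits-10-dynamic.parquet",
}


def collect_commands(outputs: dict[str, str] = OUTPUTS) -> list[list[str]]:
    """Build one `e1.py collect` argument list per experiment set."""
    return [
        [sys.executable, "e1.py", "collect", "--expname", name, "--out", out]
        for name, out in outputs.items()
    ]


def prepare_output_dirs(
    outputs: dict[str, str] = OUTPUTS, root: Path = Path(".")
) -> None:
    """Create the parent directory (e.g. data/) of every output file."""
    for out in outputs.values():
        (root / out).parent.mkdir(parents=True, exist_ok=True)


for cmd in collect_commands():
    # To run instead of print, from the repository root:
    #   import subprocess; subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```

Calling `prepare_output_dirs()` from the repository root before collecting covers the note about the data/ directory.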
A convenience script bash/plots.sh is provided to generate the main plots for the category9-nsplits-10-dynamic and category10-nsplits-10-dynamic experiments.
Prerequisite: You must have collected the experiment results into data/mlp-nsplits-10-dynamic.parquet and data/rf-nsplits-10-dynamic.parquet as described above.
To generate the plots:
bash bash/plots.sh
The plots will be saved in the plots/ directory.
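The prerequisite can be checked before invoking the script. The following is a minimal sketch using only the two file paths named in this README; `missing_inputs` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path

# Input files that bash/plots.sh expects, as listed in this README.
REQUIRED_INPUTS = (
    "data/mlp-nsplits-10-dynamic.parquet",
    "data/rf-nsplits-10-dynamic.parquet",
)


def missing_inputs(root: Path = Path(".")) -> list[str]:
    """Return the required Parquet files that do not exist under `root`."""
    return [rel for rel in REQUIRED_INPUTS if not (root / rel).is_file()]


missing = missing_inputs()
if missing:
    print("Run `python e1.py collect` first; missing:", ", ".join(missing))
else:
    print("All inputs found; run: bash bash/plots.sh")
```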
The raw output for each experiment set defined in e1.py is stored in a corresponding results-<experiment_category> directory. For example:
- Data for category9-nsplits-10-dynamic can be found in the results-category9/ directory.
- Data for category10-nsplits-10-dynamic can be found in the results-category10/ directory.
The collect command processes these raw files and aggregates them into the Parquet files (e.g., data/mlp-nsplits-10-dynamic.parquet) used for plotting.
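The naming pattern in the two examples above can be expressed as a small helper. Note the assumption: the rule "take the part of the experiment name before the first hyphen" is inferred from those two examples only, and e1.py may define the mapping differently. `results_dir` is a hypothetical name.

```python
from pathlib import Path


def results_dir(expname: str) -> Path:
    """Map an experiment name to its raw results directory.

    Assumption: the category is the text before the first hyphen,
    as in the two examples given in this README.
    """
    category = expname.split("-", 1)[0]
    return Path(f"results-{category}")


print(results_dir("category9-nsplits-10-dynamic"))   # results-category9
print(results_dir("category10-nsplits-10-dynamic"))  # results-category10
```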
The collected Parquet files (data/mlp-nsplits-10-dynamic.parquet and data/rf-nsplits-10-dynamic.parquet) are also available on Figshare: link
For more details on the original codebase structure, experiment definitions, and other functionalities, please refer to the README of the original repository.
This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.