Adaptive Early Stopping Cross-Validation in AutoML using Dynamic Threshold

This repository contains the code for the research project "Adaptive Early Stopping Cross-Validation in AutoML using Dynamic Threshold". This work is based on the codebase from the paper "Don't Waste Your Time: Early Stopping for Cross-Validation" and its accompanying GitHub repository.

Installation

To install the package, first clone or unpack the repository and cd into it. Then run the following commands in a virtual environment (Python 3.10 recommended):

pip install -r requirements.txt  # Install the pinned package versions
pip install -e .  # Install this package in editable mode

You can test that the installation was successful by running:

python e1.py --help

This will list the available commands.

Main Entry Point: `e1.py`

The primary script for interacting with this codebase is e1.py. It provides several commands to run experiments, collect results, and generate plots.

The core implementation of the early stopping methods, central to this project, can be found in src/exps/methods.py.

Available Commands

Below are the main commands. Run python e1.py <command> --help to see the options for each one:

1. `run`

Executes experiments locally.

Usage: python e1.py run --expname <experiment_name>
This command will run the specified set of experiments as defined in e1.py.

2. `submit`

Submits experiments to a SLURM cluster.

Usage: python e1.py submit --expname <experiment_name>
Useful for running large-scale experiments on a computing cluster.

Handling SLURM Job Array Limits

If your SLURM cluster has a limit on the maximum number of jobs that can be submitted in a single array (e.g., 1000 jobs), you can use the --job-array-limit argument. The script will submit the workload in chunks and provide instructions for submitting subsequent chunks.

Example:

python e1.py submit --expname <experiment_name> --job-array-limit 1000

If you have more than 1000 experiments, this command will submit the first 1000. It will then print a command that you can use to submit the next chunk, which will include the --chunk-start-idx argument. You'll need to repeat this process until all jobs are submitted.
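
To illustrate how a workload might be divided under a job array limit, here is a minimal sketch. The helper chunk_ranges is hypothetical and not part of e1.py. It also assumes that a chunk is identified by the zero-based index of its first job, which this README does not confirm; the command printed by submit is authoritative.

```python
def chunk_ranges(n_jobs: int, limit: int) -> list[tuple[int, int]]:
    """Split n_jobs indices into half-open (start, stop) chunks of at most `limit` jobs.

    Hypothetical helper for illustration only; not taken from e1.py.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [(start, min(start + limit, n_jobs)) for start in range(0, n_jobs, limit)]


# Hypothetical workload: 2500 experiments on a cluster that allows 1000 jobs per array.
for start, stop in chunk_ranges(2500, 1000):
    print(f"chunk starting at index {start}: jobs {start} to {stop - 1}")
```

With these assumed numbers, three submissions would be needed: two full chunks of 1000 jobs and a final chunk of 500.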

3. `status`

Checks the completion status of experiments.

Usage: python e1.py status --expname <experiment_name>
Helps in monitoring the progress of submitted or running experiments.

4. `collect`

Gathers results from completed experiments and saves them into a Parquet file. This step is crucial before plotting.

Usage: python e1.py collect --expname <experiment_name> --out <output_file.parquet>
The output Parquet file aggregates data from all individual experiment runs within the specified experiment set.

5. `plot` / `plot-stacked`

Generates visualizations from the collected experiment results.

Usage (plot): python e1.py plot --kind <plot_kind> --input <collected_data.parquet> [other_options]
Usage (plot-stacked): python e1.py plot-stacked --input <collected_data1.parquet> <collected_data2.parquet> [other_options]
These commands are used by the bash/plots.sh script for generating standard plots for the experiments.

Running Experiments

The core experiments for this project are:

category9-nsplits-10-dynamic: 10-fold Cross-Validation with MLP pipeline, including the "dynamic_adaptive_forgiving" early stopping strategy.
category10-nsplits-10-dynamic: 10-fold Cross-Validation with RF pipeline, including the "dynamic_adaptive_forgiving" early stopping strategy.

To run these experiments locally:

# For the MLP experiments (10-fold)
python e1.py run --expname category9-nsplits-10-dynamic

# For the RF experiments (10-fold)
python e1.py run --expname category10-nsplits-10-dynamic

If you have a SLURM cluster, you can use the submit command instead of run.

Collecting Experiment Results

After the experiments have finished, you need to collect their results. The bash/plots.sh script expects the collected data to be in specific files:

For category9-nsplits-10-dynamic (MLP results):

python e1.py collect --expname category9-nsplits-10-dynamic --out data/mlp-nsplits-10-dynamic.parquet

Ensure the data/ directory exists or is created.

For category10-nsplits-10-dynamic (RF results):

python e1.py collect --expname category10-nsplits-10-dynamic --out data/rf-nsplits-10-dynamic.parquet
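
If you prefer to script this step, the following sketch builds both collect commands and creates the output directory beforehand. The helper names collect_commands and ensure_output_dirs are hypothetical; only the experiment names, the flags, and the output paths come from this README.

```python
from pathlib import Path

# Experiment name -> output file, as listed in this README.
COLLECT_TARGETS = {
    "category9-nsplits-10-dynamic": "data/mlp-nsplits-10-dynamic.parquet",
    "category10-nsplits-10-dynamic": "data/rf-nsplits-10-dynamic.parquet",
}


def ensure_output_dirs(targets: dict[str, str], root: Path = Path(".")) -> None:
    """Create the parent directory of every output file (here: data/) under root."""
    for out in targets.values():
        (root / out).parent.mkdir(parents=True, exist_ok=True)


def collect_commands(targets: dict[str, str]) -> list[list[str]]:
    """Build one `python e1.py collect ...` argument list per experiment."""
    return [
        ["python", "e1.py", "collect", "--expname", expname, "--out", out]
        for expname, out in targets.items()
    ]


# Print the commands; this does not run them or touch the file system.
for cmd in collect_commands(COLLECT_TARGETS):
    print(" ".join(cmd))
```

To execute the commands from the repository root, call ensure_output_dirs(COLLECT_TARGETS) first and then pass each argument list to subprocess.run(cmd, check=True).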

Plotting Results with `plots.sh`

A convenience script bash/plots.sh is provided to generate the main plots for the category9-nsplits-10-dynamic and category10-nsplits-10-dynamic experiments.

Prerequisite: You must have collected the experiment results into data/mlp-nsplits-10-dynamic.parquet and data/rf-nsplits-10-dynamic.parquet as described above.
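
A quick way to verify this prerequisite is sketched below. The helper missing_inputs is hypothetical and not part of the repository; it only checks that the two files named above exist relative to a given root directory.

```python
from pathlib import Path

# Input files that this README says bash/plots.sh expects.
REQUIRED = (
    "data/mlp-nsplits-10-dynamic.parquet",
    "data/rf-nsplits-10-dynamic.parquet",
)


def missing_inputs(root: Path = Path(".")) -> list[str]:
    """Return the required files that do not exist under root."""
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


# Check relative to the current directory (assumed to be the repository root).
missing = missing_inputs()
if missing:
    print("Run the collect command first; missing:", ", ".join(missing))
else:
    print("All inputs present; you can run: bash bash/plots.sh")
```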

To generate the plots:

bash bash/plots.sh

The plots will be saved in the plots/ directory.

Experiment Data

The raw output for each experiment set defined in e1.py is stored in a corresponding results-<experiment_category> directory. For example:

Data for category9-nsplits-10-dynamic can be found in the results-category9/ directory.
Data for category10-nsplits-10-dynamic can be found in the results-category10/ directory.
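
The naming pattern in these two examples can be sketched as follows. The function results_dir is hypothetical, and the rule it encodes (take the part of the experiment name before the first hyphen) is inferred from the two examples only; the definitions in e1.py are authoritative.

```python
def results_dir(expname: str) -> str:
    """Guess the raw results directory for an experiment name.

    Assumption: the category is the text before the first hyphen,
    as in the two examples given in this README.
    """
    category = expname.split("-", 1)[0]
    return f"results-{category}"


for name in ("category9-nsplits-10-dynamic", "category10-nsplits-10-dynamic"):
    print(name, "->", results_dir(name) + "/")
```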

The collect command processes these raw files and aggregates them into the Parquet files (e.g., data/mlp-nsplits-10-dynamic.parquet) used for plotting.

The collected Parquet files (data/mlp-nsplits-10-dynamic.parquet and data/rf-nsplits-10-dynamic.parquet) are also available on Figshare: link

Further Information

For more details on the original codebase structure, experiment definitions, and other functionalities, please refer to the README of the original repository.

License

This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.
