[Question] How to best use AutoSklearn in a nested cross validation setting? #1570

@verakye

Short Question Description

What is the suggested way of using AutoSklearn in a nested CV setting, for example in combination with scikit-learn's cross_validate?

Further details

The general setup/idea/goal:

AutoSklearn in a nested CV setup (as the inner CV), using Dask on a Slurm cluster for parallelisation and scikit-learn's cross_validate for the outer CV.

The problem:

Combining these three aspects raises TypeError: cannot pickle '_asyncio.Task' object. The error does not appear when AutoSklearn is fitted directly with fit() on a test data set instead of being embedded in the nested CV scenario.

A code snippet that should lead to the problem:

from dask_jobqueue import SLURMCluster
from dask.distributed import Client
import time
import sys
import logging
from sklearn import model_selection
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, cross_validate
from autosklearn.regression import AutoSklearnRegressor
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)

# input arguments
cluster_jobs = 1
n_samples = 10000
t_ASL = 5  # in minutes
n_repeats_outer = 2
n_jobs_outer = 1

# Dask cluster and client
cluster = SLURMCluster(nanny=True)  # other specifications in config file
cluster.scale(jobs=cluster_jobs)
client = Client(cluster)

# make fake regression data
X, y, true_weights = make_regression(
    n_samples=n_samples,
    n_features=1300,
    n_informative=400,
    noise=8,
    coef=True,
    random_state=0,
)

# further definitions
random_state = 43

# inner CV
cv_inner_folds = 2

# initialise ASL regressor
auto_model = AutoSklearnRegressor(
    time_left_for_this_task=t_ASL*60,  # budget per fit; total runtime scales with the no. of outer folds
    memory_limit=400000,
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": cv_inner_folds},
    metric=None,  # None: Use default of each algorithm (not mean_squared_error)
    n_jobs=1,  # n_jobs is ignored when passing a dask client
    initial_configurations_via_metalearning=0,  # avoid config. warnings
    dask_client=client,
    tmp_folder=(
        '/p/project/comanukb/vkomeyer/motorpred/ukb/code/Dask_ASL_minexpl/'
        'logs_SlurmCluster/auto_sklearn_log1')
)

# outer CV
outer_cv = ShuffleSplit(
    n_splits=n_repeats_outer, random_state=random_state,
    train_size=0.7)
scoring_outer = ['r2']

# Model fitting
if __name__ == "__main__":
    score = cross_validate(
        estimator=auto_model,  # the regressor defined above
        X=X, y=y,
        cv=outer_cv,
        scoring=scoring_outer,
        return_estimator=True,
        n_jobs=n_jobs_outer)

The resulting error:

TypeError: cannot pickle '_asyncio.Task' object, originating from cross_validate() (I can provide the entire traceback if that would be helpful).
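One possible explanation, stated as an assumption rather than a confirmed diagnosis: cross_validate clones the estimator for every outer fold, and cloning deep-copies constructor parameters that are not estimators themselves. A live Dask client holds event-loop state that cannot be copied, which would surface as exactly this kind of TypeError. The sketch below reproduces the mechanism with the standard library only. FakeClient, ToyEstimator and toy_clone are hypothetical stand-ins, and a threading.Lock plays the role of the un-copyable asyncio.Task.

```python
import copy
import threading


class FakeClient:
    """Hypothetical stand-in for a live Dask client holding an un-copyable handle."""

    def __init__(self):
        # Like an asyncio.Task, a lock can be neither pickled nor deep-copied.
        self._lock = threading.Lock()


class ToyEstimator:
    """Hypothetical estimator that stores its constructor arguments as attributes."""

    def __init__(self, n_jobs=1, dask_client=None):
        self.n_jobs = n_jobs
        self.dask_client = dask_client

    def get_params(self):
        return {"n_jobs": self.n_jobs, "dask_client": self.dask_client}


def toy_clone(estimator):
    """Rebuild the estimator from deep copies of its parameters.

    This mimics the assumed behaviour of cloning; it is not scikit-learn's code.
    """
    params = {name: copy.deepcopy(value)
              for name, value in estimator.get_params().items()}
    return type(estimator)(**params)


def try_clone(estimator):
    """Return 'ok' if cloning succeeds, otherwise the TypeError message."""
    try:
        toy_clone(estimator)
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_clone(ToyEstimator()))                          # no client: cloning works
print(try_clone(ToyEstimator(dask_client=FakeClient())))  # client attached: cloning fails
```

If this is indeed the cause, it would also be consistent with a direct fit() working, since fit() on an existing object involves no cloning.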

Further info

The error occurs neither when the nested cross-validation setup is replaced by a simple train/test split and auto_model.fit(X_train, y_train), nor when cross_validate is used without AutoSklearn, e.g. with scikit-learn's GridSearchCV (comparison code can be provided if that would be helpful).
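Because a plain fit() works, one workaround worth trying is to write the outer loop by hand and construct a fresh estimator per fold, so that nothing holding the client is ever cloned or pickled. The following is a minimal, untested-against-AutoSklearn sketch of that pattern using only the standard library. MeanRegressor, shuffle_splits, r2 and manual_outer_cv are hypothetical names; in the real setup the factory would presumably build an AutoSklearnRegressor with the shared dask_client, and the splits would come from outer_cv.split(X, y).

```python
import random
from statistics import mean


class MeanRegressor:
    """Hypothetical toy model standing in for AutoSklearnRegressor."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def shuffle_splits(n_samples, n_splits, train_size, seed):
    """Yield (train_idx, test_idx) pairs, similar in spirit to ShuffleSplit."""
    rng = random.Random(seed)
    n_train = int(train_size * n_samples)
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]


def r2(y_true, y_pred):
    """Coefficient of determination."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def manual_outer_cv(make_estimator, X, y, splits):
    """Fit a freshly constructed estimator on every outer fold (no cloning)."""
    scores, estimators = [], []
    for train_idx, test_idx in splits:
        est = make_estimator()  # the factory may close over a shared client
        est.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_pred = est.predict([X[i] for i in test_idx])
        scores.append(r2([y[i] for i in test_idx], y_pred))
        estimators.append(est)
    return {"test_r2": scores, "estimator": estimators}


# Tiny synthetic data set: y = 2x + 1
X = [[float(i)] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]
result = manual_outer_cv(MeanRegressor, X, y,
                         shuffle_splits(len(X), n_splits=2, train_size=0.7, seed=43))
print(result["test_r2"])
```

The returned dictionary loosely mirrors what cross_validate gives back with return_estimator=True, so downstream code would need little change. Whether this actually avoids the pickling error with AutoSklearn and Dask remains to be confirmed.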

Thank you very much for any hints to solve this!
