Description
Short Question Description
What is the suggested way of using AutoSklearn in a nested CV setting, for example in combination with scikit-learn's cross_validate?
Further details
The general setup/idea/goal:
AutoSklearn in a nested CV setup (inner CV), using Dask on a Slurm cluster for parallelisation and scikit-learn's cross_validate for the outer CV.
The problem:
Combining these three aspects leads to the error TypeError: cannot pickle '_asyncio.Task' object. The error does not appear when AutoSklearn is used with a plain fit() on a test data set instead of being embedded in the nested CV scenario.
A code snippet that should lead to the problem:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
import time
import sys
import logging
from sklearn import model_selection
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, cross_validate
from autosklearn.regression import AutoSklearnRegressor
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
# input arguments
cluster_jobs = 1
n_samples = 10000
t_ASL = 5 # in minutes
n_repeats_outer = 2
n_jobs_outer = 1
# Dask cluster and client
cluster = SLURMCluster(nanny=True) # other specifications in config file
cluster.scale(jobs=cluster_jobs)
client = Client(cluster)
# make fake regression data
X, y, true_weights = make_regression(
    n_samples=n_samples,
    n_features=1300,
    n_informative=400,
    noise=8,
    coef=True,
    random_state=0,
)
# further definitions
random_state = 43
# inner CV
cv_inner_folds = 2
# initialise ASL regressor
auto_model = AutoSklearnRegressor(
    time_left_for_this_task=t_ASL*60,  # per fit; total time multiplies by no. of outer folds
    memory_limit=400000,
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": cv_inner_folds},
    metric=None,  # None: use default of each algorithm (not mean_squared_error)
    n_jobs=1,  # n_jobs is ignored when passing a dask client
    initial_configurations_via_metalearning=0,  # avoid config. warnings
    dask_client=client,
    tmp_folder=(
        '/p/project/comanukb/vkomeyer/motorpred/ukb/code/Dask_ASL_minexpl/'
        'logs_SlurmCluster/auto_sklearn_log1')
)
# outer CV
outer_cv = ShuffleSplit(
    n_splits=n_repeats_outer, random_state=random_state,
    train_size=0.7)
scoring_outer = ['r2']
# Model fitting
if __name__ == "__main__":
    score = cross_validate(
        estimator=auto_model,  # the AutoSklearnRegressor defined above
        X=X, y=y,
        cv=outer_cv,
        scoring=scoring_outer,
        return_estimator=True,
        n_jobs=n_jobs_outer)
The resulting error:
TypeError: cannot pickle '_asyncio.Task' object
originating from cross_validate()
(I can provide the entire traceback if that would be helpful.)
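To show what this error message means in isolation, here is a minimal sketch that uses only the Python standard library. It relies on an assumption that the traceback above does not confirm: that cross_validate copies the estimator before each outer fit, and that the copy reaches a live asyncio task held by the Dask client. The names main and errors are made up for this illustration and are not part of AutoSklearn, Dask, or scikit-learn.

```python
import asyncio
import copy
import pickle


async def main():
    # Create a live asyncio task, standing in (hypothetically) for the
    # tasks that an asynchronous client keeps while it is connected.
    task = asyncio.ensure_future(asyncio.sleep(0))

    errors = []
    # Both pickling and deep-copying go through the same reduce protocol,
    # so both are expected to be refused for a task object.
    for name, fn in (("pickle.dumps", pickle.dumps),
                     ("copy.deepcopy", copy.deepcopy)):
        try:
            fn(task)
        except TypeError as exc:
            errors.append((name, str(exc)))

    await task  # let the task finish cleanly before the loop closes
    return errors


errors = asyncio.run(main())
for name, message in errors:
    print(f"{name}: {message}")
```

If the assumption holds, any object that holds such a task, directly or through nested attributes, would fail in the same way when it is copied or sent to another process, which would fit the observation that a direct fit() call works.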
Further info
The error occurs neither when the nested cross-validation is replaced by a simple train/test split and auto_model.fit(X_train, y_train), nor when cross_validate is used without AutoSklearn, e.g. with scikit-learn's GridSearchCV (comparison code can be provided if that would be helpful).
Thank you very much for any hints to solve this!