
cross_validate doesn't work with LightGBM v4.0.0 #112

Open
@yuta100101

Description

Thanks for publishing such a useful tool!

A few days ago, LightGBM's new version 4.0.0 was released.
In this release, the early_stopping_rounds argument was removed from fit().

As a result, cross_validate() and the functions built on it, such as run_experiment, no longer work.
(There may be other functions that are broken; I haven't investigated yet.)

Of course, there is no problem with versions up to 3.3.5.
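
For reference, here is a minimal sketch of how the same early stopping is expressed under LightGBM 4.0.0: the removed fit() keyword is replaced by the early_stopping callback (available since 3.3). The data setup mirrors the failing test below; the variable names are mine.

    import lightgbm as lgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1024, n_features=20, class_sep=0.98, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

    model = lgb.LGBMClassifier(n_estimators=300)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        # replaces fit(..., early_stopping_rounds=200) from LightGBM <= 3.3.5
        callbacks=[lgb.early_stopping(stopping_rounds=200)],
    )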

pytest log
(nyaggle) yuta100101:~/nyaggle(master =)$ pytest tests/validation/test_cross_validate.py::test_cv_lgbm
========================================================================================== test session starts ===========================================================================================
platform linux -- Python 3.9.17, pytest-7.4.0, pluggy-1.2.0
rootdir: /home/yuta100101/practice/nyaggle
collected 1 item                                                                                                                                                                                         

tests/validation/test_cross_validate.py F                                                                                                                                                          [100%]

================================================================================================ FAILURES ================================================================================================
______________________________________________________________________________________________ test_cv_lgbm ______________________________________________________________________________________________

    def test_cv_lgbm():
        X, y = make_classification(n_samples=1024, n_features=20, class_sep=0.98, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
    
        models = [LGBMClassifier(n_estimators=300) for _ in range(5)]
    
>       pred_oof, pred_test, scores, importance = cross_validate(models, X_train, y_train, X_test, cv=5,
                                                                 eval_func=roc_auc_score,
                                                                 fit_params={'early_stopping_rounds': 200})

tests/validation/test_cross_validate.py:52: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

estimator = [LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300)]
X_train =            0         1         2         3         4         5         6         7         8   ...        11        12... ... -0.109782 -0.412230  1.707714 -0.240937 -0.276747  0.481276 -0.278111  1.304773 -0.139538

[512 rows x 20 columns]
y = 0      0
1      0
2      0
3      1
4      0
      ..
507    0
508    1
509    0
510    1
511    0
Name: target, Length: 512, dtype: int64
X_test =            0         1         2         3         4         5         6         7         8   ...        11        12... ... -2.598922 -0.351561  0.233836 -1.873634 -1.089221  0.373956 -0.520939 -0.489945  2.452996

[512 rows x 20 columns]
cv = KFold(n_splits=5, random_state=0, shuffle=True), groups = None, eval_func = <function roc_auc_score at 0x7fe910196ee0>, logger = <Logger nyaggle.validation.cross_validate (WARNING)>
on_each_fold = None, fit_params = {'early_stopping_rounds': 200}, importance_type = 'gain', early_stopping = True, type_of_target = 'binary'

    def cross_validate(estimator: Union[BaseEstimator, List[BaseEstimator]],
                       X_train: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray],
                       X_test: Union[pd.DataFrame, np.ndarray] = None,
                       cv: Optional[Union[int, Iterable, BaseCrossValidator]] = None,
                       groups: Optional[pd.Series] = None,
                       eval_func: Optional[Callable] = None, logger: Optional[Logger] = None,
                       on_each_fold: Optional[Callable[[int, BaseEstimator, pd.DataFrame, pd.Series], None]] = None,
                       fit_params: Optional[Union[Dict[str, Any], Callable]] = None,
                       importance_type: str = 'gain',
                       early_stopping: bool = True,
                       type_of_target: str = 'auto') -> CVResult:
        """
        Evaluate metrics by cross-validation. It also records out-of-fold prediction and test prediction.
    
        Args:
            estimator:
                The object to be used in cross-validation. For list inputs, ``estimator[i]`` is trained on i-th fold.
            X_train:
                Training data
            y:
                Target
            X_test:
                Test data (Optional). If specified, prediction on the test data is performed using ensemble of models.
            cv:
                int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.
    
                - None, to use the default ``KFold(5, random_state=0, shuffle=True)``,
                - integer, to specify the number of folds in a ``(Stratified)KFold``,
                - CV splitter (the instance of ``BaseCrossValidator``),
                - An iterable yielding (train, test) splits as arrays of indices.
            groups:
                Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., ``GroupKFold``).
            eval_func:
                Function used for logging and returning scores
            logger:
                logger
            on_each_fold:
                called for each fold with (idx_fold, model, X_fold, y_fold)
            fit_params:
                Parameters passed to the fit method of the estimator
            importance_type:
                The type of feature importance to be used to calculate result.
                Used only in ``LGBMClassifier`` and ``LGBMRegressor``.
            early_stopping:
                If ``True``, ``eval_set`` will be added to ``fit_params`` for each fold.
                ``early_stopping_rounds = 100`` will also be appended to fit_params if it does not already have one.
            type_of_target:
                The type of target variable. If ``auto``, type is inferred by ``sklearn.utils.multiclass.type_of_target``.
                Otherwise, ``binary``, ``continuous``, or ``multiclass`` are supported.
        Returns:
            Namedtuple with following members
    
            * oof_prediction (numpy array, shape (len(X_train),)):
                The predicted value on out-of-fold validation data.
            * test_prediction (numpy array, shape (len(X_test),)):
                The predicted value on test data. ``None`` if X_test is ``None``.
            * scores (list of float, shape (nfolds+1,)):
                ``scores[i]`` denotes validation score in i-th fold.
                ``scores[-1]`` is the overall score. `None` if eval is not specified.
            * importance (list of pandas DataFrame, shape (nfolds,)):
                ``importance[i]`` denotes feature importance in i-th fold model.
                If the estimator is not GBDT, empty array is returned.
    
        Example:
            >>> from sklearn.datasets import make_regression
            >>> from sklearn.linear_model import Ridge
            >>> from sklearn.metrics import mean_squared_error
            >>> from nyaggle.validation import cross_validate
    
            >>> X, y = make_regression(n_samples=8)
            >>> model = Ridge(alpha=1.0)
            >>> pred_oof, pred_test, scores, _ = \
            >>>     cross_validate(model,
            >>>                    X_train=X[:3, :],
            >>>                    y=y[:3],
            >>>                    X_test=X[3:, :],
            >>>                    cv=3,
            >>>                    eval_func=mean_squared_error)
            >>> print(pred_oof)
            [-101.1123267 ,   26.79300693,   17.72635528]
            >>> print(pred_test)
            [-10.65095894 -12.18909059 -23.09906427 -17.68360714 -20.08218267]
            >>> print(scores)
            [71912.80290003832, 15236.680239881942, 15472.822033121925, 34207.43505768073]
        """
        cv = check_cv(cv, y)
        n_output_cols = 1
        if type_of_target == 'auto':
            type_of_target = multiclass.type_of_target(y)
        if type_of_target == 'multiclass':
            n_output_cols = y.nunique(dropna=True)
    
        if isinstance(estimator, list):
            assert len(estimator) == cv.get_n_splits(), "Number of estimators should be same to nfolds."
    
        X_train = convert_input(X_train)
        y = convert_input_vector(y, X_train.index)
        if X_test is not None:
            X_test = convert_input(X_test)
    
        if not isinstance(estimator, list):
            estimator = [estimator] * cv.get_n_splits()
    
        assert len(estimator) == cv.get_n_splits()
    
        if logger is None:
            logger = getLogger(__name__)
    
        def _predict(model: BaseEstimator, x: pd.DataFrame, _type_of_target: str):
            if _type_of_target in ('binary', 'multiclass'):
                if hasattr(model, "predict_proba"):
                    proba = model.predict_proba(x)
                elif hasattr(model, "decision_function"):
                    warnings.warn('Since {} does not have predict_proba method, '
                                  'decision_function is used for the prediction instead.'.format(type(model)))
                    proba = model.decision_function(x)
                else:
                    raise RuntimeError('Estimator in classification problem should have '
                                       'either predict_proba or decision_function')
                if proba.ndim == 1:
                    return proba
                else:
                    return proba[:, 1] if proba.shape[1] == 2 else proba
            else:
                return model.predict(x)
    
        oof = np.zeros((len(X_train), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_train))
        evaluated = np.full(len(X_train), False)
        test = None
        if X_test is not None:
            test = np.zeros((len(X_test), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_test))
    
        scores = []
        eta_all = []
        importance = []
    
        for n, (train_idx, valid_idx) in enumerate(cv.split(X_train, y, groups)):
            start_time = time.time()
    
            train_x, train_y = X_train.iloc[train_idx], y.iloc[train_idx]
            valid_x, valid_y = X_train.iloc[valid_idx], y.iloc[valid_idx]
    
            if fit_params is None:
                fit_params_fold = {}
            elif callable(fit_params):
                fit_params_fold = fit_params(n, train_idx, valid_idx)
            else:
                fit_params_fold = copy.copy(fit_params)
    
            if is_gbdt_instance(estimator[n], ('lgbm', 'cat', 'xgb')):
                if early_stopping:
                    if 'eval_set' not in fit_params_fold:
                        fit_params_fold['eval_set'] = [(valid_x, valid_y)]
                    if 'early_stopping_rounds' not in fit_params_fold:
                        fit_params_fold['early_stopping_rounds'] = 100
    
>               estimator[n].fit(train_x, train_y, **fit_params_fold)
E               TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'

nyaggle/validation/cross_validate.py:177: TypeError
======================================================================================== short test summary info =========================================================================================
FAILED tests/validation/test_cross_validate.py::test_cv_lgbm - TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'
=========================================================================================== 1 failed in 1.90s ============================================================================================
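
One possible direction for a fix (a sketch only, not a tested patch): before calling fit(), cross_validate could translate early_stopping_rounds into the callback form when LightGBM 4 or later is installed. The helper name adapt_lgbm_fit_params below is hypothetical.

    import lightgbm as lgb
    from packaging.version import Version

    def adapt_lgbm_fit_params(fit_params_fold: dict) -> dict:
        # Hypothetical helper: on LightGBM >= 4.0.0, move 'early_stopping_rounds'
        # out of the fit() kwargs and into an early_stopping callback, which is
        # the supported mechanism in v4.
        if Version(lgb.__version__) >= Version("4.0.0"):
            rounds = fit_params_fold.pop('early_stopping_rounds', None)
            if rounds is not None:
                callbacks = list(fit_params_fold.get('callbacks', []))
                callbacks.append(lgb.early_stopping(stopping_rounds=rounds))
                fit_params_fold['callbacks'] = callbacks
        return fit_params_fold

In the meantime, a user-side workaround should be to call cross_validate with early_stopping=False and a callable fit_params that supplies eval_set and the callback per fold, since the traceback shows that early_stopping_rounds=100 is only injected when early_stopping is True.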

