Reproducibility bug/feature ? #31

@w1ll1a9m

Describe the bug
Sampling predictions for an array of identical instances produces different samples for each instance, even though the point predictions are identical.

To Reproduce
Steps to reproduce the behavior:

Run this script, adapted from the basic example:

from pgbm.sklearn import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np

X, y = fetch_california_housing(return_X_y=True)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
x_test_single = X_test[0:1,:]
x_test_dup = np.tile(x_test_single, (10, 1))

# Train pgbm on the training set
model = HistGradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
# Point and probabilistic predictions. By default, 1 probabilistic estimate is created, so we request 1000
yhat_point, yhat_point_std = model.predict(x_test_dup, return_std=True)
yhat_dist = model.sample(yhat_point, yhat_point_std, n_estimates=1000, random_state=1)

In this case I create the x_test_dup array, which contains the same instance duplicated 10 times.
yhat_point and yhat_point_std are then arrays holding the same value 10 times.
However, yhat_dist contains a different set of samples for each of the 10 instances.

TL;DR: the samples differ across instances that share the same yhat_point and yhat_point_std.

I believe this happens because the seed is changed for every instance that is sampled, in:

/pgbm/sklearn/distributions

for j in prange(n_samples):
    np.random.seed(seed + j)
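
To illustrate the suspected mechanism without depending on PGBM itself, here is a minimal NumPy-only sketch. The two functions are hypothetical stand-ins, not PGBM's code: they assume a normal distribution and an output of shape (n_estimates, n_samples), and they differ only in whether instance j is seeded with seed + j or with seed.

```python
import numpy as np


def sample_per_row_seed(mean, std, n_estimates, seed):
    """Hypothetical stand-in: instance j draws from a generator seeded with seed + j."""
    n_samples = mean.shape[0]
    out = np.empty((n_estimates, n_samples))
    for j in range(n_samples):
        rng = np.random.RandomState(seed + j)  # a different stream per instance
        out[:, j] = rng.normal(mean[j], std[j], size=n_estimates)
    return out


def sample_shared_seed(mean, std, n_estimates, seed):
    """Hypothetical alternative: every instance draws from a generator seeded with seed."""
    n_samples = mean.shape[0]
    out = np.empty((n_estimates, n_samples))
    for j in range(n_samples):
        rng = np.random.RandomState(seed)  # the same stream for every instance
        out[:, j] = rng.normal(mean[j], std[j], size=n_estimates)
    return out


# Ten identical instances: same mean and same standard deviation (made-up values).
mean = np.full(10, 2.0)
std = np.full(10, 0.5)

per_row = sample_per_row_seed(mean, std, n_estimates=1000, seed=1)
shared = sample_shared_seed(mean, std, n_estimates=1000, seed=1)

# Compare the samples of instance 0 with the samples of instance 1.
print("per-row seed, columns equal:", np.allclose(per_row[:, 0], per_row[:, 1]))
print("shared seed, columns equal:", np.allclose(shared[:, 0], shared[:, 1]))
```

With the per-instance seed, identical inputs get different draws, which matches what I see in yhat_dist. With a shared seed the columns are identical, but every instance's samples would then be perfectly correlated, which may be the reason for the current design.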

Why would we want this behavior?

Expected behavior
I would expect identical predictions, including identical samples, for instances in an array that share the same input features.

Additional context
I am using the latest PGBM version on Python 3.11.
