Reproducibility bug/feature ? #31

Open
w1ll1a9m opened this issue Dec 19, 2024 · 0 comments

Describe the bug
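Sampling with model.sample for an array of identical instances produces different samples for each instance, even though the point predictions and standard deviations are identical.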

To Reproduce
Steps to reproduce the behavior:

Run this snippet, adapted from the basic example:

from pgbm.sklearn import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import numpy as np

X, y = fetch_california_housing(return_X_y=True)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
# Take a single test instance and duplicate it 10 times
x_test_single = X_test[0:1,:]
x_test_dup = np.tile(x_test_single, (10, 1))

# Train PGBM on the training set
model = HistGradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
# Point and probabilistic predictions. By default, 1 probabilistic estimate is created, so we create 1000
yhat_point, yhat_point_std = model.predict(x_test_dup, return_std=True)
yhat_dist = model.sample(yhat_point, yhat_point_std, n_estimates=1000, random_state=1)

Here, x_test_dup contains the same instance duplicated 10 times, so yhat_point and yhat_point_std each hold the same value repeated 10 times.
However, yhat_dist contains 10 different sets of samples, one per row.

TLDR: the samples differ across rows even though every row has the same yhat_point and yhat_point_std.

I believe this happens because the seed is changed for each sample (row) in:

/pgbm/sklearn/distributions

for j in prange(n_samples):
        np.random.seed(seed + j)

Why would we want this behavior?
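To illustrate the suspected mechanism, here is a minimal NumPy-only sketch. It does not use PGBM's code: the two functions are hypothetical stand-ins, np.random.default_rng is swapped in for np.random.seed, a normal distribution is assumed, and the (n_estimates, n_rows) output shape is an assumption as well.

```python
import numpy as np


def sample_row_offset_seed(mean, std, n_estimates, seed):
    """Hypothetical stand-in: row j is drawn from a generator seeded with seed + j."""
    out = np.empty((n_estimates, mean.shape[0]))
    for j in range(mean.shape[0]):
        rng = np.random.default_rng(seed + j)  # seed differs per row
        out[:, j] = rng.normal(mean[j], std[j], size=n_estimates)
    return out


def sample_same_seed(mean, std, n_estimates, seed):
    """Hypothetical alternative: every row is drawn from a generator seeded with seed."""
    out = np.empty((n_estimates, mean.shape[0]))
    for j in range(mean.shape[0]):
        rng = np.random.default_rng(seed)  # same seed for every row
        out[:, j] = rng.normal(mean[j], std[j], size=n_estimates)
    return out


# Ten identical rows: same mean and same standard deviation.
mean = np.full(10, 2.0)
std = np.full(10, 0.5)

offset = sample_row_offset_seed(mean, std, n_estimates=1000, seed=1)
same = sample_same_seed(mean, std, n_estimates=1000, seed=1)

# With a row-dependent seed, identical rows get different draws.
print(np.array_equal(offset[:, 0], offset[:, 1]))
# With a shared seed, identical rows get identical draws.
print(np.array_equal(same[:, 0], same[:, 1]))
```

If the real sampler behaves like the first function, that would explain why identical rows end up with different samples.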

Expected behavior
I would expect identical samples for instances that have the same input features within one array.
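As a possible workaround, the samples could be drawn outside of model.sample. The sketch below assumes a normal predictive distribution and an output of shape (n_estimates, n_rows). The helper sample_normal_shared_noise is hypothetical and not part of PGBM. It draws one vector of standard normal noise and reuses it for every row, so rows with the same mean and standard deviation get the same samples.

```python
import numpy as np


def sample_normal_shared_noise(yhat_point, yhat_point_std, n_estimates=1000, random_state=1):
    """Hypothetical helper: reuse one noise vector across all rows (normal assumption)."""
    rng = np.random.default_rng(random_state)
    z = rng.standard_normal(n_estimates)  # one noise draw per estimate, shared by all rows
    # Broadcast to shape (n_estimates, n_rows): mean + std * noise
    return yhat_point[None, :] + yhat_point_std[None, :] * z[:, None]


# Stand-in values for the duplicated predictions from the snippet above.
yhat_point_demo = np.full(10, 2.0)
yhat_point_std_demo = np.full(10, 0.5)

yhat_dist_demo = sample_normal_shared_noise(yhat_point_demo, yhat_point_std_demo)
print(yhat_dist_demo.shape)
print(np.array_equal(yhat_dist_demo[:, 0], yhat_dist_demo[:, 9]))
```

Note that sharing the noise makes the samples perfectly correlated across rows, which may not be wanted when samples from different rows are aggregated.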

Additional context
I am using the latest PGBM version on Python 3.11.
