perf_psi fails because of score binning and missing samples #117

Open
Migalvao opened this issue Jan 21, 2025 · 0 comments

When calling perf_psi() with show_plot=True, I got this error:

IndexError: single positional indexer is out-of-bounds
on distr_prob.distr.iloc[:,1] in line 532:

       530 # ax1
       531 p1 = ax1.bar(ind, distr_prob.distr.iloc[:,0], width, color=(24/254, 192/254, 196/254), alpha=0.6)
-->    532 p2 = ax1.bar(ind+width, distr_prob.distr.iloc[:,1], width, color=(246/254, 115/254, 109/254), alpha=0.6)
       533 # ax2
       534 p3 = ax2.plot(ind+width/2, distr_prob.badprob.iloc[:,0], color=(24/254, 192/254, 196/254))

Right before that, there was also a warning about a division by zero (RuntimeWarning: divide by zero encountered in log).

Printing the return value with return_distr_dat=True shows the binning created by this function:

{'psi':   
    variable  PSI
0    score  inf, 
'pic': {}, 
'dat': {'score':          
         bin       N           badprob          
ae               test   train      test     train
0   [300,350)     NaN     1.0       NaN  0.000000
1   [350,400)  6257.0  6216.0  0.560332  0.563224
2   [400,500)  2733.0  2775.0  0.361873  0.358559
}}
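For context, the `inf` PSI and the log warning are consistent with the textbook PSI definition, sum((actual - expected) * ln(actual / expected)), when one set has an empty bin. The sketch below is an illustration only: `psi_from_counts` is a hypothetical helper, not the library's implementation, and it treats the missing test count in the first bin as 0.

```python
import numpy as np

def psi_from_counts(expected_counts, actual_counts):
    """Textbook PSI from per-bin counts (hypothetical helper, not library code)."""
    e = np.asarray(expected_counts, dtype=float)
    a = np.asarray(actual_counts, dtype=float)
    e_share = e / e.sum()
    a_share = a / a.sum()
    # np.log(0) emits "RuntimeWarning: divide by zero encountered in log";
    # it is silenced here only to keep the demonstration output clean.
    with np.errstate(divide="ignore"):
        return float(np.sum((a_share - e_share) * np.log(a_share / e_share)))

# Counts taken from the table above; the NaN for test in [300,350) is assumed to mean 0.
train = [1, 6216, 2775]
test = [0, 6257, 2733]

psi_value = psi_from_counts(train, test)
print(psi_value)
```

The empty bin contributes (0 - e) * ln(0), which is a negative number times negative infinity, so the sum becomes positive infinity.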

The first bin contains only one sample for the training set and none for the test set, which is probably what causes the error. In the following part of the code, the table is pivoted in line 518; if one of the sets has no samples in a given bin, there is no row to pivot for it.

511        distr_prob = dat.groupby(['ae', 'bin'])\
512          ['y'].agg([good, bad])\
513          .assign(N=lambda x: x.good+x.bad,
514            badprob=lambda x: x.bad/(x.good+x.bad)
515          ).reset_index()
516        distr_prob.loc[:,'distr'] = distr_prob.groupby('ae')['N'].transform(lambda x:x/sum(x))
517        # pivot table
518        distr_prob = distr_prob.pivot_table(values=['N','badprob', 'distr'], index='bin', columns='ae')

Therefore, there will only be one column, leading to the indexer being out-of-bounds.

If I understood this correctly, I suggest one of the following: adding an empty record for any bin that has no samples, adjusting the bins so that every bin always contains samples, or perhaps allowing custom bins to be used.
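The first suggestion could look roughly like the sketch below, which reindexes the pivoted table so that every bin row and every (value, ae) column exists before plotting. This is a minimal illustration on hand-built data, not a patch against the library: the `counts` frame, `all_bins`, and `all_cols` are assumptions made up for the example, and the column names merely mirror the ones in the excerpt above.

```python
import pandas as pd

# Long-format counts per (ae, bin); the test set has no row for [300,350).
counts = pd.DataFrame({
    "ae":  ["train", "train", "train", "test", "test"],
    "bin": ["[300,350)", "[350,400)", "[400,500)", "[350,400)", "[400,500)"],
    "N":   [1, 6216, 2775, 6257, 2733],
})
# Share of each bin within its own set (train or test).
counts["distr"] = counts["N"] / counts.groupby("ae")["N"].transform("sum")

wide = counts.pivot_table(values=["N", "distr"], index="bin", columns="ae")

# Force every bin row and every (value, ae) column to exist, then fill gaps with 0.
all_bins = ["[300,350)", "[350,400)", "[400,500)"]
all_cols = pd.MultiIndex.from_product(
    [["N", "distr"], ["test", "train"]], names=[None, "ae"]
)
wide = wide.reindex(index=all_bins, columns=all_cols).fillna(0)

# Both positional columns are now guaranteed to be present.
test_distr = wide["distr"].iloc[:, 0]
train_distr = wide["distr"].iloc[:, 1]
print(wide)
```

Filling with 0 only makes the positional indexing safe. A zero share would still give an infinite PSI, so a small floor value or merging sparse bins would probably be needed as well.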

Otherwise, I would appreciate your support in this matter. Thank you!
