
wip #5

Open · wants to merge 1 commit into base: 8057

Conversation


@massich (Owner) commented Sep 14, 2017

This PR addresses this concern raised by @glemaitre.

The main concern is that, in order to run a check on a sparse matrix, the matrix needs to be materialized, and @glemaitre is wondering whether the check could be computed directly from the sparse representation. The underlying computation is this one:

mode = np.array([stats.mode(pl)[0] for pl in pred_labels], dtype=np.int)

which, for the sparse case, could be replaced by something like this:

import numpy as np
from scipy import stats

def get_sparse_matrix_mode(x_sparse):
    # Mode of the stored values, unless the implicit zeros outnumber it.
    mode, occ = stats.mode(x_sparse.data)
    n_of_zero_values = np.prod(x_sparse.shape) - x_sparse.nnz
    return mode[0] if occ[0] > n_of_zero_values else 0

mode = np.array([get_sparse_matrix_mode(pl) for pl in pred_labels])
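For illustration, here is that helper on a toy CSR row (my own example, not from the PR): three implicit zeros outvote the two stored 2s, so the mode is 0.

import numpy as np
from scipy import sparse

pl = sparse.csr_matrix(np.array([[0, 0, 0, 2, 2]]))
get_sparse_matrix_mode(pl)   # -> 0: three implicit zeros beat two stored 2s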

The problem arises when the instances of 0 need to be weighted, which is the case here. Any feedback is welcome.
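One possible way to fold the weight of the implicit zeros into the comparison, sketched here under my own assumptions (non-negative integer labels, a 1 x n CSR row with explicit zeros eliminated, and a dense weight vector aligned with the columns); this is a sketch, not the PR's actual proposal:

import numpy as np

def sparse_weighted_mode(x_row, w):
    # Weight mass sitting on the implicit zeros
    # (assumes explicit zeros were eliminated from x_row):
    zero_weight = w.sum() - w[x_row.indices].sum()
    # Total weight accumulated by each stored label value:
    scores = np.bincount(x_row.data.astype(int), weights=w[x_row.indices],
                         minlength=1)
    best = scores.argmax()
    return best if scores[best] > zero_weight else 0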

Some notes that I don't know how to include in the discussion, but that are important when making a decision (listed in no particular order):
1 - The case of sparse matrix plus weights is never tested: see this breakpoint, and Travis is still all green.

2 - There are two different signatures of self._mode: KNeighborsClassifier._mode(self, neigh, weights) and RadiusNeighborsClassifier._mode(self, pred_labels, weights, inliers). Moreover, neither of them uses self internally. So shouldn't we unify the signature and make it a free function? Or, if it really belongs as a method, shouldn't both share the same signature and live in a parent class? (A sketch of the free-function option follows below.)
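Purely illustrative (neither the name neighbors_mode nor this exact signature exists in scikit-learn), the free-function option might look like:

import numpy as np
from scipy import stats
from sklearn.utils.extmath import weighted_mode

def neighbors_mode(pred_labels, weights=None):
    # One entry point for both classifiers: plain mode when no weights
    # are given, weighted_mode otherwise.
    if weights is None:
        return np.array([stats.mode(pl)[0] for pl in pred_labels])
    return np.array([weighted_mode(pl, w)[0]
                     for pl, w in zip(pred_labels, weights)])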

3 - In order to unify the call and simplify the code, weights could always be provided (at the expense of some computing time), and we could even add sparse support to sklearn.utils.extmath.weighted_mode:

>>> from scipy import sparse
>>> import numpy as np
>>> from sklearn.utils.extmath import weighted_mode
>>> from scipy.stats import mode

>>> x = np.random.random((1000,))
>>> y = np.ones((1000,))
>>> %timeit weighted_mode(x, y)
30.6 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit mode(x)
25.1 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> x = np.random.random((100000,))
>>> y = np.ones((100000,))
>>> %timeit weighted_mode(x, y)
30.1 s ± 2.22 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit mode(x)
15.5 s ± 90.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
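For what it's worth, passing unit weights is exactly the unified call from note 3, and it reduces to the plain mode (a quick sanity check of my own, not from the PR):

import numpy as np
from sklearn.utils.extmath import weighted_mode

x = np.array([1, 2, 2, 3, 3, 3])
vals, score = weighted_mode(x, np.ones_like(x))
# vals -> array([ 3.]), score -> array([ 3.]): same winner as scipy's mode(x)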

@massich force-pushed the 8057_glemaitre_feedback branch from c54274b to d467359 on September 14, 2017 at 11:56
@pep8speaks commented Sep 14, 2017

Hello @massich! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 14, 2017 at 12:23 UTC

@massich force-pushed the 8057_glemaitre_feedback branch from d467359 to a03237d on September 14, 2017 at 12:23
@jnothman commented

I've not understood your entire discussion (I'm in a hurry), but:

  • in kneighbors the data being operated on is rectangular. In radius_neighbors it is not.
  • if we keep things sparse in the radius_neighbors case when calculating the mode, it is operating on a CSR or CSC matrix where one dimension is 1.
  • if the data is binary and the matrix canonically has any zeroes eliminated, then the mode can be calculated from shape and nnz alone.
  • if zeros are not eliminated, you need bincount to calculate the mode (although if it's binary, you can also calculate it from data.mean(), but that's tricky and unreadable); a bincount sketch follows at the end of this comment

I don't mind if we explicitly calculate the sparse mode, but as long as avg_neighbours << n_classes, it's not helping much.
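A minimal sketch of that bincount route, assuming non-negative integer labels in a 1 x n CSR row (the helper name bincount_sparse_mode is mine, not scikit-learn API):

import numpy as np
from scipy import sparse

def bincount_sparse_mode(row):
    # Count each stored label; explicitly stored zeros land in counts[0].
    counts = np.bincount(row.data.astype(int), minlength=1)
    # Add the zeros that were never stored at all (scipy's nnz counts
    # stored entries, so this stays correct even with explicit zeros).
    counts[0] += row.shape[1] - row.nnz
    return counts.argmax()

row = sparse.csr_matrix(np.array([[0, 0, 2, 2, 2, 1, 2]]))
bincount_sparse_mode(row)   # -> 2: four 2s beat the two implicit zeros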

@jnothman commented Sep 14, 2017

In practice, we have little-to-no testing of scikit-learn with sparse matrices where zeros haven't been eliminated. But I try to keep it in mind anyway...
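For context on what 'not eliminated' means here, a small scipy.sparse illustration of my own:

import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[0, 1, 0, 2]]))
m.data[0] = 0        # turn a stored entry into an explicit zero
m.nnz                # -> 2: stored entries are counted, explicit zero included
m.eliminate_zeros()
m.nnz                # -> 1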

@massich force-pushed the 8057 branch 2 times, most recently from 2192eb7 to c4a47e3 on September 15, 2017 at 13:19