Adding HannoyTransformer#141
Draft
amalia-k510 wants to merge 2 commits intoscikit-learn-contrib:mainfrom
Draft
Conversation
for more information, see https://pre-commit.ci
Collaborator
Do you mean a hannoy issue (if so, link plz) or in your code?
All of them of course! Your code should not know which ones exist (except for the default): instead of accepting a string, just use the |
Comment on lines
+100
to
+102
| # distance correction | ||
| if self.metric == "euclidean": | ||
| np.sqrt(distances, out=distances) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR adds
HannoyTransformeras proposed in #123, wrapping hannoy (LMDB-backed storage) into the sklearn-ann transformer interface. The approach follows the same pattern asAnnoyTransformerand others.A few things to note:
hannoy's Python bindings don't expose
by_itemyet (it exists in the Rust code inreader.rsbutPyReaderonly hasby_vec). A PR upstream was created to add it. Until that is approved,fit_transformstores the training data and re-queries it usingby_vecinstead of the fasterby_itempath. I plan to switch toby_itemonce it's available.There's a known issue where multiple
Databaseinstances in the same process silently share the first LMDB environment.I also opened an issue on hannoy requesting batch insert/query APIs to avoid the per-vector Python to Rust loop overhead.
Two questions: which metrics do we want to support? Right now it's only
euclidean, but hannoy also hashamming,sqeuclidean,cosine, andmanhattan. Also hannoy offers binary quantized variants of consine, euclidean, and manhattan. Would we want to also use those?