
Adding HannoyTransformer #141

Draft
amalia-k510 wants to merge 2 commits into scikit-learn-contrib:main from amalia-k510:hannoy-implementation

Conversation

amalia-k510 commented Apr 30, 2026

This PR adds HannoyTransformer as proposed in #123, wrapping hannoy (LMDB-backed storage) in the sklearn-ann transformer interface. The approach follows the same pattern as AnnoyTransformer and the other transformers.

A few things to note:

  • hannoy's Python bindings don't expose by_item yet: it exists in the Rust code in reader.rs, but PyReader only has by_vec. An upstream PR has been opened to add it. Until that is approved, fit_transform stores the training data and re-queries it with by_vec instead of using the faster by_item path. I plan to switch to by_item once it's available.

  • There's a known issue where multiple Database instances in the same process silently share the first LMDB environment.

  • I also opened an issue on hannoy requesting batch insert/query APIs to avoid the overhead of a per-vector Python-to-Rust loop.
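
The by_vec fallback described in the first bullet can be sketched roughly as follows. This is a minimal illustration and not the PR's code: BruteForceReader is a hypothetical pure-Python stand-in for the real reader, the by_vec(vec, n) signature and its (item_id, distance) return shape are assumptions, and the extra neighbor requested per row (so that each training point can find itself) is likewise an assumption about how the transformer is meant to behave.

```python
import math


class BruteForceReader:
    """Hypothetical stand-in for a reader that only answers by-vector queries."""

    def __init__(self, vectors):
        self._vectors = [list(v) for v in vectors]

    def by_vec(self, vec, n):
        # Assumed signature: return the n nearest (item_id, distance) pairs.
        scored = [(i, math.dist(vec, v)) for i, v in enumerate(self._vectors)]
        scored.sort(key=lambda pair: pair[1])
        return scored[:n]


def neighbors_of_training_data(X, n_neighbors):
    """Re-query every stored training row by vector (the by_vec fallback)."""
    reader = BruteForceReader(X)
    k = n_neighbors + 1  # assumption: each training row also finds itself
    # One Python-level call per row: the loop overhead a batch API would avoid.
    return [reader.by_vec(row, k) for row in X]


X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
neighbors = neighbors_of_training_data(X_train, n_neighbors=2)
```

With a by_item-style lookup, the loop could pass stored item ids instead of the vectors themselves, so the training data would not need to be kept around for re-querying.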

Two questions: which metrics do we want to support? Right now it's only euclidean, but hannoy also has hamming, sqeuclidean, cosine, and manhattan. In addition, hannoy offers binary-quantized variants of cosine, euclidean, and manhattan. Would we want to support those as well?

flying-sheep (Collaborator) commented Apr 30, 2026

> There's a known issue where multiple Database instances in the same process silently share the first LMDB environment.

Do you mean a hannoy issue (if so, link plz) or in your code?

> which metrics do we want to support?

All of them, of course! Your code should not know which ones exist (except for the default): instead of accepting a string, just use the Metric enum as a type and pass that through to the upstream API.
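
A rough sketch of that pass-through idea, under stated assumptions: the Metric enum below is a hypothetical stand-in with illustrative member names (the real upstream enum may be spelled differently), and FakeDatabase is an invented placeholder for whatever upstream object receives the metric. The point is only that the wrapper stores and forwards the enum value without branching on it.

```python
from enum import Enum


class Metric(Enum):
    """Hypothetical stand-in for the upstream Metric enum (names illustrative)."""

    EUCLIDEAN = "euclidean"
    COSINE = "cosine"
    MANHATTAN = "manhattan"
    HAMMING = "hamming"


class FakeDatabase:
    """Hypothetical upstream object that records the metric it was given."""

    def __init__(self, metric):
        self.metric = metric


class PassThroughTransformer:
    """Sketch of a wrapper that knows only the default metric, nothing else."""

    def __init__(self, n_neighbors=5, *, metric=Metric.EUCLIDEAN):
        self.n_neighbors = n_neighbors
        self.metric = metric  # stored untouched: no string parsing, no lookup

    def fit(self, X):
        # Forward the enum member as-is; new upstream metrics need no changes here.
        self.db_ = FakeDatabase(self.metric)
        return self


default_model = PassThroughTransformer().fit([])
cosine_model = PassThroughTransformer(metric=Metric.COSINE).fit([])
```

Because the wrapper never compares the metric against a list of names, a metric added upstream later would work without touching the transformer.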

Comment on lines +100 to +102
# distance correction
if self.metric == "euclidean":
    np.sqrt(distances, out=distances)

why is that?
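
The thread does not say why the correction is there. One possible reading, offered purely as an assumption, is that the index reports squared Euclidean distances for the euclidean metric, in which case a square root would be needed to recover true distances. The sketch below shows what the correction would do under that assumption; correct_distances is a hypothetical helper, not code from the PR.

```python
import math

# Assumption: the upstream index reports squared Euclidean distances.
a, b = (0.0, 0.0), (3.0, 4.0)
squared = sum((x - y) ** 2 for x, y in zip(a, b))  # 3^2 + 4^2


def correct_distances(distances, metric):
    """Hypothetical helper: undo the squaring for the euclidean metric only."""
    if metric == "euclidean":
        return [math.sqrt(d) for d in distances]
    return list(distances)


corrected = correct_distances([squared], "euclidean")
untouched = correct_distances([squared], "sqeuclidean")
```

Note that a string comparison like this one would stop matching if the metric were passed as an enum member, as suggested earlier in the review.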


2 participants