Add a way to obtain internal statistics and parameters of an HNSW index #594

Open
wants to merge 1 commit into
base: master

@mbautin mbautin commented Oct 17, 2024

This allows us to diagnose potential configuration issues and compare the statistics of the internal structure of the index against other HNSW implementations.

Also allow specifying ef as a constructor parameter of HierarchicalNSW.
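
To illustrate the kind of per-level statistics this PR exposes, here is a minimal self-contained sketch. The `Node`, `LevelStats`, and `ComputeLevelStats` names are hypothetical and are not part of hnswlib; the graph is modeled as plain adjacency lists, and an edge is counted once per outgoing link, which is an assumption about how the edge totals are defined.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one vertex of a layered HNSW-style graph.
// neighbors[l] holds the outgoing links of this vertex on level l,
// for l = 0 .. (top level of this vertex).
struct Node {
    std::vector<std::vector<std::size_t>> neighbors;
};

// Per-level counters, mirroring the "nodes, edges, average edges per node"
// lines printed by the statistics output.
struct LevelStats {
    std::size_t nodes = 0;
    std::size_t edges = 0;  // Assumption: counted once per outgoing link.

    double AvgEdgesPerNode() const {
        return nodes == 0 ? 0.0 : static_cast<double>(edges) / nodes;
    }
};

// Walks every vertex once and accumulates counts for each level it is on.
std::vector<LevelStats> ComputeLevelStats(const std::vector<Node>& graph) {
    std::vector<LevelStats> stats;
    for (const Node& node : graph) {
        if (node.neighbors.size() > stats.size()) {
            stats.resize(node.neighbors.size());
        }
        for (std::size_t level = 0; level < node.neighbors.size(); ++level) {
            ++stats[level].nodes;
            stats[level].edges += node.neighbors[level].size();
        }
    }
    return stats;
}
```

Summing the `nodes` and `edges` fields across all levels would give totals of the same shape as the "Totals" line in the tool output.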

mbautin added a commit to yugabyte/yugabyte-db that referenced this pull request Oct 22, 2024
Summary:
Fixing some inconsistencies in index parameters that are causing a discrepancy between Usearch and Hnswlib performance:
- Correctly specifying connectivity for hnswlib as num_neighbors_per_vertex instead of max_neighbors_per_vertex.
- Passing the ef option into hnswlib configuration.

Adding internal statistics introspection to Usearch and Hnswlib index wrappers.

PR for hnswlib changes: nmslib/hnswlib#594.
PR for usearch changes: unum-cloud/usearch#508

Also allow passing multiple values of k as input, as long as none of them is greater than the size of the precomputed ground truth result list.

Updating hnsw_tool to always convert uint8_t coordinates to float32 when using Hnswlib, for a fair comparison with Usearch on the SIFT1B dataset, since Usearch does not currently support the uint8_t type natively.
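
A minimal sketch of the uint8_t to float32 conversion described above. The `ToFloat32` helper is hypothetical and is not taken from hnsw_tool; it assumes each coordinate is widened as-is, with no scaling or normalization.

```cpp
#include <cstdint>
#include <vector>

// Widens each uint8_t coordinate to a float with the same numeric value.
// Every value in 0..255 is exactly representable as a float, so distances
// computed on the converted vectors match those on the original data.
std::vector<float> ToFloat32(const std::vector<std::uint8_t>& coords) {
    return std::vector<float>(coords.begin(), coords.end());
}
```

The trade-off is memory: each coordinate takes four bytes instead of one after conversion.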

The changes to src/inline-thirdparty will be pushed as separate commits generated by `build-support/thirdparty_tool --sync-inline-thirdparty`.

Test Plan:
Jenkins

Manual testing using hnsw_tool

- hnswlib: https://gist.githubusercontent.com/mbautin/d21580dcac0b51ad2d7bc9fc130c5f9e/raw

```
    Hnswlib index with 5 levels
    max_elements: 1000000
    M: 16
    maxM: 16
    maxM0: 32
    ef_construction: 128
    ef: 10
    mult: 0.360674
    Level 0: 1000000 nodes, 21613828 edges, 21.61 average edges per node
    Level 1: 62323 nodes, 885027 edges, 14.20 average edges per node
    Level 2: 3855 nodes, 50515 edges, 13.10 average edges per node
    Level 3: 238 nodes, 2543 edges, 10.68 average edges per node
    Level 4: 17 nodes, 244 edges, 14.35 average edges per node
    Totals: 1066433 nodes, 22552157 edges, 21.15 average edges per node

    i-recall @ 50, i=1..10:

    1-recall @ 50: 0.9695000052
    2-recall @ 50: 0.9645000100
    3-recall @ 50: 0.9604333043
    4-recall @ 50: 0.9568499923
    5-recall @ 50: 0.9541400075
    6-recall @ 50: 0.9504333138
    7-recall @ 50: 0.9467428327
    8-recall @ 50: 0.9435999990
    9-recall @ 50: 0.9406333566
    10-recall @ 50: 0.9377999902
```
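
The `mult` value and the level sizes above can be sanity-checked with a short calculation. The sketch below assumes the usual HNSW level assignment, where a node's level is drawn as floor(-ln(u) * mult) with u uniform in (0, 1) and mult = 1 / ln(M), so the expected share of nodes reaching level l is M^(-l). The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Level multiplier for connectivity M: 1 / ln(M).
double LevelMultiplier(std::size_t M) {
    assert(M > 1);
    return 1.0 / std::log(static_cast<double>(M));
}

// Expected number of nodes present on each of the first num_levels levels,
// given n inserted nodes. Level l is expected to hold n / M^l nodes.
std::vector<double> ExpectedNodesPerLevel(std::size_t n, std::size_t M,
                                          std::size_t num_levels) {
    assert(M > 1);
    std::vector<double> expected;
    double count = static_cast<double>(n);
    for (std::size_t level = 0; level < num_levels; ++level) {
        expected.push_back(count);
        count /= static_cast<double>(M);
    }
    return expected;
}
```

For M = 16, `LevelMultiplier` gives about 0.360674, matching the reported `mult`. With n = 1000000, the expected counts for levels 1 to 4 are 62500, 3906.25, about 244.1, and about 15.3, which is close to the 62323, 3855, 238, and 17 nodes reported above.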

- usearch: https://gist.githubusercontent.com/mbautin/74948b310780562e74831eb29e43cb13/raw

```
    Usearch index with 4 levels
    connectivity: 16
    connectivity_base: 32
    expansion_add: 128
    expansion_search: 10
    inverse_log_connectivity: 0.360674
    Level 0: 1000000 nodes, 20973352 edges, 20.97 average edges per node
    Level 1: 64036 nodes, 890428 edges, 13.91 average edges per node
    Level 2: 5090 nodes, 66295 edges, 13.02 average edges per node
    Level 3: 481 nodes, 5304 edges, 11.03 average edges per node
    Totals: 1069607 nodes, 21935379 edges, 20.51 average edges per node

    i-recall@50, i=1..10:

    1-recall @ 40: 0.9305999875
    2-recall @ 40: 0.9201999903
    3-recall @ 40: 0.9141333103
    4-recall @ 40: 0.9085000157
    5-recall @ 40: 0.9036399722
    6-recall @ 40: 0.8987166882
    7-recall @ 40: 0.8932142854
    8-recall @ 40: 0.8890249729
    9-recall @ 40: 0.8852999806
    10-recall @ 40: 0.8813199997
```

Reviewers: sergei, aleksandr.ponomarenko

Reviewed By: sergei, aleksandr.ponomarenko

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38977