Add fp32 distance computation option in NN Descent #1415
base: main
Conversation
We might need to expose fp16 vs. fp32 as an option (defaulting to fp16) instead of deciding based on data dimensions. WIP: investigating.
```cpp
template <typename Index_t, typename Data_t, typename DistEpilogue_t>
__device__ __forceinline__ void calculate_metric(float* s_distances,
```
This is factored out of the original kernel into a separate function because it is shared by the fp32 and fp16 paths.
```cpp
// this is much faster than a warp-collaborative multiplication because MAX_NUM_BI_SAMPLES is
// fixed and small (64)
for (int i = threadIdx.x; i < MAX_NUM_BI_SAMPLES * SKEWED_MAX_NUM_BI_SAMPLES;
     i += blockDim.x) {
  int tmp_row = i / SKEWED_MAX_NUM_BI_SAMPLES;
  int tmp_col = i % SKEWED_MAX_NUM_BI_SAMPLES;
  if (tmp_row < list_new_size && tmp_col < list_new_size) {
    float acc = 0.0f;
    for (int d = 0; d < num_load_elems; d++) {
      acc += s_nv[tmp_row][d] * s_nv[tmp_col][d];
    }
    s_distances[i] += acc;
  }
}
```
This matmul part is different from the fp16 kernel
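For reference, the kernel loop above computes, for each (row, col) pair inside a fixed 64×64 tile, a dot product over the loaded element chunk, accumulated into a skew-padded distance buffer. A host-side C++ sketch of the same accumulation (illustrative only; the +1 skew padding and the flat buffer layout are assumptions about the kernel's shared-memory scheme, not taken from the PR):

```cpp
#include <cassert>
#include <vector>

// Host-side equivalent of the tile accumulation above: for each (row, col)
// pair inside the fixed 64x64 tile, accumulate the dot product of the loaded
// element chunk into the skew-padded distance buffer.
constexpr int MAX_NUM_BI_SAMPLES        = 64;
constexpr int SKEWED_MAX_NUM_BI_SAMPLES = MAX_NUM_BI_SAMPLES + 1;  // assumed +1 padding

void accumulate_tile(std::vector<float>& s_distances,              // size MAX * SKEWED
                     const std::vector<std::vector<float>>& s_nv,  // [list_new_size][num_load_elems]
                     int list_new_size,
                     int num_load_elems) {
  for (int row = 0; row < list_new_size; ++row) {
    for (int col = 0; col < list_new_size; ++col) {
      float acc = 0.0f;
      for (int d = 0; d < num_load_elems; ++d) {
        acc += s_nv[row][d] * s_nv[col][d];
      }
      s_distances[row * SKEWED_MAX_NUM_BI_SAMPLES + col] += acc;
    }
  }
}
```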
divyegala left a comment:
I would like to see a unit test, if possible, that demonstrably shows the distance-computation accuracy improvement of fp32 over fp16.
Thanks for the feedback @divyegala; changing this to target 26.02 for now, since Corey suggested we do further investigation. Force-pushing after rebasing.
Force-pushed from 72bab3e to 5e8600d
Closes #1370
Closes #195
This PR adds an option to use fp32 distance computation.
(Outdated) From heuristics, we chose dim = 16 as the threshold for dispatching to the fp32 distance kernel. We do the computation manually, but since we only target small dimensions, fp32 dispatching ends up slightly faster end to end, with much better recall for small dimensions.
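The (outdated) heuristic above can be sketched as a simple dispatch predicate. All names below are hypothetical; only the dim = 16 threshold comes from the description, and the PR ultimately exposes this as a user-facing option rather than a fixed heuristic:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical dispatch sketch of the (outdated) heuristic: small dimensions
// go to the fp32 distance kernel, larger ones stay on the fp16 kernel.
enum class DistPrecision { FP16, FP32 };

inline DistPrecision pick_distance_precision(std::size_t dim) {
  constexpr std::size_t kFp32DimThreshold = 16;  // threshold from the PR description
  return dim <= kFp32DimThreshold ? DistPrecision::FP32 : DistPrecision::FP16;
}
```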
All numbers below were run on an L40 GPU with an AMD EPYC CPU (128 cores). Perf and recall are averaged over 5 runs; all times are in seconds. The baseline knn graph is computed with sklearn.neighbors.NearestNeighbors (brute-force method).
Max iters=20
For larger dimensions there is an inherent issue with the NN Descent algorithm itself that makes the recall low. This can be improved slightly with more iterations.
Also notice that the end-to-end time is similar or slightly lower when using fp32.
Max iters=100
Notice how in the blue part the recall doesn't improve compared to the table above, even with more iterations (which is why we need the fp32 approach for this part).
Perf impact on different architectures
H100
L40