
Conversation

@jinsolp jinsolp commented Oct 8, 2025

Closes #1370
Closes #195

This PR adds an option to use fp32 distance computation.

(Outdated) Based on heuristics, dim=16 was chosen as the threshold for dispatching to an fp32 distance kernel.
The fp32 path does manual computation, but since we only target small dimensions, dispatching to fp32 ends up slightly faster end to end, with much better recall for small dimensions.

All numbers below were gathered on an L40 GPU with a 128-core AMD EPYC CPU. Perf and recall are averaged over 5 runs, and all times are in seconds. The baseline knn graph is computed using sklearn.neighbors.NearestNeighbors with the brute force method.

Max iters=20

Screenshot 2025-10-08 at 10 56 17 AM

For larger dimensions there is an inherent issue with the NN Descent algorithm itself that keeps recall low; this can be improved slightly with more iterations.
Also notice that the end-to-end time is similar or slightly lower when using fp32.

Max iters=100

Screenshot 2025-10-08 at 10 58 26 AM

Notice how in the blue part the recall doesn't improve over the table above even with more iterations (i.e., this is why we need the fp32 approach for these cases).

Perf impact on different architectures

H100

Screenshot 2025-11-20 at 10 15 25 AM

L40

Screenshot 2025-11-20 at 10 15 50 AM

@jinsolp jinsolp self-assigned this Oct 8, 2025
@jinsolp jinsolp requested a review from a team as a code owner October 8, 2025 17:59
@jinsolp jinsolp added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Oct 8, 2025
@jinsolp jinsolp changed the base branch from main to release/25.12 November 17, 2025 17:04

jinsolp commented Nov 18, 2025

We might need to expose fp16 vs. fp32 as an option (defaulting to fp16) instead of deciding based on data dimensions. WIP, investigating.

@jinsolp jinsolp requested a review from a team as a code owner November 19, 2025 02:42
@jinsolp jinsolp changed the title Dispatch to use fp32 distance computation in NN Descent depending on data dimensions Add fp32 distance computation option in NN Descent Nov 19, 2025
}

template <typename Index_t, typename Data_t, typename DistEpilogue_t>
__device__ __forceinline__ void calculate_metric(float* s_distances,

This is factored out of the original kernel into a separate function because it is shared by the fp32 and fp16 paths.

Comment on lines +659 to +672
// this is much faster than a warp-collaborative multiplication because MAX_NUM_BI_SAMPLES is
// fixed and small (64)
for (int i = threadIdx.x; i < MAX_NUM_BI_SAMPLES * SKEWED_MAX_NUM_BI_SAMPLES; i += blockDim.x) {
  int tmp_row = i / SKEWED_MAX_NUM_BI_SAMPLES;
  int tmp_col = i % SKEWED_MAX_NUM_BI_SAMPLES;
  if (tmp_row < list_new_size && tmp_col < list_new_size) {
    float acc = 0.0f;
    for (int d = 0; d < num_load_elems; d++) {
      acc += s_nv[tmp_row][d] * s_nv[tmp_col][d];
    }
    s_distances[i] += acc;
  }
}

This matmul part is different from the fp16 kernel

@jinsolp jinsolp requested a review from divyegala November 20, 2025 01:07

@divyegala divyegala left a comment


I would like to see a unit test, if possible, where you can demonstrably prove distance computation improvements in fp32 over fp16.


jinsolp commented Nov 21, 2025

Thanks for the feedback @divyegala. Changing this to target 26.02 for now, since Corey suggested we do further investigation. Force-pushing after rebasing.

@jinsolp jinsolp changed the base branch from release/25.12 to main November 21, 2025 00:52
@jinsolp jinsolp requested a review from a team as a code owner November 21, 2025 00:52
@jinsolp jinsolp force-pushed the fix-nnd-recall-fp32 branch from 72bab3e to 5e8600d Compare November 21, 2025 02:23

Development

Successfully merging this pull request may close these issues.

[BUG] cuVS Nearest Neighbors recall lower than expected for some datasets
[FEA] Calculating distances with FP32 in NN Descent
